# Assignment 4: Recurrent Neural Network Language Model

This is the "working notebook", with skeleton code to load and train your model, as well as run unit tests. See [rnnlm-instructions.ipynb](rnnlm-instructions.ipynb) for the main writeup.

Run the cell below to import packages.

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
assert(tf.__version__.startswith("1."))

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz

# Your code
import rnnlm; reload(rnnlm)
import rnnlm_test; reload(rnnlm_test)



<module 'rnnlm_test' from '/home/yeunghoman/w266/assignment/a4/rnnlm_test.py'>

## (a) RNNLM Inputs and Parameters

### Answers for Part (a)
You can use LaTeX to typeset math, e.g. `$ f(x) = x^2 $` will render as $ f(x) = x^2 $.

1. Assuming that `CellFunc` does not include the `Affine Layer After RNN Output` (so `W_out` and `b_out` are not counted as part of the cell), the cell equation is
$h^{(i)} = \tanh(\mathrm{concat}(h^{(i-1)}, x^{(i)}) \, W_{cell} + B_{cell})$. `V` does not matter here. Since $h^{(i-1)}$ and $x^{(i)}$ both have size $H$, $W_{cell}$ has shape `(2H by H)` and $2H^2$ parameters, and $B_{cell}$ has $H$ parameters, for a total of $2H^2 + H$.

2. $W_{in}$ of embedding layer has shape `(V by H)`, total $V*H$ parameters. $W_{out}$ of output layer has shape `(H by V)` and $B_{out}$ has shape `(V,)`, total $H*V + V$ parameters.


3.

Assume that the embedding lookup does not count as a mathematical `FLOP`; for each single input word the lookup takes `O(1)`.


#### Within the RNN cell: 

`sum(FLOP) in MatMul = [2H + (2H-1)]*H  ---- this is O(H^2)`

`sum(FLOP) to add Bias term = H  ---- this is O(H)`

`sum(FLOP) to calculate tanh = H * 5  ---- this is O(H)`

`----- Overall, the cell func is dominated by O(H^2)`


#### After RNN output, we use this output to get V logits:

`sum(FLOP) in MatMul = [H + (H-1)]*V  ---- this is O(HV)`

`sum(FLOP) to add Bias Term = V  ---- this is O(V)`

`----- Overall, this part is dominated by O(HV)`


#### Calculate softmax :

`sum(FLOP) to exponentiate every term = V ---- this is O(V)`

`sum(FLOP) to calculate denominator = V-1 ---- this is O(V)`

`sum(FLOP) to get softmax probability for a specific word = 1 ---- this is O(1)`

`----- Overall, this part is dominated by O(V)`


#### sum(FLOP) for a Single Target Word in a specific time step 

`= [2H + (2H-1)]*H + 6H + [H + (H-1)]*V + 2V + V - 1 + 1`

`= 4H^2 + 5H + 2HV + 2V`

`--ANSWER-- Overall, this is dominated by O(H^2 + HV)`


#### sum(FLOP) for all Target Words in a specific time step

Every FLOP up to the softmax denominator is shared, so we simply add V-1 more divisions for the probabilities.

`= 4H^2 + 5H + 2HV + 3V - 1`

`--ANSWER-- Again, this is dominated by O(H^2 + HV)`
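
The full-softmax counts above can be checked numerically. The following is a minimal sketch under the same assumptions as the answer (embedding size equal to `H`, 5 FLOPs per `tanh`, one multiply or add counted as one FLOP); the function names are hypothetical and `H = 200`, `V = 10000` are just example values.

```python
def cell_flops(H):
    """FLOPs for one RNN cell step, assuming the embedding size equals H."""
    matmul = (2 * H + (2 * H - 1)) * H  # 2H multiplies and 2H-1 adds per output unit
    bias = H                            # one add per output unit
    tanh = 5 * H                        # assumed cost of 5 FLOPs per tanh
    return matmul + bias + tanh


def full_softmax_flops(H, V, all_targets=False):
    """FLOPs for one time step with a full softmax over V words."""
    logits = (H + (H - 1)) * V + V      # output matmul plus bias
    exps = V                            # exponentiate every logit
    denom = V - 1                       # sum the V exponentials
    divisions = V if all_targets else 1 # one division per requested probability
    return cell_flops(H) + logits + exps + denom + divisions


H, V = 200, 10000
single = full_softmax_flops(H, V)
everything = full_softmax_flops(H, V, all_targets=True)
print("single target:", single)
print("all targets:  ", everything)
```

Both totals agree with the closed forms `4H^2 + 5H + 2HV + 2V` and `4H^2 + 5H + 2HV + 3V - 1`.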

4.

### Sampled Softmax: 

`Basic idea is that we substitute (k+1) for V in everything after the RNN output. Assume the sampling algorithm uses the same k samples for every operation in the same batch.`

#### After the RNN output, sampled softmax only calculates logits for the k samples plus the target (k+1 in total).

`sum(FLOP) in MatMul = [H + (H-1)]*(k+1)  ---- this is O(Hk)`

`sum(FLOP) to add Bias Term = k+1  ---- this is O(k)`

`----- Overall, this part is dominated by O(Hk)`

#### Calculate softmax :

`sum(FLOP) to exponentiate every term = k+1 ---- this is O(k)`

`sum(FLOP) to calculate denominator = k ---- this is O(k)`

`sum(FLOP) to get softmax probability for a specific word = 1 ---- this is O(1)`

`----- Overall, this part is dominated by O(k)`

#### A Single Target Word 

`= [2H + (2H-1)]*H + 6H + [H + (H-1)]*(k+1) + 2(k+1) + k + 1`

`= 4H^2 + 5H + 2H(k+1) + 2(k+1)`

`--ANSWER-- Overall, this is dominated by O(H^2 + Hk)`

#### All Target Words

We use the same k samples and simply add V-1 more divisions for the probabilities. There may be a slight complication/option in that the denominator needs to be updated with the current target logit. Every FLOP up to the softmax denominator is shared.

`= [2H + (2H-1)]*H + 6H + [H + (H-1)]*(k+1) + 2(k+1) + (k - 1 + V) + V`

`= 4H^2 + 5H + 2H(k+1) + 2k + 2V `

`--ANSWER-- Overall, this is dominated by O(H^2 + Hk + V)`



### Hierarchical softmax:

`Basic idea is that we take the RNN output O and make at most log(V) tree splits until we reach the bottom.`

`sum(FLOP) for each split = [H + (H-1)]*1 + 1  ---- this is O(H)`

`sum(FLOP) for all splits down the tree = log(V)*(2H) ---- this is O(H*log(V))`

#### A Single Target Word 

`= [2H + (2H-1)]*H + 6H + log(V)*(2H)`

`= 4H^2 + 5H + log(V)*(2H)`

`--ANSWER-- Overall, this is dominated by O(H^2 + H*log(V))`

#### All Target Words. We walk down the tree V times.

`= 4H^2 + 5H + log(V)*(2H)*V`

`--ANSWER-- Overall, this is dominated by O(H^2 + V*H*log(V))`
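
To compare the single-target costs of the two approximations, here is a small sketch that evaluates the formulas above. It assumes a balanced binary tree, so `log(V)` is taken as `ceil(log2(V))`; that base, the function names, and the example values `H = 200`, `k = 100`, `V = 10000` are assumptions made for illustration.

```python
import math


def cell_flops(H):
    """FLOPs for one RNN cell step: 4H^2 + 5H, as derived above."""
    return 4 * H**2 + 5 * H


def sampled_softmax_flops(H, k):
    """Single-target FLOPs when only k samples plus the target are scored."""
    n = k + 1
    logits = (H + (H - 1)) * n + n  # output matmul plus bias for k+1 words
    exps = n                        # exponentiate k+1 logits
    denom = n - 1                   # k adds for the denominator
    return cell_flops(H) + logits + exps + denom + 1  # +1 for the final division


def hierarchical_softmax_flops(H, V):
    """Single-target FLOPs for a walk down an assumed balanced binary tree."""
    depth = math.ceil(math.log2(V))  # number of splits on the way down
    per_split = (H + (H - 1)) + 1    # one dot product plus one extra FLOP
    return cell_flops(H) + depth * per_split


H, k, V = 200, 100, 10000
sampled = sampled_softmax_flops(H, k)
hierarchical = hierarchical_softmax_flops(H, V)
print("sampled softmax:     ", sampled)
print("hierarchical softmax:", hierarchical)
```

For these example values the recurrent cell term `4H^2 + 5H` dominates both totals, consistent with the `O(H^2 + Hk)` and `O(H^2 + H*log(V))` answers.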

5.

`In training, we propagate both forward and backward. The recurrent layer requires the largest number of forward FLOPs and the largest number of backward parameter updates.`

#### Forward:

`Embedding Layer Look up is O(1)`

`Recurrent Layer sum(FLOP) is O(H^2), H^2 = 200*200`

`Output Layer sum(FLOP) is O(Hk), Hk = 200*100`

#### Backward:

`Embedding Layer updates (only one input word vector) = 200`

`Recurrent Layer updates = 400*200 + 200`

`Output Layer updates (we substitute V by k=100) = 200*100 + 100`
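
The backward counts can be reproduced with a few lines of Python. This is only a sketch under the same assumptions as the answer (`H = 200`, `k = 100`, embedding size equal to `H`, one input word per step, a single recurrent layer); the function names are hypothetical.

```python
def layer_param_counts(V, H):
    """Total parameters per layer, following parts 1 and 2 above."""
    return {
        "embedding": V * H,          # W_in: V by H
        "recurrent": 2 * H * H + H,  # W_cell: 2H by H, plus B_cell
        "output": H * V + V,         # W_out: H by V, plus B_out
    }


def params_updated_per_word(H, k):
    """Parameters that receive an update for one input word with sampled softmax."""
    return {
        "embedding": H,              # only the looked-up word vector
        "recurrent": 2 * H * H + H,  # the whole cell
        "output": H * k + k,         # V replaced by k, as in the answer
    }


updated = params_updated_per_word(H=200, k=100)
largest = max(updated, key=updated.get)
for name, count in updated.items():
    print("{:10s} {:d}".format(name, count))
print("largest:", largest)
```

With these values the recurrent layer has the most updates per word, even though `layer_param_counts` shows that the embedding and output layers hold far more parameters in total once `V` is large.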

## (b) Implementing the RNNLM

In order to better manage the model parameters, we'll implement our RNNLM in the `RNNLM` class in `rnnlm.py`. We've given you a skeleton of starter code for this, but the bulk of the implementation is left to you.

In [2]:
reload(rnnlm)

TF_GRAPHDIR = "/tmp/w266/a4_graph"

# Clear old log directory.
shutil.rmtree(TF_GRAPHDIR, ignore_errors=True)

lm = rnnlm.RNNLM(V=10000, H=200, num_layers=2)
lm.BuildCoreGraph()
lm.BuildTrainGraph()
lm.BuildSamplerGraph()

summary_writer = tf.summary.FileWriter(TF_GRAPHDIR, lm.graph)

The code above will load your implementation, construct the graph, and write a logdir for TensorBoard. You can bring up TensorBoard with:
```
cd assignment/a4
tensorboard --logdir /tmp/w266/a4_graph --port 6006
```
As usual, check http://localhost:6006/ and visit the "Graphs" tab to inspect your implementation. Remember, judicious use of `tf.name_scope()` and/or `tf.variable_scope()` will greatly improve the visualization, and make code easier to debug.

We've provided a few unit tests below to verify some *very* basic properties of your model.

In [3]:
reload(rnnlm); reload(rnnlm_test)
utils.run_tests(rnnlm_test, ["TestRNNLMCore", "TestRNNLMTrain", "TestRNNLMSampler"])

test_shapes_embed (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_output (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_recurrent (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_train (rnnlm_test.TestRNNLMTrain) ... ok
test_shapes_sample (rnnlm_test.TestRNNLMSampler) ... ok

----------------------------------------------------------------------
Ran 5 tests in 1.469s

OK


Note that the error messages are intentionally somewhat spare, and that passing tests are no guarantee of model correctness! Your best chance of success is through careful coding and understanding of how the model works.

## (c) Training your RNNLM (5 points)

We'll give you data loader functions in **`utils.py`**. They work similarly to the loaders in the Week 5 notebook.

Particularly, `utils.rnnlm_batch_generator` will return an iterator that yields minibatches in the correct format. Batches will be of size `[batch_size, max_time]`, and consecutive batches will line up along rows so that the final state $h^{\text{final}}$ of one batch can be used as the initial state $h^{\text{init}}$ for the next.

For example, using a toy corpus:  
*(Ignore the ugly formatter code.)*

In [57]:
toy_corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>"
toy_corpus = np.array(toy_corpus.split())

html = "<h3>Input words w:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["w_%d" % d for d in range(w.shape[1])]  # w_0, w_1, ...: time steps within a batch
    html += "<td>{:s}</td>".format(utils.render_matrix(w, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

html = "<h3>Target words y:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["y_%d" % d for d in range(y.shape[1])]
    html += "<td>{:s}</td>".format(utils.render_matrix(y, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

**Input words w:**

Batch 0:

| | w_0 | w_1 | w_2 | w_3 |
|---|---|---|---|---|
| 0 | `<s>` | Mary | had | a |
| 1 | `<s>` | The | lamb | was |

Batch 1:

| | w_0 | w_1 | w_2 |
|---|---|---|---|
| 0 | little | lamb | . |
| 1 | white | as | snow |


**Target words y:**

Batch 0:

| | y_0 | y_1 | y_2 | y_3 |
|---|---|---|---|---|
| 0 | Mary | had | a | little |
| 1 | The | lamb | was | white |

Batch 1:

| | y_0 | y_1 | y_2 |
|---|---|---|---|
| 0 | lamb | . | `<s>` |
| 1 | as | snow | . |


Note that the data we feed to our model will be word indices, but the shape will be the same.
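
To make the batch layout concrete, here is a minimal pure-Python sketch of a generator that produces this layout for the toy corpus. It is not the code in `utils.py`; `batch_generator_sketch` is a hypothetical name, and the real `utils.rnnlm_batch_generator` may differ in details such as how it clips the corpus or the array type it returns.

```python
def batch_generator_sketch(tokens, batch_size, max_time):
    """Yield (w, y) batches so that row r of one batch continues row r of the previous one."""
    # Targets are the inputs shifted left by one token.
    inputs, targets = tokens[:-1], tokens[1:]
    # Split the corpus into batch_size contiguous rows, dropping any remainder.
    row_len = len(inputs) // batch_size
    w_rows = [inputs[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    y_rows = [targets[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    # Cut every row into windows of at most max_time steps.
    for start in range(0, row_len, max_time):
        w = [row[start:start + max_time] for row in w_rows]
        y = [row[start:start + max_time] for row in y_rows]
        yield w, y


toy_tokens = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>".split()
batches = list(batch_generator_sketch(toy_tokens, batch_size=2, max_time=4))
for i, (w, y) in enumerate(batches):
    print("Batch", i)
    print("  w:", w)
    print("  y:", y)
```

Because each row is a contiguous slice of the corpus, the last token of row 0 in Batch 0 (`a`) is immediately followed by the first token of row 0 in Batch 1 (`little`), which is what lets the final state of one batch serve as the initial state of the next.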

### 1. Implement the `run_epoch` function
We've given you some starter code for logging progress; fill this in with actual call(s) to `session.run` with the appropriate arguments to run a training step. 

Be sure to handle the initial state properly at the beginning of an epoch, and remember to carry over the final state from each batch and use it as the initial state for the next.

**Note:** we provide a `train=True` flag to enable train mode. If `train=False`, this function can also be used for scoring the dataset - see `score_dataset()` below.
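
To see why the state must be carried over, consider a toy scalar recurrence. The sketch below is an illustration only: `run_rnn` and its fixed weights are hypothetical and unrelated to the `RNNLM` class. It shows that processing a sequence in chunks gives the same final state as processing it in one pass only if each chunk starts from the previous chunk's final state.

```python
import math


def run_rnn(xs, h0, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t). Returns the final state."""
    h = h0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
    return h


sequence = [0.1, -0.3, 0.7, 0.2, -0.5, 0.4]
chunk_size = 2

# Reference: one pass over the whole sequence.
full = run_rnn(sequence, h0=0.0)

# Chunked, carrying the final state of each chunk into the next.
carried = 0.0
for start in range(0, len(sequence), chunk_size):
    carried = run_rnn(sequence[start:start + chunk_size], h0=carried)

# Chunked, but resetting the state at every chunk.
reset = 0.0
for start in range(0, len(sequence), chunk_size):
    reset = run_rnn(sequence[start:start + chunk_size], h0=0.0)

print("full pass:     ", full)
print("carried state: ", carried)
print("reset state:   ", reset)
```

The carried version reproduces the full pass, while the reset version forgets everything before the last chunk.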

In [4]:
def run_epoch(lm, session, batch_iterator,
              train=False, verbose=False,
              tick_s=10, learning_rate=0.01):
    start_time = time.time()
    tick_time = start_time  # for showing status
    total_cost = 0.0  # total cost, summed over all words
    total_batches = 0
    total_words = 0

    if train:
        train_op = lm.train_step_
        use_dropout = True
        loss = lm.train_loss_
    else:
        train_op = tf.no_op()
        use_dropout = False  # no dropout at test time
        loss = lm.loss_  # true loss, if train_loss is an approximation

    for i, (w, y) in enumerate(batch_iterator):
        cost = 0.0
        # At first batch in epoch, get a clean initial state.
        if i == 0:
            h = session.run(lm.initial_h_, {lm.input_w_: w})

        #### YOUR CODE HERE ####
        feed_dict = {lm.use_dropout_:use_dropout,
                     lm.input_w_:w,
                     lm.target_y_:y,
                     lm.learning_rate_:learning_rate,
                     lm.initial_h_:h}
        
        cost, h, _ = session.run([loss, lm.final_h_ ,train_op], feed_dict=feed_dict)
        #### END(YOUR CODE) ####
        total_cost += cost
        total_batches = i + 1
        total_words += w.size  # w.size = batch_size * max_time

        ##
        # Print average loss-so-far for epoch
        # If using train_loss_, this may be an underestimate.
        if verbose and (time.time() - tick_time >= tick_s):
            avg_cost = total_cost / total_batches
            avg_wps = total_words / (time.time() - start_time)
            print("[batch {:d}]: seen {:d} words at {:.1f} wps, loss = {:.3f}".format(
                i, total_words, avg_wps, avg_cost))
            tick_time = time.time()  # reset time ticker

    return total_cost / total_batches

In [5]:
def score_dataset(lm, session, ids, name="Data"):
    # For scoring, we can use larger batches to speed things up.
    bi = utils.rnnlm_batch_generator(ids, batch_size=100, max_time=100)
    cost = run_epoch(lm, session, bi, 
                     learning_rate=0.01, train=False, 
                     verbose=False, tick_s=3600)
    print("{:s}: avg. loss: {:.03f}  (perplexity: {:.02f})".format(name, cost, np.exp(cost)))
    return cost
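
The perplexity printed by `score_dataset` is the exponential of the average per-word loss. As a rough sanity check, the sketch below (with hypothetical helper names, and assuming the loss is the mean negative log-probability in nats) shows that a model that is uniform over `V` words has perplexity `V`.

```python
import math


def average_loss(token_log_probs):
    """Mean negative log-probability (in nats) over the scored tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(avg_loss):
    """Perplexity is the exponential of the average per-token loss."""
    return math.exp(avg_loss)


V = 10000
# A uniform model assigns log(1/V) to every token, whatever the text.
uniform_log_probs = [math.log(1.0 / V)] * 5
uniform_ppl = perplexity(average_loss(uniform_log_probs))
print("uniform model perplexity:", uniform_ppl)

# An average loss of 4.847 nats corresponds to a perplexity of roughly 127.
print("perplexity at loss 4.847:", perplexity(4.847))
```

A trained model should therefore score well below `V`; a perplexity near the vocabulary size suggests that the model has learned nothing.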

In [7]:
reload(rnnlm); reload(rnnlm_test)
th = rnnlm_test.RunEpochTester("test_toy_model")
th.setUp(); th.injectCode(run_epoch, score_dataset)
unittest.TextTestRunner(verbosity=2).run(th)

test_toy_model (rnnlm_test.RunEpochTester) ... 

[batch 9]: seen 500 words at 498.5 wps, loss = 1.643
[batch 115]: seen 5800 words at 2883.5 wps, loss = 0.892
[batch 225]: seen 11300 words at 3747.7 wps, loss = 1.021
[batch 335]: seen 16800 words at 4179.6 wps, loss = 1.060
[batch 445]: seen 22300 words at 4440.4 wps, loss = 1.024
[batch 556]: seen 27850 words at 4620.4 wps, loss = 0.948
[batch 665]: seen 33300 words at 4731.6 wps, loss = 0.918
[batch 775]: seen 38800 words at 4822.6 wps, loss = 0.895
[batch 885]: seen 44300 words at 4894.0 wps, loss = 0.862
[batch 995]: seen 49800 words at 4951.9 wps, loss = 0.849
[batch 1104]: seen 55250 words at 4993.8 wps, loss = 0.852
[batch 1217]: seen 60900 words at 5044.7 wps, loss = 0.838
[batch 1332]: seen 66650 words at 5098.4 wps, loss = 0.827
[batch 1443]: seen 72200 words at 5129.7 wps, loss = 0.831
[batch 1552]: seen 77650 words at 5147.9 wps, loss = 0.835
[batch 1661]: seen 83100 words at 5165.9 wps, loss = 0.837
[batch 1770]: seen 88550 words at 5180.0 wps, loss = 0.841
[batch 1877]:

ok

----------------------------------------------------------------------
Ran 1 test in 40.039s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

In [8]:
reload(rnnlm); reload(rnnlm_test)
th = rnnlm_test.RunEpochTester("test_toy_model")
th.setUp(); th.injectCode(run_epoch, score_dataset)
unittest.TextTestRunner(verbosity=2).run(th)

test_toy_model (rnnlm_test.RunEpochTester) ... 

[batch 99]: seen 5000 words at 4991.9 wps, loss = 0.956
[batch 213]: seen 10700 words at 5324.3 wps, loss = 0.763
[batch 323]: seen 16200 words at 5379.0 wps, loss = 0.773
[batch 433]: seen 21700 words at 5397.8 wps, loss = 0.817
[batch 540]: seen 27050 words at 5378.8 wps, loss = 0.835
[batch 647]: seen 32400 words at 5369.5 wps, loss = 0.837
[batch 755]: seen 37800 words at 5369.4 wps, loss = 0.842
[batch 865]: seen 43300 words at 5383.3 wps, loss = 0.866
[batch 973]: seen 48700 words at 5384.2 wps, loss = 0.883
[batch 1087]: seen 54400 words at 5411.6 wps, loss = 0.902
[batch 1197]: seen 59900 words at 5416.1 wps, loss = 0.891
[batch 1306]: seen 65350 words at 5416.1 wps, loss = 0.878
[batch 1416]: seen 70850 words at 5418.8 wps, loss = 0.865
[batch 1525]: seen 76300 words at 5419.8 wps, loss = 0.854
[batch 1632]: seen 81650 words at 5414.8 wps, loss = 0.837
[batch 1741]: seen 87100 words at 5414.6 wps, loss = 0.823
[batch 1849]: seen 92500 words at 5411.4 wps, loss = 0.833
[batch 1

ok

----------------------------------------------------------------------
Ran 1 test in 38.201s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

You can use the cells above to verify your implementation of `run_epoch`, and to test your RNN on a (very simple) toy dataset.

Note that as above, this is a *very* simple test case that does not guarantee model correctness.

### 2. Run Training

We'll give you the outline of the training procedure, but you'll need to fill in a call to your `run_epoch` function. 

At the end of training, we use a `tf.train.Saver` to save a copy of the model to `/tmp/w266/a4_model/rnnlm_trained`. You'll want to load this from disk to work on later parts of the assignment; see **part (d)** for an example of how this is done.

#### Tuning Hyperparameters
With a sampled softmax loss, the default hyperparameters should train 5 epochs in around 15 minutes on a single-core GCE instance, and reach a training set perplexity between 120 and 140.

However, it's possible to do significantly better. Try experimenting with multiple RNN layers (`num_layers` > 1) or a larger hidden state - though you may also need to adjust the learning rate and number of epochs for a larger model.

You can also experiment with a larger vocabulary. This will look worse for perplexity, but will be a better model overall as it won't treat so many words as `<unk>`.

#### Notes on Speed

To speed things up, you may want to re-start your GCE instance with more CPUs. Using a 16-core machine will train *very* quickly if using a sampled softmax loss, almost as fast as a GPU. (Because of the sequential nature of the model, GPUs aren't actually much faster than CPUs for training and running RNNs.) The training code will print the words-per-second processed; with the default settings on a single core, you can expect around 8000 WPS, or up to more than 25000 WPS on a fast multi-core machine.

You might also want to modify the code below to only run `score_dataset` at the very end, after all epochs are completed. This will speed things up significantly, since `score_dataset` uses the full softmax loss and so can often take longer than a whole training epoch!

#### Submitting your model
You should submit your trained model along with the assignment. Do:
```
cp /tmp/w266/a4_model/rnnlm_trained* .
git add rnnlm_trained*
git commit -m "Adding trained model."
```
Unless you train a very large model, these files should be < 50 MB and no problem for git to handle. If you do also train a large model, please only submit the smaller one.

### Original Setting

In [6]:
# Load the dataset
V = 10000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

[nltk_data] Downloading package brown to /home/yeunghoman/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Vocabulary: 10,000 types
Loaded 57,340 sentences (1.16119e+06 tokens)
Training set: 45,872 sentences (924,077 tokens)
Test set: 11,468 sentences (237,115 tokens)


In [61]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.01
num_epochs = 5

# Model parameters
model_params = dict(V=vocab.size, 
                    H=100, 
                    softmax_ns=200,
                    num_layers=1)

TF_SAVEDIR = "/tmp/w266/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [29]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, learning_rate=0.01, train=True, verbose=True, tick_s=3600)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        if epoch == num_epochs:
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, train_ids, name="Train set")
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, test_ids, name="Test set")
            print("")
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[epoch 1] Completed in 0:02:32
[epoch 2] Starting epoch 2
[epoch 2] Completed in 0:02:31
[epoch 3] Starting epoch 3
[epoch 3] Completed in 0:02:27
[epoch 4] Starting epoch 4
[epoch 4] Completed in 0:02:28
[epoch 5] Starting epoch 5
[epoch 5] Completed in 0:02:31
[epoch 5] Train set: avg. loss: 4.847  (perplexity: 127.37)
[epoch 5] Test set: avg. loss: 4.982  (perplexity: 145.72)



### Increasing epochs to 20 (saved as final model)

In [10]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.01
num_epochs = 20

# Model parameters
model_params = dict(V=vocab.size, 
                    H=100, 
                    softmax_ns=200,
                    num_layers=1)

TF_SAVEDIR = "/tmp/w266/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [11]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, learning_rate=0.01, train=True, verbose=True, tick_s=3600)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        if epoch == num_epochs:
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, train_ids, name="Train set")
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, test_ids, name="Test set")
            print("")
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[epoch 1] Completed in 0:02:32
[epoch 2] Starting epoch 2
[epoch 2] Completed in 0:02:34
[epoch 3] Starting epoch 3
[epoch 3] Completed in 0:02:28
[epoch 4] Starting epoch 4
[epoch 4] Completed in 0:02:29
[epoch 5] Starting epoch 5
[epoch 5] Completed in 0:02:26
[epoch 6] Starting epoch 6
[epoch 6] Completed in 0:02:28
[epoch 7] Starting epoch 7
[epoch 7] Completed in 0:02:27
[epoch 8] Starting epoch 8
[epoch 8] Completed in 0:02:31
[epoch 9] Starting epoch 9
[epoch 9] Completed in 0:02:31
[epoch 10] Starting epoch 10
[epoch 10] Completed in 0:02:27
[epoch 11] Starting epoch 11
[epoch 11] Completed in 0:02:35
[epoch 12] Starting epoch 12
[epoch 12] Completed in 0:02:34
[epoch 13] Starting epoch 13
[epoch 13] Completed in 0:02:31
[epoch 14] Starting epoch 14
[epoch 14] Completed in 0:02:32
[epoch 15] Starting epoch 15
[epoch 15] Completed in 0:02:26
[epoch 16] Starting epoch 16
[epoch 16] Completed in 0:02:26
[epoch 17] Starting epoch 17
[epoch 17] Completed i

### In addition, increase hidden layer nodes to 150

In [13]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.01
num_epochs = 20

# Model parameters
model_params = dict(V=vocab.size, 
                    H=150, 
                    softmax_ns=200, #k
                    num_layers=1)

TF_SAVEDIR = "/tmp/w266/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [14]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, learning_rate=0.01, train=True, verbose=True, tick_s=3600)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        if epoch == num_epochs:
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, train_ids, name="Train set")
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, test_ids, name="Test set")
            print("")
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[epoch 1] Completed in 0:00:47
[epoch 2] Starting epoch 2
[epoch 2] Completed in 0:00:46
[epoch 3] Starting epoch 3
[epoch 3] Completed in 0:00:46
[epoch 4] Starting epoch 4
[epoch 4] Completed in 0:00:46
[epoch 5] Starting epoch 5
[epoch 5] Completed in 0:00:46
[epoch 6] Starting epoch 6
[epoch 6] Completed in 0:00:46
[epoch 7] Starting epoch 7
[epoch 7] Completed in 0:00:46
[epoch 8] Starting epoch 8
[epoch 8] Completed in 0:00:46
[epoch 9] Starting epoch 9
[epoch 9] Completed in 0:00:46
[epoch 10] Starting epoch 10
[epoch 10] Completed in 0:00:46
[epoch 11] Starting epoch 11
[epoch 11] Completed in 0:00:46
[epoch 12] Starting epoch 12
[epoch 12] Completed in 0:00:47
[epoch 13] Starting epoch 13
[epoch 13] Completed in 0:00:47
[epoch 14] Starting epoch 14
[epoch 14] Completed in 0:00:46
[epoch 15] Starting epoch 15
[epoch 15] Completed in 0:00:46
[epoch 16] Starting epoch 16
[epoch 16] Completed in 0:00:46
[epoch 17] Starting epoch 17
[epoch 17] Completed i

### In addition, increase stack up to 2 RNN layers

In [15]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.01
num_epochs = 20

# Model parameters
model_params = dict(V=vocab.size, 
                    H=150, 
                    softmax_ns=200, #k
                    num_layers=2)

TF_SAVEDIR = "/tmp/w266/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [16]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, learning_rate=0.01, train=True, verbose=True, tick_s=3600)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        if epoch == num_epochs:
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, train_ids, name="Train set")
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, test_ids, name="Test set")
            print("")
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[epoch 1] Completed in 0:01:00
[epoch 2] Starting epoch 2
[epoch 2] Completed in 0:01:00
[epoch 3] Starting epoch 3
[epoch 3] Completed in 0:01:00
[epoch 4] Starting epoch 4
[epoch 4] Completed in 0:01:00
[epoch 5] Starting epoch 5
[epoch 5] Completed in 0:01:00
[epoch 6] Starting epoch 6
[epoch 6] Completed in 0:01:00
[epoch 7] Starting epoch 7
[epoch 7] Completed in 0:01:00
[epoch 8] Starting epoch 8
[epoch 8] Completed in 0:01:00
[epoch 9] Starting epoch 9
[epoch 9] Completed in 0:01:00
[epoch 10] Starting epoch 10
[epoch 10] Completed in 0:01:00
[epoch 11] Starting epoch 11
[epoch 11] Completed in 0:01:00
[epoch 12] Starting epoch 12
[epoch 12] Completed in 0:01:00
[epoch 13] Starting epoch 13
[epoch 13] Completed in 0:01:00
[epoch 14] Starting epoch 14
[epoch 14] Completed in 0:01:00
[epoch 15] Starting epoch 15
[epoch 15] Completed in 0:01:00
[epoch 16] Starting epoch 16
[epoch 16] Completed in 0:01:00
[epoch 17] Starting epoch 17
[epoch 17] Completed i

### In addition, Larger Vocabulary = 20000

In [19]:
# Load the dataset
V = 20000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

[nltk_data] Downloading package brown to /home/yeunghoman/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Vocabulary: 20,000 types
Loaded 57,340 sentences (1.16119e+06 tokens)
Training set: 45,872 sentences (924,077 tokens)
Test set: 11,468 sentences (237,115 tokens)


In [20]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.01
num_epochs = 20

# Model parameters
model_params = dict(V=vocab.size, 
                    H=150, 
                    softmax_ns=200, #k
                    num_layers=2)

TF_SAVEDIR = "/tmp/w266/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [21]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi, learning_rate=learning_rate, train=True, verbose=True, tick_s=3600)
        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        if epoch == num_epochs:
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, train_ids, name="Train set")
            print("[epoch {:d}]".format(epoch), end=" ")
            score_dataset(lm, session, test_ids, name="Test set")
            print("")
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[epoch 1] Completed in 0:01:11
[epoch 2] Starting epoch 2
[epoch 2] Completed in 0:01:10
[epoch 3] Starting epoch 3
[epoch 3] Completed in 0:01:10
[epoch 4] Starting epoch 4
[epoch 4] Completed in 0:01:10
[epoch 5] Starting epoch 5
[epoch 5] Completed in 0:01:09
[epoch 6] Starting epoch 6
[epoch 6] Completed in 0:01:10
[epoch 7] Starting epoch 7
[epoch 7] Completed in 0:01:10
[epoch 8] Starting epoch 8
[epoch 8] Completed in 0:01:10
[epoch 9] Starting epoch 9
[epoch 9] Completed in 0:01:11
[epoch 10] Starting epoch 10
[epoch 10] Completed in 0:01:10
[epoch 11] Starting epoch 11
[epoch 11] Completed in 0:01:11
[epoch 12] Starting epoch 12
[epoch 12] Completed in 0:01:11
[epoch 13] Starting epoch 13
[epoch 13] Completed in 0:01:11
[epoch 14] Starting epoch 14
[epoch 14] Completed in 0:01:11
[epoch 15] Starting epoch 15
[epoch 15] Completed in 0:01:10
[epoch 16] Starting epoch 16
[epoch 16] Completed in 0:01:10
[epoch 17] Starting epoch 17
[epoch 17] Completed i

#### Answer: 

Increasing the number of epochs gave the largest perplexity reduction. Increasing H (the hidden state/embedding size) gave a better in-sample fit but a similar out-of-sample fit, which is a sign of overfitting. Stacking another LSTM layer brought perplexity back to roughly the original level. Doubling the vocabulary size gave the worst performance (although perplexities computed over different vocabularies are not strictly comparable, since a larger V maps fewer tokens to `<unk>`). The final model therefore only increases the number of epochs, since that gave a satisfactory in-sample loss and the lowest out-of-sample loss.
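For reference, the perplexity being compared here is the exponential of the average per-token negative log-probability. The following is a minimal NumPy sketch of that relationship, assuming natural-log probabilities; the helper name `perplexity` and the toy numbers are illustrative and are not part of `rnnlm.py`.

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log P(token)), using natural logs."""
    token_log_probs = np.asarray(token_log_probs, dtype=np.float64)
    return float(np.exp(-np.mean(token_log_probs)))

# Toy check: a model that assigns uniform probability 1/V to every
# token has perplexity V. toy_V is an illustrative value only.
toy_V = 10000
uniform_log_probs = np.full(50, np.log(1.0 / toy_V))
print(perplexity(uniform_log_probs))
```

This also shows why a larger vocabulary tends to raise perplexity: the uniform baseline itself grows with V.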

## (d) Sampling Sentences (5 points)

If you didn't already do so in **part (b)**, implement the `BuildSamplerGraph()` method in `rnnlm.py`. See the function docstring for more information.

#### Implement the `sample_step()` method below (5 points)
This should access the Tensors you create in `BuildSamplerGraph()`. Given an input batch and initial states, it should return a vector of shape `[batch_size,1]` containing sampled indices for the next word of each batch sequence.

Run the method using the provided code to generate 10 sentences.
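To illustrate what a sampling step does conceptually, here is a minimal NumPy sketch that draws one next-word index per batch row from softmax-normalized logits. It is a stand-in under stated assumptions, not the TensorFlow graph built by `BuildSamplerGraph()`; the function name `sample_from_logits` and the toy logits are hypothetical.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index per row from softmax(logits).

    logits: [batch_size, V] array of unnormalized scores.
    Returns: [batch_size, 1] array of sampled indices.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    samples = [rng.choice(probs.shape[1], p=row) for row in probs]
    return np.array(samples, dtype=np.int32).reshape([-1, 1])

rng = np.random.RandomState(42)
toy_logits = np.array([[0.0, 1.0, 2.0],
                       [5.0, 0.0, 0.0]])
samples = sample_from_logits(toy_logits, rng)
print(samples.shape)
```

Sampling (rather than taking the argmax) is what lets repeated runs produce different sentences from the same start token.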

In [12]:
def sample_step(lm, session, input_w, initial_h):
    """Run a single RNN step and return sampled predictions.
  
    Args:
      lm : rnnlm.RNNLM
      session: tf.Session
      input_w : [batch_size] vector of indices
      initial_h : [batch_size, hidden_dims] initial state
    
    Returns:
      final_h : final hidden state, compatible with initial_h
      samples : [batch_size, 1] vector of indices
    """
    # Reshape input to column vector
    input_w = np.array(input_w, dtype=np.int32).reshape([-1,1])
  
    #### YOUR CODE HERE ####
    # Run sample ops
    feed_dict = {lm.input_w_:input_w, lm.initial_h_:initial_h, lm.use_dropout_:False}
    final_h, samples = session.run([lm.final_h_, lm.pred_samples_], feed_dict=feed_dict)
    #### END(YOUR CODE) ####
    # Note indexing here: 
    #   [batch_size, max_time, 1] -> [batch_size, 1]
    return final_h, samples[:,-1,:]

In [13]:
# Same as above, but as a batch
max_steps = 20
num_samples = 10
random_seed = 42

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildSamplerGraph()

with lm.graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(random_seed)
    
    # Load the trained model
    saver.restore(session, trained_filename)

    # Make initial state for a batch with batch_size = num_samples
    w = np.repeat([[vocab.START_ID]], num_samples, axis=0)
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    # We'll take one step for each sequence on each iteration 
    for i in range(max_steps):
        h, y = sample_step(lm, session, w[:,-1:], h)
        w = np.hstack((w,y))

    # Print generated sentences
    for row in w:
        for i, word_id in enumerate(row):
            print(vocab.id_to_word[word_id], end=" ")
            if (i != 0) and (word_id == vocab.START_ID):
                break
        print("")

INFO:tensorflow:Restoring parameters from /tmp/w266/a4_model/rnnlm_trained
<s> the president between DGDG , of these circumstance . <s> 
<s> competent , a measure is a line . <s> 
<s> electric <unk> men . <s> 
<s> to organize states , but `` in meaning of sharing its role '' . <s> 
<s> <unk> <unk> surgeon to the composite issue on stage . <s> 
<s> the <unk> of civilization . <s> 
<s> the reckless customs in the <unk> of hospitals and board has around a pulse grounds of food <unk> <unk> many 
<s> shayne thought to nor even it brought state runs on his sense . <s> 
<s> for chemistry , see the <unk> island of the convincing <unk> army department . <s> 
<s> the strains of the linguist . <s> 


## (e) Linguistic Properties (5 points)

Now that we've trained our RNNLM, let's test a few properties of the model to see how well it learns linguistic phenomena. We'll do this with a scoring task: given two or more test sentences, our model should score the more plausible (or more correct) sentence with a higher log-probability.
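The score relies on the chain rule: the log-probability of a sequence is the sum of the log-probabilities of each word given its history. Below is a small NumPy sketch of that computation, assuming the per-step distributions are already available; `sequence_log_prob` and the toy probabilities are hypothetical and only illustrate the idea.

```python
import numpy as np

def sequence_log_prob(step_probs, target_ids):
    """Sum of log P(target_i | history) over a sequence.

    step_probs: [T, V] array; row i is the model's distribution at step i.
    target_ids: length-T sequence of the word ids that actually came next.
    """
    step_probs = np.asarray(step_probs, dtype=np.float64)
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked)))

# Toy distributions over a 3-word vocabulary for a 2-step sequence.
toy_probs = np.array([[0.5, 0.25, 0.25],
                      [0.1, 0.1, 0.8]])
plausible = sequence_log_prob(toy_probs, [0, 2])    # log(0.5) + log(0.8)
implausible = sequence_log_prob(toy_probs, [1, 0])  # log(0.25) + log(0.1)
print(plausible > implausible)
```

Whether a reported score is a sum or a per-token mean depends on how the loss is reduced in `rnnlm.py`; for sentences of equal length, a higher (less negative) value means more plausible either way.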

We'll define a scoring function to help us:

In [14]:
def score_seq(lm, session, seq, vocab):
    """Score a sequence of words. Returns total log-probability."""
    padded_ids = vocab.words_to_ids(utils.canonicalize_words(["<s>"] + seq + ["</s>"], 
                                                             wordset=vocab.word_to_id))
    w = np.reshape(padded_ids[:-1], [1,-1])
    y = np.reshape(padded_ids[1:],  [1,-1])
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    feed_dict = {lm.input_w_:w,
                 lm.target_y_:y,
                 lm.initial_h_:h,
                 lm.dropout_keep_prob_: 1.0}
    # Return log(P(seq)) = -1*loss
    return -1*session.run(lm.loss_, feed_dict)

def load_and_score(inputs, sort=False):
    """Load the trained model and score the given words."""
    lm = rnnlm.RNNLM(**model_params)
    lm.BuildCoreGraph()
    
    with lm.graph.as_default():
        saver = tf.train.Saver()

    with tf.Session(graph=lm.graph) as session:  
        # Load the trained model
        saver.restore(session, trained_filename)

        if isinstance(inputs[0], (str, bytes)):
            inputs = [inputs]

        # Actually run scoring
        results = []
        for words in inputs:
            score = score_seq(lm, session, words, vocab)
            results.append((score, words))

        # Sort if requested
        if sort: results = sorted(results, reverse=True)

        # Print results
        for score, words in results:
            print("\"{:s}\" : {:.02f}".format(" ".join(words), score))

Now we can test as:

In [15]:
sents = ["once upon a time",
         "the quick brown fox jumps over the lazy dog"]
load_and_score([s.split() for s in sents])

INFO:tensorflow:Restoring parameters from /tmp/w266/a4_model/rnnlm_trained
"once upon a time" : -9.34
"the quick brown fox jumps over the lazy dog" : -8.44


### 1. Number agreement

Compare **"the boy and the girl [are/is]"**. Which is more plausible according to your model?

If your model doesn't order them correctly (*this is OK*), why do you think that might be? (answer in cell below)

In [17]:
#### YOUR CODE HERE ####
sents = ["the boy and the girl is",
         "the boy and the girl are"]
load_and_score([s.split() for s in sents])
#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/w266/a4_model/rnnlm_trained
"the boy and the girl is" : -6.59
"the boy and the girl are" : -6.76


#### Answer to part 1. question(s)

The model thinks that `is` is more plausible (-6.59 vs. -6.76), which is the wrong order.

A likely reason is that the LSTM hidden state has "forgotten" `the boy` by the time `girl` is used to predict the next word, so the verb agrees with the nearest singular noun rather than with the conjoined subject. The trainable weights in the forget gate may not have converged to a good enough solution. More epochs and more training examples may help.
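To get a feel for how strong this preference is, the score gap can be turned into a probability ratio. This sketch assumes the scores are total natural-log probabilities, as the `score_seq` docstring states; if the loss is actually a per-token mean, the ratio would need to be rescaled by sequence length. The helper `probability_ratio` is hypothetical.

```python
import math

def probability_ratio(log_p_a, log_p_b):
    """Return P(a) / P(b) given natural-log probabilities."""
    return math.exp(log_p_a - log_p_b)

# Scores reported above for "... is" and "... are".
ratio = probability_ratio(-6.59, -6.76)
print("{:.2f}".format(ratio))
```

Under that assumption the ratio is about 1.19, so the model's preference for `is` is mild rather than decisive.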

### 2. Type/semantic agreement

Compare:
- **"peanuts are my favorite kind of [nut/vegetable]"**
- **"when I'm hungry I really prefer to [eat/drink]"**

Of each pair, which is more plausible according to your model?

How would you expect a 3-gram language model to perform at this example? How about a 5-gram model? (answer in cell below)

In [18]:
#### YOUR CODE HERE ####
sents = ["peanuts are my favorite kind of nut",
         "peanuts are my favorite kind of vegetable",
         "when I'm hungry I really prefer to eat",
         "when I'm hungry I really prefer to drink"]
load_and_score([s.split() for s in sents])
#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/w266/a4_model/rnnlm_trained
"peanuts are my favorite kind of nut" : -8.51
"peanuts are my favorite kind of vegetable" : -7.97
"when I'm hungry I really prefer to eat" : -8.23
"when I'm hungry I really prefer to drink" : -8.65


#### Answer to part 2. question(s)

The model thinks that `vegetable` is more likely than `nut`, and that `eat` is more likely than `drink`. Note that a peanut is technically a `legume` rather than a `nut`, and it is neither a `vegetable` nor a `fruit`, so it would be fair to say that even a well-trained model might be ambivalent on the first pair.

Neither a 3-gram nor a 5-gram model will work well on these examples, because neither retains a long enough memory: a 5-gram model conditions on only the four preceding words.
- `peanuts are my(1) favorite(2) kind(3) of(4) [nut/vegetable](5)`: `peanuts` is out of range.
- `hungry I(1) really(2) prefer(3) to(4) [eat/drink](5)`: `hungry` is out of range.
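The window argument above can be checked mechanically. This is a minimal sketch, ignoring sentence-boundary padding tokens; the helper `ngram_context` is hypothetical and simply returns the words an n-gram model would condition on.

```python
def ngram_context(words, target_index, n):
    """Return the n-1 words an n-gram model conditions on for words[target_index]."""
    start = max(0, target_index - (n - 1))
    return words[start:target_index]

sentence = "when I'm hungry I really prefer to eat".split()
target = len(sentence) - 1  # index of "eat"

for n in (3, 5):
    context = ngram_context(sentence, target, n)
    # The last column shows whether "hungry" is visible to the model.
    print(n, context, "hungry" in context)
```

For both n = 3 and n = 5 the context excludes `hungry`, so the model has no evidence for preferring `eat` over `drink` beyond the local words.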

### 3. Adjective ordering (just for fun)

Let's repeat the exercise from Week 2:

![Adjective Order](adjective_order.jpg)
*source: https://twitter.com/MattAndersonBBC/status/772002757222002688?lang=en*

We'll consider a toy example (literally), and consider all possible adjective permutations.

Note that this is somewhat sensitive to training, and even a good language model might not get it all correct. Why might the NN fail, if the trigram model from Week 2 was able to solve it?

In [16]:
prefix = "I have lots of".split()
noun = "toys"
adjectives = ["square", "green", "plastic"]
inputs = []
for adjs in itertools.permutations(adjectives):
    words = prefix + list(adjs) + [noun]
    inputs.append(words)
    
load_and_score(inputs, sort=True)

INFO:tensorflow:Restoring parameters from /tmp/w266/a4_model/rnnlm_trained
"I have lots of plastic green square toys" : -8.66
"I have lots of green plastic square toys" : -8.72
"I have lots of green square plastic toys" : -8.77
"I have lots of plastic square green toys" : -8.84
"I have lots of square plastic green toys" : -8.85
"I have lots of square green plastic toys" : -8.94


#### Answer to part 3. question(s)

According to the image, the order `shape colour material`, i.e. `square green plastic`, should be the most natural. However, that permutation has the lowest log-probability of the six (-8.94). If most English sentences in the training data do follow the suggested adjective order, then given enough examples a trigram model can succeed by memorizing the local word order. With an NN/LSTM, on the other hand, the long-term information is captured only implicitly, so its predictions tend to reflect meaning more than exact word order.
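The "triumph by memory" point can be illustrated with a deliberately crude sketch: count which trigrams were seen in a corpus and prefer the permutation with the most seen trigrams. The corpus below is invented purely for illustration, and `trigram_score` is a hypothetical helper, not the Week 2 trigram model.

```python
import itertools
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = [
    "I have square green plastic toys".split(),
    "she bought square green plastic blocks".split(),
    "a green plastic cup".split(),
]

# Count every trigram that appears in the toy corpus.
trigram_counts = Counter()
for sent in toy_corpus:
    for i in range(len(sent) - 2):
        trigram_counts[tuple(sent[i:i + 3])] += 1

def trigram_score(words):
    """Number of trigrams in `words` that were seen in the toy corpus."""
    return sum(trigram_counts[tuple(words[i:i + 3])] > 0
               for i in range(len(words) - 2))

adjectives = ["square", "green", "plastic"]
best = max(itertools.permutations(adjectives),
           key=lambda adjs: trigram_score(list(adjs) + ["toys"]))
print(best)
```

Because only the conventional order ever appears in the toy data, exact local matches single it out, which is the kind of surface memorization an LSTM with dense embeddings does not perform as directly.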