## Seminar 6: Recurrent Neural Networks and Natural Language Processing

### Encoding and Decoding Sequences

![title](img/1.png)

# Unfolding an RNN in time

![title](img/3.png)

![title](img/2.png)

# Multiple Layers in Recurrent Networks

![title](img/4.png)

### Implementing Our First Recurrent Network

TensorFlow supports several variants of RNNs, which can be found in the `tf.nn.rnn_cell`
module. With the `tf.nn.dynamic_rnn()` operation, TensorFlow also implements the RNN
dynamics for us.
There is also a version of this function that adds the unfolded operations to the graph
instead of using a loop. However, this consumes considerably more memory and has no
real benefits. We therefore use the newer `dynamic_rnn()` operation.
As parameters, `dynamic_rnn()` takes a recurrent network definition and the batch of input
sequences. For now, the sequences all have the same length. The function adds the
computations required for the RNN to the compute graph and returns two tensors: the
outputs at each time step and the final hidden state.

In [1]:
import tensorflow as tf

# The input data has dimensions batch_size * sequence_length * frame_size.
# To not restrict ourselves to a fixed batch size, we use None as size of
# the first dimension.
sequence_length = 1440
frame_size = 10
data = tf.placeholder(tf.float32, [None, sequence_length, frame_size])

num_neurons = 200
network = tf.contrib.rnn.BasicRNNCell(num_neurons)
# Define the operations to simulate the RNN for sequence_length steps.
outputs, states = tf.nn.dynamic_rnn(network, data, dtype=tf.float32)
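
To make the loop that `dynamic_rnn()` builds more concrete, here is a minimal NumPy sketch of the same kind of recurrence for a plain tanh cell. It is an illustration under assumptions, not TensorFlow's implementation: the function name `simulate_basic_rnn`, the separate weight matrices `w_x` and `w_h`, and the small sizes are made up for this example.

```python
import numpy as np

def simulate_basic_rnn(inputs, w_x, w_h, b):
    """Run a tanh RNN over a batch of equal-length sequences.

    inputs has shape (batch, time, frame). Returns the outputs for every
    time step and the final hidden state.
    """
    batch_size, num_steps, _ = inputs.shape
    num_units = w_h.shape[0]
    state = np.zeros((batch_size, num_units))
    outputs = np.zeros((batch_size, num_steps, num_units))
    for t in range(num_steps):
        # The same weights are reused at every time step.
        state = np.tanh(inputs[:, t] @ w_x + state @ w_h + b)
        outputs[:, t] = state
    return outputs, state

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 6, 10))        # 4 sequences, 6 steps, 10 features
w_x = rng.normal(scale=0.1, size=(10, 8))  # input-to-hidden weights
w_h = rng.normal(scale=0.1, size=(8, 8))   # hidden-to-hidden weights
b = np.zeros(8)

outputs, final_state = simulate_basic_rnn(batch, w_x, w_h, b)
print(outputs.shape, final_state.shape)
```

For this simple cell, the last slice of `outputs` along the time axis equals `final_state`, so with equal-length sequences either one could feed a classifier.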

### Sequence Classification

In [2]:
import tarfile
import re
import urllib.request
import os
import random

class ImdbMovieReviews:
    """
    The movie review dataset is offered by Stanford University’s AI department:
    http://ai.stanford.edu/~amaas/data/sentiment/. It comes as a compressed tar archive in
    which positive and negative reviews are stored as text files in two separate folders. We
    apply the same pre-processing to the text as in the last section: extracting plain words
    using a regular expression and converting them to lower case.
    """
    DEFAULT_URL = \
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')
    
    def __init__(self):
        # The archive is downloaded once and cached as a local file named 'imdb'.
        self._cache_dir = 'imdb'
        self._url = type(self).DEFAULT_URL

        if not os.path.isfile(self._cache_dir):
            urllib.request.urlretrieve(self._url, self._cache_dir)
        self.filepath = self._cache_dir

    def __iter__(self):
        with tarfile.open(self.filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False
                    
    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
            data = type(self).TOKEN_REGEX.findall(data)
            data = [x.lower() for x in data]
            return data
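
To see what the pre-processing in `_read` does to a piece of text, here is a small stand-alone sketch that applies the same regular expression to a made-up sentence. The helper name `tokenize` and the sample string are assumptions for illustration only.

```python
import re

# Same pattern as ImdbMovieReviews.TOKEN_REGEX: runs of letters, or one of a
# few punctuation marks. Digits, apostrophes and slashes are not matched.
TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

def tokenize(text):
    """Split text into lower-cased words and selected punctuation marks."""
    return [token.lower() for token in TOKEN_REGEX.findall(text)]

sample = "Great movie, isn't it? 10/10!"
tokens = tokenize(sample)
print(tokens)
```

Note that the apostrophe splits `isn't` into two tokens and that the rating `10/10` disappears entirely, since neither digits nor slashes match the pattern.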

The code should be straightforward. We just use the vocabulary to determine the index of
a word and use that index to find the right embedding vector. The following class also
pads the sequences to the same length so we can easily fit batches of multiple reviews
into our network later.

In [3]:
import numpy as np
# spaCy is my favourite NLP framework; it has built-in word embeddings trained on Wikipedia
from spacy.en import English

class Embedding:
    
    def __init__(self, length):
        # spaCy makes using word vectors very easy.
        # The Lexeme, Token, Span and Doc classes all have a .vector property,
        # which is a 1-dimensional numpy array of 32-bit floats.
        self.parser = English()
        self._length = length
        self.dimensions = 300
        
    def __call__(self, sequence):
        data = np.zeros((self._length, self.dimensions))
        # you can access known words from the parser's vocabulary
        embedded = [self.parser.vocab[w].vector for w in sequence]
        data[:len(sequence)] = embedded
        return data
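
The padding step can be tried without installing spaCy. The following sketch replaces the real vocabulary with a hypothetical dictionary of 3-dimensional toy vectors (`TOY_VECTORS`); the function name `embed_and_pad` is likewise made up for this example.

```python
import numpy as np

# Hypothetical toy vectors standing in for spaCy's 300-dimensional ones.
TOY_VECTORS = {
    'good': np.array([1.0, 0.0, 0.5]),
    'bad': np.array([-1.0, 0.0, 0.5]),
    'movie': np.array([0.0, 1.0, 0.2]),
}
DIMENSIONS = 3

def embed_and_pad(sequence, length):
    """Look up each word and zero-pad the result to `length` time steps."""
    data = np.zeros((length, DIMENSIONS))
    # Assumption of this sketch: unknown words map to a zero vector.
    vectors = [TOY_VECTORS.get(word, np.zeros(DIMENSIONS)) for word in sequence]
    if vectors:
        data[:len(vectors)] = vectors
    return data

padded = embed_and_pad(['good', 'movie'], length=5)
print(padded.shape)
```

In this sketch an unknown word becomes a zero vector, which makes it indistinguishable from padding. That matters for any later step that infers the sequence length from non-zero frames.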

### Sequence Labelling Model

We want to classify the sentiment of text sequences. Because this is a supervised
setting, the model needs two placeholders: one for the input `data`, or the sequence, and
one for the `target` value, or the sentiment. We also pass in the `params` object that contains
configuration parameters like the size of the recurrent layer, its cell architecture (LSTM,
GRU, etc.), and the optimizer to use.

In [4]:
from lazy import lazy

class SequenceClassificationModel:
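    def __init__(self, data, params):
        # Note: the `data` argument is not used; the placeholders are created below.
        self.params = params
        self._create_placeholders()
        # Accessing the lazy properties once adds their operations to the graph.
        self.prediction
        self.cost
        self.error
        self.optimize
        self._create_summaries()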
    
    def _create_placeholders(self):
        with tf.name_scope("data"):
            self.data = tf.placeholder(tf.float32, [None, self.params.seq_length, self.params.embed_length])
            self.target = tf.placeholder(tf.float32, [None, 2])
  
    def _create_summaries(self):
        with tf.name_scope("summaries"):
            tf.summary.scalar('loss', self.cost)
            tf.summary.scalar('error', self.error)
            self.summary = tf.summary.merge_all()
            # Keep a reference to the saver so checkpoints can be written later.
            self.saver = tf.train.Saver()
            
    @lazy
    def length(self):
    # First, we obtain the lengths of sequences in the current data batch. We need this since
    # the data comes as a single tensor, padded with zero vectors to the longest review length.
    # Instead of keeping track of the sequence lengths of every review, we just compute it
    # dynamically in TensorFlow.
    
        with tf.name_scope("seq_length"):
            used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
            length = tf.reduce_sum(used, reduction_indices=1)
            length = tf.cast(length, tf.int32)
        return length
    
    @lazy
    def prediction(self):
    # Note that the last relevant output activation of the RNN has a different index for each
    # sequence in the training batch. This is because each review has a different length. We
    # already know the length of each sequence.
    # The problem is that we want to index in the dimension of time steps, which is
    # the second dimension in the batch of shape  (sequences, time_steps, word_vectors) .
    
        with tf.name_scope("recurrent_layer"):
            output, _ = tf.nn.dynamic_rnn(
                self.params.rnn_cell(self.params.rnn_hidden),
                self.data,
                dtype=tf.float32,
                sequence_length=self.length
            )
        last = self._last_relevant(output, self.length)

        with tf.name_scope("softmax_layer"):
            num_classes = int(self.target.get_shape()[1])
            weight = tf.Variable(tf.truncated_normal(
                [self.params.rnn_hidden, num_classes], stddev=0.01))
            bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
            prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction
    
    @lazy
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy
    
    @lazy
    def error(self):
        self.mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(self.mistakes, tf.float32))
    
    @lazy
    def optimize(self):
    # RNNs are quite hard to train and weights tend to diverge if the hyperparameters do not
    # play nicely together. The idea of gradient clipping is to restrict the values of the
    # gradient to a sensible range. This way, we can limit the maximum weight updates.

        with tf.name_scope("optimization"):
            gradient = self.params.optimizer.compute_gradients(self.cost)
            if self.params.gradient_clipping:
                limit = self.params.gradient_clipping
                gradient = [
                    (tf.clip_by_value(g, -limit, limit), v)
                    if g is not None else (None, v)
                    for g, v in gradient]
            optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize
    
    @staticmethod
    def _last_relevant(output, length):
        with tf.name_scope("last_relevant"):
            # As of now, TensorFlow only supports indexing along the first dimension, using
            # tf.gather(). We thus flatten the first two dimensions of the output activations from
            # their shape of sequences x time_steps x word_vectors and construct an index into the
            # resulting tensor.
            batch_size = tf.shape(output)[0]
            max_length = int(output.get_shape()[1])
            output_size = int(output.get_shape()[2])

            # The index takes into account the start indices for each sequence in the flat tensor and adds
            # the sequence length to it. Actually, we only add  length - 1  so that we select the last valid
            # time step.
            index = tf.range(0, batch_size) * max_length + (length - 1)
            flat = tf.reshape(output, [-1, output_size])
            relevant = tf.gather(flat, index)
        return relevant
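
The `length` and `_last_relevant` methods are easier to follow on a tiny example. Below is a NumPy sketch of the same two ideas: counting non-zero frames and gathering through a flat index. The function names and the toy batch are assumptions for illustration; the model itself does this with TensorFlow operations.

```python
import numpy as np

def sequence_lengths(data):
    """Count the non-zero frames per sequence in a (batch, time, features) array."""
    # A frame counts as used if any of its features is non-zero.
    used = np.sign(np.max(np.abs(data), axis=2))
    return used.sum(axis=1).astype(np.int32)

def last_relevant(output, length):
    """Pick output[i, length[i] - 1] for every sequence i via a flat index."""
    batch_size, max_length, output_size = output.shape
    # Start offset of each sequence in the flattened array, plus its last valid step.
    index = np.arange(batch_size) * max_length + (length - 1)
    flat = output.reshape(-1, output_size)
    return flat[index]

# Two sequences padded to four time steps, with two features per step.
data = np.zeros((2, 4, 2))
data[0, :3] = [[1, 2], [3, 4], [5, 6]]   # true length 3
data[1, :1] = [[7, 8]]                   # true length 1

length = sequence_lengths(data)
print(length)
print(last_relevant(data, length))
```

Here the toy batch doubles as the "output" so that the selected rows are easy to recognise: the third frame of the first sequence and the first frame of the second.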

In [5]:
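def preprocess_batched(iterator, length, embedding, batch_size):
    iterator = iter(iterator)
    while True:
        data = np.zeros((batch_size, length, embedding.dimensions))
        target = np.zeros((batch_size, 2))
        for index in range(batch_size):
            try:
                text, label = next(iterator)
            except StopIteration:
                # Stop cleanly once the reviews run out; a trailing partial batch is dropped.
                # (Since Python 3.7, a StopIteration escaping a generator raises RuntimeError.)
                return
            data[index] = embedding(text)
            # One-hot targets: [1, 0] for positive, [0, 1] for negative reviews.
            target[index] = [1, 0] if label else [0, 1]
        yield data, target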
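
The batching and target encoding can be checked in isolation on toy labels. This is a sketch under assumptions: the helpers `batched` and `one_hot` are hypothetical names, and dropping an incomplete final batch is a choice made for this example.

```python
import itertools

def batched(iterable, batch_size):
    """Yield lists of batch_size items and stop once no full batch is left."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if len(batch) < batch_size:
            return
        yield batch

def one_hot(label):
    """Encode a boolean sentiment label as [positive, negative]."""
    return [1, 0] if label else [0, 1]

labels = [True, False, False, True, True]
batches = [[one_hot(label) for label in batch] for batch in batched(labels, 2)]
print(batches)
```

With five labels and a batch size of two, only two full batches are produced and the fifth label is left out.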

In [6]:
reviews = list(ImdbMovieReviews())

In [7]:
random.shuffle(reviews)

In [8]:
length = max(len(x[0]) for x in reviews)
embedding = Embedding(length)

In [9]:
from attrdict import AttrDict

params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
    gradient_clipping=100,
    seq_length=length,
    embed_length=embedding.dimensions
)
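
The `gradient_clipping=100` setting is applied element-wise to each gradient tensor in the model's `optimize` step. A minimal NumPy sketch of that clamping, with a made-up gradient vector, looks like this; the function below only mirrors the idea and is not the TensorFlow operation itself.

```python
import numpy as np

def clip_by_value(gradient, limit):
    """Clamp every gradient component to the range [-limit, limit]."""
    return np.clip(gradient, -limit, limit)

# Hypothetical gradient with two components outside the allowed range.
gradient = np.array([0.5, -250.0, 99.0, 1e4])
clipped = clip_by_value(gradient, limit=100)
print(clipped)
```

Components inside the range pass through unchanged, while the two outliers are cut back to -100 and 100. Because each component is clipped on its own, the direction of the gradient can change, unlike with clipping by norm.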

In [10]:
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

In [None]:
tf.reset_default_graph()

model = SequenceClassificationModel(data, params)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [None]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    summary_writer = tf.summary.FileWriter('graphs', sess.graph)
    for index, batch in enumerate(batches):
        feed = {model.data: batch[0], model.target: batch[1]}
        error, _, summary_str = sess.run([model.error, model.optimize, model.summary], feed)
        print('{}: {:3.1f}%'.format(index + 1, 100 * error))
        if index % 1 == 0:
            summary_writer.add_summary(summary_str, index)


1: 45.0%
2: 50.0%
3: 25.0%
4: 70.0%
5: 30.0%
6: 40.0%
7: 55.0%
8: 50.0%
9: 40.0%
10: 60.0%
11: 40.0%
12: 40.0%
13: 35.0%
14: 65.0%
15: 50.0%
16: 55.0%
17: 60.0%
18: 45.0%
19: 70.0%
20: 55.0%
21: 55.0%
22: 45.0%
23: 45.0%
24: 60.0%
25: 55.0%
26: 55.0%
27: 50.0%
28: 35.0%
29: 45.0%
30: 35.0%
31: 55.0%
32: 60.0%
33: 40.0%
34: 65.0%
35: 35.0%
36: 45.0%
37: 45.0%
38: 50.0%
39: 70.0%
40: 65.0%
41: 45.0%
42: 60.0%
43: 45.0%
44: 50.0%
45: 60.0%
46: 60.0%
47: 30.0%
48: 70.0%
49: 50.0%
50: 40.0%
51: 60.0%
52: 35.0%
53: 60.0%
54: 55.0%
55: 45.0%
56: 35.0%
57: 45.0%
58: 50.0%
59: 45.0%
60: 55.0%
61: 25.0%
62: 45.0%
63: 45.0%
64: 50.0%
65: 35.0%
66: 55.0%
67: 40.0%
68: 45.0%
69: 60.0%
70: 35.0%
71: 60.0%
72: 45.0%
73: 65.0%
74: 55.0%
75: 60.0%
76: 55.0%
77: 60.0%
78: 25.0%
79: 55.0%
80: 40.0%
81: 35.0%
82: 30.0%
83: 50.0%
84: 40.0%
85: 35.0%
86: 65.0%
87: 65.0%
88: 50.0%
89: 40.0%
90: 35.0%
91: 40.0%
92: 40.0%
93: 35.0%
94: 40.0%
95: 40.0%
96: 45.0%
97: 60.0%
98: 60.0%
99: 40.0%
100: 20.0%
101: 30.

761: 25.0%
762: 15.0%
763: 20.0%
764: 20.0%
765: 15.0%
766: 20.0%
767: 5.0%
768: 20.0%
769: 15.0%
770: 20.0%
771: 25.0%
772: 10.0%
773: 20.0%
774: 10.0%
775: 15.0%
776: 5.0%
777: 15.0%
778: 25.0%
779: 15.0%
780: 20.0%
781: 15.0%
782: 20.0%
783: 15.0%
784: 30.0%
785: 35.0%
786: 25.0%
787: 20.0%
788: 25.0%
789: 5.0%
790: 25.0%
791: 10.0%
792: 15.0%
793: 15.0%
794: 30.0%
795: 10.0%
796: 10.0%
797: 5.0%
798: 5.0%
799: 20.0%
800: 10.0%
801: 5.0%
802: 25.0%
803: 10.0%
804: 30.0%
805: 25.0%
806: 5.0%
807: 25.0%
808: 10.0%
809: 15.0%
810: 10.0%
811: 20.0%
812: 10.0%
813: 10.0%
814: 20.0%
815: 10.0%
816: 5.0%
817: 10.0%
818: 10.0%
819: 30.0%
820: 5.0%
821: 10.0%
822: 10.0%
823: 10.0%
824: 40.0%
825: 15.0%
826: 10.0%
827: 20.0%
828: 15.0%
829: 20.0%
830: 10.0%
831: 15.0%
832: 10.0%
833: 10.0%
834: 20.0%
835: 10.0%
836: 15.0%
837: 45.0%
838: 5.0%
839: 5.0%
840: 25.0%
841: 30.0%
842: 30.0%
843: 5.0%
844: 20.0%
845: 5.0%
846: 15.0%
847: 15.0%
848: 15.0%
849: 10.0%
850: 0.0%
851: 20.0%
852: 20.0%
85