## Experimenting with Attention
- For now just reading files
- Using my own tokenizer

**Inspiration:**
- https://distill.pub/2016/augmented-rnns/
- https://arxiv.org/pdf/1511.04586v1.pdf
- https://blog.heuritech.com/2016/01/20/attention-mechanism/
- https://www.slideshare.net/KeonKim/attention-mechanisms-with-tensorflow
- https://www.youtube.com/watch?v=ah7_mfl7LD0&t=4131s (minut 17)
- https://www.youtube.com/watch?v=uuPZFWJ-4bE (minut 18:30)
- http://www.manythings.org/anki/

**Improvment ideas:**
- Calculate BLEU Scores
- Reduce "num_words_src" if found is less

**Find out**
- .call og .variables for objects: what do they do?

**Illustration**
The figure below illustrates to some extend how the attention mechanism works. The "attention vector" is calculated by adding the latest hidden output of target to each of the word output of the target. The best match gains the largest value. And this word gets the largest attention during translation:

![title](attention.PNG)

In [1]:
from __future__ import absolute_import, division, print_function

In [2]:
import tensorflow as tf
tf.enable_eager_execution()

from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
tf.__version__

'1.11.0'

In [36]:
import numpy as np
import os
import time
import matplotlib.pyplot as plt

In [4]:
# global variables

num_words_src = 5000       # Limit vocabulary in translation for source language
num_words_tar = 5000       # Limit vocabulary in translation for target language

dataSetSize = 16000        # small dataset = 16085, all data = 9999999
truncate_std_div = 2       # truncate sentences after x tokens, 2 std dev = 95% included
idx = 15000

BATCH_SIZE = 64            # training batch size

embedding_dim = 256        # Embedding dimensions
GRU_units = 1024           # GRU dimension

mark_start = 'ssss '       # start and end markes for destination sentences
mark_end = ' eeee'

## Read data into tables, small dataset

In [5]:
# create lists for source and target texts
source_texts_smallset = []
target_texts_smallset = []

# read file
with open('dan-eng/dan.txt', 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
    
# split into source and target, add start and end marks
for line in lines[:len(lines)-1]:
    target_sentence, source_sentence = line.split('\t')
    target_sentence = mark_start + target_sentence.strip() + mark_end
    source_texts_smallset.append(source_sentence)
    target_texts_smallset.append(target_sentence)

print('Size of small dataset: ', len(source_texts_smallset))

Size of small dataset:  16085


In [6]:
print('Size of small dataset: ', len(source_texts_smallset))
print(source_texts_smallset[2000])
print(target_texts_smallset[2000])

Size of small dataset:  16085
Hvad har Tom for?
ssss What's Tom up to? eeee


## Read data into tables, large dataset

In [7]:
source_texts = []
target_texts = []

In [8]:
# source into a table, the second and larger dataset
filename = "europarl-v7.da-en.da"
data_dir = "data/europarl/"
path = os.path.join(data_dir, filename)
with open(path, encoding="utf-8") as file:
    # Read the line from file, strip leading and trailing whitespace,
    # prepend the start-text and append the end-text.
    source_texts = [line.strip() for line in file]

In [9]:
# destination into a table, the second and larger dataset
filename = "europarl-v7.da-en.en"
path = os.path.join(data_dir, filename)
with open(path, encoding="utf-8") as file:
    # Read the line from file, strip leading and trailing whitespace,
    # prepend the start-text and append the end-text.
    target_texts = [mark_start + line.strip() + mark_end for line in file]

In [10]:
print('Size of large dataset: ', len(source_texts))
print(source_texts[idx])
print(target_texts[idx])

Size of large dataset:  1968800
Jeg er sikker på, at vi vil få andre ansøgninger om optagelse, og at dette skønne arbejde, dette skønne intellektuelle bygningsværk derfor vil styrte sammen, fordi det bliver overhalet af begivenheder, sådan som vi i nu 20 år er blevet overhalet af alt det, der er sket i det tidligere Jugoslavien.
ssss There are bound to be other applications and this great work, this fine intellectual architecture, will crumble, left behind by events just as we have been left behind by everything that has been happening in the former Yugoslavia for the past 20 years. eeee


## Join the two datasets ... and limit data set size

In [11]:
# join the two data set to one big, gives me both short and long sentences
source_texts = source_texts_smallset + source_texts
target_texts = target_texts_smallset + target_texts
print('Size of small+large dataset: ', len(source_texts))

Size of small+large dataset:  1984885


In [12]:
# shorten data sets to speed up training for easy experimentation
print('Original dataset size:   ', len(source_texts), len(target_texts))
source_texts = source_texts[:dataSetSize]
target_texts = target_texts[:dataSetSize]
print('New lighter dataset size:', len(source_texts), len(target_texts))

Original dataset size:    1984885 1984885
New lighter dataset size: 16000 16000


## Example of training data

In [13]:
# plot some examples
import pandas as pd
pd.set_option('display.max_colwidth', -1)
df = pd.DataFrame({'Source texts':source_texts, 'Target texts':target_texts})
df.sample(5)

Unnamed: 0,Source texts,Target texts
9920,Tom ønskede at se Marys værelse.,ssss Tom wanted to see Mary's room. eeee
6121,Jeg plejer at tage mig af opvasken.,ssss I usually do the dishes. eeee
7463,Mary er en oprørsk pige.,ssss Mary is a rebellious girl. eeee
11096,"Han spiller golf, selvom det regner.",ssss He'll play golf even if it rains. eeee
4125,Jeg kan lide at spise æbler.,ssss I like to eat apples. eeee


## Tokenize source sentences

In [14]:
# crate source tokenizer and create vocabulary from the texts
tokenizer_src = Tokenizer(num_words=num_words_src)
tokenizer_src.fit_on_texts(source_texts)
print('Found %s unique source tokens.' % len(tokenizer_src.word_index))

# translate from word sentences to token sentences
tokens_src = tokenizer_src.texts_to_sequences(source_texts)

# Reverse the token-sequences
# tokens_inp = [list(reversed(x)) for x in tokens_inp]

# Shorten the longest token sentences, 
# Find the length of all sentences, truncate after x * std deviations
num_tokens = [len(x) for x in tokens_src]
print('Longest sentence is %s tokens.' % max(num_tokens))
max_tokens_src = np.mean(num_tokens) + truncate_std_div * np.std(num_tokens)
max_tokens_src = min(int(max_tokens_src), max(num_tokens))
print('Sentences shortened to max %s tokens.' % max_tokens_src)

# Pad / truncate all token-sequences to the given length
tokens_padded_src = pad_sequences(tokens_src,
                                  maxlen=max_tokens_src,
                                  padding='post',
                                  truncating='post')

# Create inverse lookup from integer-tokens to words
index_to_word_src = dict(zip(tokenizer_src.word_index.values(), 
                             tokenizer_src.word_index.keys()))

# function to return readable text from tokens string
def tokens_to_string_src(tokens):
    words = [index_to_word_src[token] 
            for token in tokens
            if token != 0]
    text = " ".join(words)
    return text

# demo to show that it works
print('Shape of source tokens:', tokens_padded_src.shape)
print('Source example:')
print('As tokens:    ', tokens_padded_src[idx])
print('As recreated: ', tokens_to_string_src(tokens_padded_src[idx]))
print('As original:  ', source_texts[idx])

Found 7296 unique source tokens.
Longest sentence is 18 tokens.
Sentences shortened to max 10 tokens.
Shape of source tokens: (16000, 10)
Source example:
As tokens:     [  6   1  18 352   7 129  55 950   0   0]
As recreated:  det er et problem du selv må løse
As original:   Det er et problem du selv må løse.


## Tokenize target sentences

In [15]:
# crate source tokenizer and create vocabulary from the texts
tokenizer_tar = Tokenizer(num_words=num_words_tar)
tokenizer_tar.fit_on_texts(target_texts)
print('Found %s unique target tokens.' % len(tokenizer_tar.word_index))

# translate from word sentences to token sentences
tokens_tar = tokenizer_tar.texts_to_sequences(target_texts)

# Shorten the longest token sentences, 
# Find the length of all sentences, truncate after x * std deviations
num_tokens = [len(x) for x in tokens_tar]
print('Longest sentence is %s tokens.' % max(num_tokens))
max_tokens_tar = np.mean(num_tokens) + truncate_std_div * np.std(num_tokens)
max_tokens_tar = min(int(max_tokens_tar), max(num_tokens))
print('Sentences shortened to max %s tokens.' % max_tokens_tar)

# Pad / truncate all token-sequences to the given length
tokens_padded_tar = pad_sequences(tokens_tar,
                                  maxlen=max_tokens_tar,
                                  padding='post',
                                  truncating='post')

# Create inverse lookup from integer-tokens to words
index_to_word_tar = dict(zip(tokenizer_tar.word_index.values(), 
                             tokenizer_tar.word_index.keys()))

# function to return readable text from tokens string
def tokens_to_string_tar(tokens):
    words = [index_to_word_tar[token] 
            for token in tokens
            if token != 0]
    text = " ".join(words)
    return text

# demo to show that it works
print('Shape of target tokens:', tokens_padded_tar.shape)
print('Target example:')
print('As tokens:    ', tokens_padded_tar[idx])
print('As recreated: ', tokens_to_string_tar(tokens_padded_tar[idx]))
print('As original:  ', target_texts[idx])

Found 5438 unique target tokens.
Longest sentence is 18 tokens.
Sentences shortened to max 12 tokens.
Shape of target tokens: (16000, 12)
Target example:
As tokens:     [  1  17   8   5 220   7  13   9 903  81 414   2]
As recreated:  ssss this is a problem you have to solve by yourself eeee
As original:   ssss This is a problem you have to solve by yourself. eeee


In [16]:
# start and end marks as tokens, needed when translating
token_start = tokenizer_tar.word_index[mark_start.strip()]
token_end =   tokenizer_tar.word_index[mark_end.strip()]
print(token_start, token_end)

1 2


## Create a tf.data dataset

In [17]:
BUFFER_SIZE = len(tokens_padded_src)
N_BATCH = BUFFER_SIZE//BATCH_SIZE

dataset = tf.data.Dataset.from_tensor_slices((tokens_padded_src, tokens_padded_tar)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

## Write Encoder Object

In [18]:
class Encoder(tf.keras.Model):

    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.CuDNNGRU(enc_units, 
                                            return_sequences=True, 
                                            return_state=True, 
                                            recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)        
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

## Write Decoder object

In [19]:
class Decoder(tf.keras.Model):

    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.CuDNNGRU(dec_units, 
                                            return_sequences=True, 
                                            return_state=True, 
                                            recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        
        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V  = tf.keras.layers.Dense(1)

    def call(self, x, hidden, enc_output):
        
        # enc_output shape == (batch_size, max_length, hidden_size)
           
        # ----------------------------- ATTENTION !
    
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        
        # score shape == (batch_size, max_length, hidden_size)
        score = tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis))
        
        # attention_weights shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        # -----------------------------
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state, attention_weights
            
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))

## Instantiate Encoder and Decoder

In [20]:
encoder = Encoder(num_words_src, embedding_dim, GRU_units, BATCH_SIZE)
decoder = Decoder(num_words_tar, embedding_dim, GRU_units, BATCH_SIZE)

## Train the model ...

In [21]:
optimizer = tf.train.AdamOptimizer()

def loss_function(real, pred):
  mask = 1 - np.equal(real, 0)
  loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
  return tf.reduce_mean(loss_)

In [22]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [24]:
EPOCHS = 10

for epoch in range(EPOCHS):
    start = time.time()
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for (batch, (inp, targ)) in enumerate(dataset):
        loss = 0
        
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(inp, hidden)    # enc_output is 64x 10x1024
                                                             # enc_hidden is 64x    1024
            dec_hidden = enc_hidden
            
            # 64x1 tensor with ssss
            dec_input = tf.expand_dims([token_start] * BATCH_SIZE, 1)       
            
            # Teacher forcing - feeding the target as the next input
            for t in range(1, targ.shape[1]):                # targ is 64x12, ie for 12 words
                
                # passing enc_output to the decoder, predictions=64x5000, dec_hidden=64x1024
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                
                loss += loss_function(targ[:, t], predictions)
                
                # using teacher forcing
                dec_input = tf.expand_dims(targ[:, t], 1)
        
        batch_loss = (loss / int(targ.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradients = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradients, variables))
        
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
      checkpoint.save(file_prefix = checkpoint_prefix)
    
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / N_BATCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 4.7356
Epoch 1 Batch 100 Loss 2.9247
Epoch 1 Batch 200 Loss 2.5249
Epoch 1 Loss 2.9388
Time taken for 1 epoch 35.535956382751465 sec

Epoch 2 Batch 0 Loss 2.3019
Epoch 2 Batch 100 Loss 2.4626
Epoch 2 Batch 200 Loss 1.9072
Epoch 2 Loss 2.1648
Time taken for 1 epoch 35.72203516960144 sec

Epoch 3 Batch 0 Loss 1.7333
Epoch 3 Batch 100 Loss 1.4646
Epoch 3 Batch 200 Loss 1.4315
Epoch 3 Loss 1.5038
Time taken for 1 epoch 34.69463777542114 sec

Epoch 4 Batch 0 Loss 1.0534
Epoch 4 Batch 100 Loss 1.1523
Epoch 4 Batch 200 Loss 0.9760
Epoch 4 Loss 1.0431
Time taken for 1 epoch 35.240270376205444 sec

Epoch 5 Batch 0 Loss 0.7859
Epoch 5 Batch 100 Loss 0.7961
Epoch 5 Batch 200 Loss 0.8847
Epoch 5 Loss 0.7534
Time taken for 1 epoch 35.15317153930664 sec

Epoch 6 Batch 0 Loss 0.6231
Epoch 6 Batch 100 Loss 0.5848
Epoch 6 Batch 200 Loss 0.5742
Epoch 6 Loss 0.5538
Time taken for 1 epoch 35.021015882492065 sec

Epoch 7 Batch 0 Loss 0.4909
Epoch 7 Batch 100 Loss 0.4023
Epoch 7 Batch 2

## Translate functions

In [45]:
# function that does the translation

def translate_sequence(input_seq):
    
    result = ''
    attention_plot = np.zeros((num_words_tar, num_words_src))
    
    # tokenize the text to be translated, pad and convert tto tensor
    input_tokens = tokenizer_src.texts_to_sequences([input_seq])
    input_tokens = pad_sequences(input_tokens,
                                 maxlen=max_tokens_src,
                                 padding='post',
                                 truncating='post')
    input_tokens = tf.convert_to_tensor(input_tokens)
    
    # run encoder on input sentence
    hidden = [tf.zeros((1, GRU_units))]
    enc_out, enc_hidden = encoder(input_tokens, hidden)
    
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([token_start], 0)
 
    for t in range(max_tokens_tar):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weigths to plot later on
  #      attention_weights = tf.reshape(attention_weights, (-1, ))
  #      attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += index_to_word_tar[predicted_id] + ' ' 

        if index_to_word_tar[predicted_id] == '<end>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, input_seq, attention_plot

In [46]:
translate_sequence('hej med dig')

('hi with you eeee eeee eeee eeee eeee eeee eeee eeee eeee ',
 'hej med dig',
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]))

In [25]:
def evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    
    sentence = preprocess_sentence(sentence)

    inputs = [inp_lang.word2idx[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word2idx['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weigths to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.idx2word[predicted_id] + ' '

        if targ_lang.idx2word[predicted_id] == 'eeee':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

In [37]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()

In [27]:
def translate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    result, sentence, attention_plot = evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
        
    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(result))
    
    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))

## Restore the latest checkpoint and test

In [23]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.checkpointable.util.CheckpointLoadStatus at 0x2adca827518>

In [29]:
translate('hvad hedder du', encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)

NameError: name 'inp_lang' is not defined

## tester ...