## Machine translator v/01
First attempt to make a seq2seq machine translator. 
Credits, inspiration, and thanks to:
- Hvass @ https://github.com/Hvass-Labs
- Challet @ https://github.com/fchollet/deep-learning-with-python-notebooks
- Keras Blog https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Ideas for improvment:
- Reverse word sequence in source text
- Bi-directional RNN
- Attention mechanism
- Save Tokenizer: https://stackoverflow.com/questions/45735070/keras-text-preprocessing-saving-tokenizer-object-to-file-for-scoring

## Import modules

In [1]:
import tensorflow as tf
tf.__version__

  from ._conv import register_converters as _register_converters


'1.5.0'

In [2]:
from tensorflow.python.keras.models import Model
tf.keras.__version__

'2.1.2-tf'

In [3]:
import numpy as np

## Download data
Load data from the internet and extract from tar-file. Data from http://www.statmt.org/europarl/

In [4]:
import sys
import os
import urllib.request
import tarfile
import zipfile

# data location in my system and on the internet
data_dir = "data/europarl/"
data_url = "http://www.statmt.org/europarl/v7/"

# full url to data
url = data_url + 'da' + "-en.tgz"

# function to print download progress
def _print_download_progress(count, block_size, total_size):
    pct_complete = float(count * block_size) / total_size
    pct_complete = min(1.0, pct_complete)
    msg = "\r- Download progress: {0:.1%}".format(pct_complete)
    sys.stdout.write(msg)
    sys.stdout.flush()

# set file name and "save" path
filename = url.split('/')[-1]
file_path = os.path.join(data_dir, filename)

# if file does not exist, then download and extract
if not os.path.exists(file_path):
    
    # make dir, if not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)

    # Download the file from the internet.
    file_path, _ = urllib.request.urlretrieve(url=url, filename=file_path, reporthook=_print_download_progress)
    print()
    print("Download finished. Extracting files.")

    # unzip or untar
    if file_path.endswith(".zip"):
        zipfile.ZipFile(file=file_path, mode="r").extractall(data_dir)
    elif file_path.endswith((".tar.gz", ".tgz")):
        tarfile.open(name=file_path, mode="r:gz").extractall(data_dir)
    print("Done.")
else:
    print("Data has apparently already been downloaded and unpacked.")

Data has apparently already been downloaded and unpacked.


## Source and destination text into tables

In [5]:
# markers to mark start and end of destination texts
mark_start = 'ssss '
mark_end = ' eeee'

In [6]:
# source into a table
filename = "europarl-v7.da-en.da"
path = os.path.join(data_dir, filename)
with open(path, encoding="utf-8") as file:
    # Read the line from file, strip leading and trailing whitespace,
    # prepend the start-text and append the end-text.
    data_src = [line.strip() for line in file]

In [7]:
# destination into a table
filename = "europarl-v7.da-en.en"
path = os.path.join(data_dir, filename)
with open(path, encoding="utf-8") as file:
    # Read the line from file, strip leading and trailing whitespace,
    # prepend the start-text and append the end-text.
    data_dest = [mark_start + line.strip() + mark_end for line in file]

## Example data

In [8]:
i = 2
print(data_src[i])
print(data_dest[i])

Som De kan se, indfandt det store "år 2000-problem" sig ikke. Til gengæld har borgerne i en del af medlemslandene været ramt af meget forfærdelige naturkatastrofer.
ssss Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. eeee


## Reduce dataset size
to reduce training time dureing building model and experimentation

In [9]:
print
print('Original dataset size:   ', len(data_src), len(data_dest))
dataSetSize = 10000
data_src = data_src[:dataSetSize]
data_dest = data_dest[:dataSetSize]
print('New lighter dataset size:', len(data_src), len(data_dest))

Original dataset size:    1968800 1968800
New lighter dataset size: 10000 10000


## Tokenize and pad SOURCE language
Key outputs are:
- "data_src", raw text inputs
- "tokens_padded_src", a table that contains all the padded texts converted to tokens
- "tokens_to_string_src", a function that transforms a token-list to a readable text

In [10]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

num_words = 10000

Using TensorFlow backend.


In [11]:
# crate source tokenizer and create vocabulary from the texts
tokenizer_src = Tokenizer(num_words=num_words)
tokenizer_src.fit_on_texts(data_src)
print('Found %s unique source tokens.' % len(tokenizer_src.word_index))

Found 16800 unique source tokens.


In [12]:
# translate from words to tokens
tokens_src = tokenizer_src.texts_to_sequences(data_src)

In [13]:
# Shorten the longest tokens, Find the length of all sentences, truncate after 2 * std deviation
num_tokens = [len(x) for x in tokens_src]
max_tokens_src = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens_src = int(max_tokens_src)

In [14]:
# Pad / truncate all token-sequences to the given length.
# This creates a 2-dim numpy matrix that is easier to use.
tokens_padded_src = pad_sequences(tokens_src,
                                  maxlen=max_tokens_src,
                                  padding='post',
                                  truncating='post')
print(tokens_padded_src.shape)

(10000, 51)


In [15]:
# Create inverse lookup from integer-tokens to words
index_to_word_src = dict(zip(tokenizer_src.word_index.values(), tokenizer_src.word_index.keys()))

# function to return readable text from tokens string
def tokens_to_string_src(tokens):
    words = [index_to_word_src[token] 
            for token in tokens
            if token != 0]
    text = " ".join(words)
    return text    

In [16]:
# demo to show that it works
idx = 2
tokens_to_string_src(tokens_padded_src[idx])

'som de kan se indfandt det store år 2000 problem sig ikke til gengæld har borgerne i en del af medlemslandene været ramt af meget forfærdelige naturkatastrofer'

In [17]:
data_src[idx]

'Som De kan se, indfandt det store "år 2000-problem" sig ikke. Til gengæld har borgerne i en del af medlemslandene været ramt af meget forfærdelige naturkatastrofer.'

In [18]:
tokens_padded_src[idx]

array([  14,    9,   24,  138, 8374,    4,  124,   77,  171,  399,   34,
         19,    8, 2090,   20,  280,    3,   10,  165,    7,  778,  120,
       1101,    7,   45, 4008, 1939,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0])

## Tokenize and pad DESTINATION language
Key outputs are:
- "data_dest", raw text inputs
- "tokens_padded_dest", a table that contains all the padded texts converted to tokens
- "tokens_to_string_dest", a function that transforms a token-list to a readable text

In [19]:
# crate destination tokenizer and create vocabulary from the texts
tokenizer_dest = Tokenizer(num_words=num_words)
tokenizer_dest.fit_on_texts(data_dest)
print('Found %s unique destination tokens.' % len(tokenizer_dest.word_index))

Found 10902 unique destination tokens.


In [20]:
# translate from words to tokens
tokens_dest = tokenizer_dest.texts_to_sequences(data_dest)

In [21]:
# Shorten the longest tokens, Find the length of all sentences, truncate after 2 * std deviation
num_tokens = [len(x) for x in tokens_dest]
max_tokens_dest = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens_dest = int(max_tokens_dest)

In [22]:
# Pad / truncate all token-sequences to the given length.
# This creates a 2-dim numpy matrix that is easier to use.
tokens_padded_dest = pad_sequences(tokens_dest,
                                   maxlen=max_tokens_dest,
                                   padding='post',
                                   truncating='post')
print(tokens_padded_dest.shape)

(10000, 58)


In [23]:
# Create inverse lookup from integer-tokens to words
index_to_word_dest = dict(zip(tokenizer_dest.word_index.values(), tokenizer_dest.word_index.keys()))

# function to return readable text from tokens string
def tokens_to_string_dest(tokens):
    words = [index_to_word_dest[token] 
            for token in tokens
            if token != 0]
    text = " ".join(words)
    return text    

In [24]:
# start and end marks as tokens
token_start = tokenizer_dest.word_index[mark_start.strip()]
token_end = tokenizer_dest.word_index[mark_end.strip()]

In [25]:
# demo to show that it works
idx = 2
tokens_to_string_dest(tokens_padded_dest[idx])

"ssss although as you will have seen the dreaded 'millennium bug' failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful eeee"

In [26]:
data_dest[idx]

"ssss Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. eeee"

In [27]:
tokens_padded_dest[idx]

array([   2,  390,   21,   35,   24,   20,  592,    1, 6710, 6711, 6712,
       1757,    5, 4291,  186,    1,   93,    7,    9,  246,    4,  120,
       1967,    9, 1599,    4, 1032,  871,   10, 1216,  115, 4292,    3,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])

## Training data
- Input to the encoder is simply the source language as it is
- Inputs to the decoder are slightly more complicated, since the two input strings are shiften one time-step: The model has to learn to predict the "next" token in the output from the input. Slizing is used to get two "views" to the data

In [28]:
encoder_input_data = tokens_padded_src
encoder_input_data.shape

(10000, 51)

In [29]:
decoder_input_data = tokens_padded_dest[:, :-1]
decoder_input_data.shape

(10000, 57)

In [30]:
decoder_output_data = tokens_padded_dest[:, 1:]
decoder_output_data.shape

(10000, 57)

Examples showing the training data to the model

In [31]:
idx = 2
decoder_input_data[idx]

array([   2,  390,   21,   35,   24,   20,  592,    1, 6710, 6711, 6712,
       1757,    5, 4291,  186,    1,   93,    7,    9,  246,    4,  120,
       1967,    9, 1599,    4, 1032,  871,   10, 1216,  115, 4292,    3,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0])

In [32]:
decoder_output_data[idx]

array([ 390,   21,   35,   24,   20,  592,    1, 6710, 6711, 6712, 1757,
          5, 4291,  186,    1,   93,    7,    9,  246,    4,  120, 1967,
          9, 1599,    4, 1032,  871,   10, 1216,  115, 4292,    3,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0])

In [33]:
tokens_to_string_dest(decoder_input_data[idx])

"ssss although as you will have seen the dreaded 'millennium bug' failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful eeee"

In [34]:
tokens_to_string_dest(decoder_output_data[idx])

"although as you will have seen the dreaded 'millennium bug' failed to materialise still the people in a number of countries suffered a series of natural disasters that truly were dreadful eeee"

## Creating the Neural Network
### Create the Encoder model

In [35]:
from tensorflow.python.keras.layers import Input, Embedding, GRU, Dense
from tensorflow.python.keras.optimizers import RMSprop
from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

In [36]:
# network sizes
embedding_size = 128
state_size = 512

# connect encoder
encoder_input = Input(shape=(None, ), name='encoder_input')
net = Embedding(input_dim=num_words, output_dim=embedding_size, name='encoder_embedding')(encoder_input)
net = GRU(state_size, name='encoder_gru1', return_sequences=True)(net)
net = GRU(state_size, name='encoder_gru2', return_sequences=True)(net)
net = GRU(state_size, name='encoder_gru3', return_sequences=False)(net)
encoder_output = net

# Encoder model
model_encoder = Model(inputs=[encoder_input],
                      outputs=[encoder_output])
model_encoder.summary()

Instructions for updating:
keep_dims is deprecated, use keepdims instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
encoder_input (InputLayer)   (None, None)              0         
_________________________________________________________________
encoder_embedding (Embedding (None, None, 128)         1280000   
_________________________________________________________________
encoder_gru1 (GRU)           (None, None, 512)         984576    
_________________________________________________________________
encoder_gru2 (GRU)           (None, None, 512)         1574400   
_________________________________________________________________
encoder_gru3 (GRU)           (None, 512)               1574400   
Total params: 5,413,376
Trainable params: 5,413,376
Non-trainable params: 0
_________________________________________________________________


### Create Decoder model
Create the decoder-part which maps the "thought vector" to a sequence of integer-tokens. The decoder takes two inputs. First it needs the "thought vector" produced by the encoder which summarizes the contents of the input-text.

In [37]:
# initial state for the decoder (given from encoder)
decoder_initial_state = Input(shape=(state_size,), name='decoder_initial_state')

# input to decoder (destination text given to estimate next word)
decoder_input = Input(shape=(None, ), name='decoder_input')

# connect decoder
net  = Embedding(input_dim=num_words, output_dim=embedding_size, name='decoder_embedding')(decoder_input)
net = GRU(state_size, name='decoder_gru1', return_sequences=True)(net, initial_state=decoder_initial_state)
net = GRU(state_size, name='decoder_gru2', return_sequences=True)(net, initial_state=decoder_initial_state)
net = GRU(state_size, name='decoder_gru3', return_sequences=True)(net, initial_state=decoder_initial_state)
decoder_output = Dense(num_words, activation='linear', name='decoder_output')(net)

# Decoder model
model_decoder = Model(inputs=[decoder_input, decoder_initial_state],
                      outputs=[decoder_output])
model_decoder.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
decoder_embedding (Embedding)   (None, None, 128)    1280000     decoder_input[0][0]              
__________________________________________________________________________________________________
decoder_initial_state (InputLay (None, 512)          0                                            
__________________________________________________________________________________________________
decoder_gru1 (GRU)              (None, None, 512)    984576      decoder_embedding[0][0]          
                                                                 decoder_initial_state[0][0]      
__________

### Create TRAINING model
This model connect from the input language to the translated language

In [38]:
# input to decoder (destination text given to estimate next word)
decoder_input = Input(shape=(None, ), name='decoder_input')

# embedding layer
net  = Embedding(input_dim=num_words, output_dim=embedding_size, name='decoder_embedding')(decoder_input)

# connect training model
net = GRU(state_size, name='decoder_gru1', return_sequences=True)(net, initial_state=encoder_output)
net = GRU(state_size, name='decoder_gru2', return_sequences=True)(net, initial_state=encoder_output)
net = GRU(state_size, name='decoder_gru3', return_sequences=True)(net, initial_state=encoder_output)
decoder_output = Dense(num_words, activation='linear', name='decoder_output')(net)

# Model to train the network
model_train = Model(inputs=[encoder_input, decoder_input],
                    outputs=[decoder_output])
model_train.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
encoder_embedding (Embedding)   (None, None, 128)    1280000     encoder_input[0][0]              
__________________________________________________________________________________________________
encoder_gru1 (GRU)              (None, None, 512)    984576      encoder_embedding[0][0]          
__________________________________________________________________________________________________
decoder_input (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
encoder_gr

### Loss function

In [39]:
def sparse_cross_entropy(y_true, y_pred):
    # Calculate the loss. This outputs a 2-rank tensor of shape [batch_size, sequence_length]
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    loss_mean = tf.reduce_mean(loss)
    return loss_mean

### Compile the models

In [40]:
optimizer = RMSprop(lr=1e-3)

In [41]:
decoder_target = tf.placeholder(dtype='int32', shape=(None, None))

In [42]:
optimizeroptimize  = RMSprop(lr=1e-3)
decoder_targetdecoder_  = tf.placeholder(dtype='int32', shape=(None, None))
model_train.compile(optimizer=optimizer,
                    loss=sparse_cross_entropy,
                    target_tensors=[decoder_target])

Instructions for updating:
keep_dims is deprecated, use keepdims instead


### Callback functions

In [43]:
path_checkpoint = 'tgc_checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                                      monitor='val_loss',
                                      verbose=1,
                                      save_weights_only=True,
                                      save_best_only=True)

callback_early_stopping = EarlyStopping(monitor='val_loss',
                                        patience=3, verbose=1)

callback_tensorboard = TensorBoard(log_dir='./21_logs/',
                                   histogram_freq=0,
                                   write_graph=False)

callbacks = [callback_early_stopping,
             callback_checkpoint,
             callback_tensorboard]

## Train the model

In [44]:
x_data = {'encoder_input': encoder_input_data, 'decoder_input': decoder_input_data}
y_data = {'decoder_output' : decoder_output_data}

model_train.fit(x=x_data,
                y=y_data,
                batch_size=640,
                epochs=10,
                validation_split=0.1,
                callbacks=callbacks)

Train on 9000 samples, validate on 1000 samples
Epoch 1/10

Epoch 2/10

Epoch 3/10

Epoch 4/10

Epoch 5/10

Epoch 6/10

Epoch 7/10

Epoch 8/10

Epoch 9/10

Epoch 10/10



<tensorflow.python.keras._impl.keras.callbacks.History at 0x1e481873e10>

## Translate texts

In [50]:
def translate(input_text, true_output_text=None):

    # tokenize the text to be translated
    input_tokens = tokenizer_src.texts_to_sequences([input_text])
    input_tokens = pad_sequences(input_tokens,
                                 maxlen=max_tokens_dest,
                                 padding='post',
                                 truncating='post')
    
    # calculate thought vector
    initial_state = model_encoder.predict(input_tokens)
    
    # create placeholder for translated text
    shape = (1, max_tokens_dest)
    decoder_input_data = np.zeros(shape=shape, dtype=np.int)
    
    # set helper variables
    token_int = token_start
    output_text = ''
    count_tokens = 0
    
    while token_int != token_end and count_tokens < max_tokens_dest:
                   
        # decoder input, initially hust the "ssss" marker as a token
        decoder_input_data[0, count_tokens] = token_int
            
        # wrap data for clearity
        x_data = {'decoder_initial_state': initial_state, 'decoder_input': decoder_input_data}
            
        # run decoder to predict next word
        decoder_output = model_decoder.predict(x_data)
            
        # get the last predicted token
        token_onehot = decoder_output[0, count_tokens, :]
        
        # convert to an integer token, the index in the one-hot
        token_int = np.argmax(token_onehot)
            
        # lookup the word in plain letters
        sampled_word = index_to_word_src[token_int]
            
        # Append the word to the output-text.
        output_text += " " + sampled_word
            
        # Increment the token-counter
        count_tokens += 1
            
    # Sequence of tokens output by the decoder.
    output_tokens = decoder_input_data[0]

    # Print the input-text.
    print("Input text:")
    print(input_text)
    print()

    # Print the translated output-text.
    print("Translated text:")
    print(output_text)
    print()

    # Optionally print the true translated text.
    if true_output_text is not None:
        print("True output text:")
        print(true_output_text)
        print()

In [51]:
idx = 2
translate(input_text=data_src[idx],
          true_output_text=data_dest[idx])

Input text:
Som De kan se, indfandt det store "år 2000-problem" sig ikke. Til gengæld har borgerne i en del af medlemslandene været ramt af meget forfærdelige naturkatastrofer.

Translated text:
 tilladelsesforbehold tilladelsesforbehold retsorden overvejelse tjenestemændenes tjenestemændenes begået begået begået medtages medtages begået begået begået etiske medtages begået begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske etiske begået etiske

True output text:
ssss Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. eeee

