# Machine Translation in `Python`

(Source Language) $\longrightarrow$ (Target Language)  

One-hot encoded vectors:  

* a sparse vector of ones and zeros
    * 1: token is present
    * 0: token is not present
* vector length is determined by the size of the vocabulary
    * vocabulary = set of tokens in dataset
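
As a minimal sketch of the idea above (plain Python, no `Keras`), a one-hot vector can be built directly from a word-to-index mapping. The helper name `one_hot` and the toy vocabulary are illustrative assumptions, not part of the notebook.

```python
# Illustrative sketch: build one-hot vectors by hand.
# `one_hot` is a hypothetical helper, not a Keras function.
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Toy vocabulary: the set of tokens in the (tiny) dataset
vocab = sorted({"I", "like", "cats"})
word2index = {word: i for i, word in enumerate(vocab)}

# The vector length equals the vocabulary size
vectors = {word: one_hot(i, len(vocab)) for word, i in word2index.items()}
for word, vec in vectors.items():
    print(word, vec)
```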

In [14]:
# mapping that contains words and their corresponding indices
word2index = { 'I':0, 'like':1, 'cats':2 }
# converting words to IDs or indices
words = [ 'I', 'like', 'cats' ]
word_ids = [ word2index[w] for w in words ]
print( word_ids )

[0, 1, 2]


In [15]:
# one-hot encoding with keras
from tensorflow.keras.utils import to_categorical

onehot_1 = to_categorical( word_ids, num_classes=5 )
print( [ (w,ohe.tolist()) for w,ohe in zip( words, onehot_1 )])

[('I', [1.0, 0.0, 0.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0, 0.0, 0.0]), ('cats', [0.0, 0.0, 1.0, 0.0, 0.0])]


In [116]:
# exploring the `to_categorical()` function
def compute_onehot_length(words, word2index):
  # Create word IDs for words
  word_ids = [word2index[w] for w in words]
  # Convert word IDs to onehot vectors
  onehot = to_categorical(word_ids)
  # Return the length of a single one-hot vector
  return onehot.shape[1]

word2index = {"He":0, "drank": 1, "milk": 2}
# Compute and print onehot length of a list of words
print(compute_onehot_length(['He','drank','milk'], word2index))

3


In [117]:
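# without num_classes, the vector length is inferred as (largest word ID + 1), so it depends on the words passed in;
# use the num_classes parameter to fix the length of the vectors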
word2index = {'He': 6,'I': 0,'We': 3,'cats': 2,'dogs': 5,'hates': 7,'like': 4,'rabbits': 8}
words_1 = ["I", "like", "cats", "We", "like", "dogs", "He", "hates", "rabbits"]
# Call compute_onehot_length on words_1
length_1 = compute_onehot_length(words_1, word2index)

words_2 = ["I", "like", "cats", "We", "like", "dogs", "We", "like", "cats"]
# Call compute_onehot_length on words_2
length_2 = compute_onehot_length(words_2, word2index)

# Print length_1 and length_2
print("length_1 =>", length_1, " and length_2 => ", length_2)

length_1 => 9  and length_2 =>  6
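
The two lengths differ because, when `num_classes` is not given, `to_categorical()` infers the vector length from the largest ID it sees (largest ID + 1). The sketch below mimics that inference in plain Python; `inferred_length` is a hypothetical helper written for illustration, not part of `Keras`.

```python
# Hypothetical helper that mimics how the one-hot length is inferred
# when `num_classes` is omitted: largest ID + 1.
def inferred_length(words, word2index):
    word_ids = [word2index[w] for w in words]
    return max(word_ids) + 1

word2index = {'He': 6, 'I': 0, 'We': 3, 'cats': 2, 'dogs': 5,
              'hates': 7, 'like': 4, 'rabbits': 8}
words_1 = ["I", "like", "cats", "We", "like", "dogs", "He", "hates", "rabbits"]
words_2 = ["I", "like", "cats", "We", "like", "dogs", "We", "like", "cats"]

length_1 = inferred_length(words_1, word2index)  # largest ID is 8 ('rabbits')
length_2 = inferred_length(words_2, word2index)  # largest ID is 5 ('dogs')
print(length_1, length_2)
```

Passing `num_classes=len(word2index)` to `to_categorical()` gives every call the same vector length, regardless of which words appear in the input.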


<br>

### Encoder Decoder Model

A machine translation model first consumes the words of the source language sequentially, and then sequentially predicts the corresponding words in the target language.  

Input $\longrightarrow$ **Encoder Model** $\longrightarrow$ Context Vector $\longrightarrow$  **Decoder Model** $\longrightarrow$ Output

**Writing the Encoder**  

    def words2onehot( word_list, word2index ):
        word_ids = [word2index[w] for w in word_list]
        onehot = to_categorical( word_ids, 3 )
        return onehot
        
    def encoder( onehot ):
        word_ids = np.argmax( onehot, axis=1 )
        return word_ids 
        
    onehot = words2onehot( ['I', 'like', 'cats'], word2index )
    context = encoder( onehot )
    print( context )
    
**Writing the Decoder**  

    def decoder( context_vector ):
        word_ids_rev = context_vector[::-1]
        onehot_rev = to_categorical( word_ids_rev, 3 )
        return onehot_rev
        
    def onehot2words( onehot, index2word ):
        ids = np.argmax( onehot, axis=1 )
        return [index2word[i] for i in ids]
        
    onehot_rev = decoder( context )
    reversed_words = onehot2words( onehot_rev, index2word )
    print( reversed_words )
    
<br>

In [118]:
# The encoder

import numpy as np

word2index = {'I': 0, 'cats': 2, 'like': 1}

def words2onehot(word_list, word2index):
  # Convert words to word IDs
  word_ids = [word2index[w] for w in word_list]
  # Convert word IDs to onehot vectors and return the onehot array
  onehot = to_categorical(word_ids, num_classes=3)
  return onehot

words = ["I", "like", "cats"]
# Convert words to onehot vectors using words2onehot
onehot = words2onehot(words, word2index)
# Print the result as (<word>, <onehot>) tuples
print([(w,ohe.tolist()) for w,ohe in zip(words, onehot)])

[('I', [1.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0]), ('cats', [0.0, 0.0, 1.0])]


In [119]:
# Encoder: Text reversing model
def encoder(onehot):
  # Get word IDs from onehot vectors and return the IDs
  word_ids = np.argmax(onehot, axis=1)
  return word_ids

# Define "We like dogs" as words
words = ['We','like','dogs']
# Define the word2index dict
word2index = {'We': 0, 'dogs': 2, 'like': 1}

# Convert words to onehot vectors using words2onehot
onehot = words2onehot(words, word2index)
# Get the context vector by using the encoder function
context = encoder(onehot)
print(context)

[0 1 2]


In [120]:
index2word = {0: 'We', 1: 'like', 2: 'dogs'}
# Implementing the Decoder
# Define the onehot2words function that returns words for a set of onehot vectors
def onehot2words(onehot, index2word):
  ids = np.argmax(onehot, axis=1)
  res = [index2word[id] for id in ids]
  return res
# Define the decoder function that returns reversed onehot vectors
def decoder(context_vector):
  word_ids_rev = context_vector[::-1]
  onehot_rev = to_categorical(word_ids_rev, num_classes=3)
  return onehot_rev
# Convert context to reversed onehot vectors using decoder
onehot_rev = decoder(context)
# Get the reversed words using the onehot2words function
reversed_words = onehot2words(onehot_rev, index2word)
print(reversed_words)

['dogs', 'like', 'We']


<br>

### Understanding Sequential Models

**Time Series inputs and Sequential Models**  

* sentences as time series input
    * current word is affected by the previous words
* The encoder/decoder uses a machine learning model called a **sequential model**
    * a sequential model can learn from time series inputs 
    
**Gated Recurrent Unit (GRU)** - a chain of GRU units, each of which takes in one input and passes a hidden state to the next unit until the whole sequence is processed. The hidden state at each unit represents the 'memory' of what the model has seen so far.  
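
To make the "hidden state as memory" idea concrete, here is a heavily simplified sketch of a GRU with a single input feature and a single hidden unit, written with scalar weights in plain Python. The weight values and the helper names (`gru_step`, `run_gru`) are arbitrary assumptions for illustration. Real layers such as `keras.layers.GRU` use weight matrices learned during training, and implementations differ in which side of the update gate keeps the old state.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One GRU step for a scalar input x and a scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])               # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])               # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    # Blend the old state with the candidate state
    return z * h + (1.0 - z) * h_cand

def run_gru(sequence, params, h0=0.0):
    """Feed the sequence one item at a time, passing the hidden state along."""
    h, states = h0, []
    for x in sequence:
        h = gru_step(x, h, params)
        states.append(h)
    return states

# Arbitrary (untrained) weights, chosen only for the demonstration
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 0.8, "uh": 0.4, "bh": 0.0}
states = run_gru([1.0, 0.0, -1.0, 0.5], params)
print(states)  # one hidden state per time step; the last one summarizes the sequence
```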

**`Keras` (functional API) refresher**  

* `Keras` has two important objects: `Layer` and `Model` objects
* Input Layer
    * `inp = keras.layers.Input( shape = (...))`
* Hidden Layer
    * `layer = keras.layers.GRU(...)`
* Output
    * `out = layer( inp )`
* Model
    * `model = Model( inputs=inp, outputs=out )`
    
**Understanding the Shape of the Data**  
* Sequence data is 3-dimensional
    1. **batch dimension** - the number of sequences
    2. **time dimension** - the length of the sequences
    3. **input dimension** - the length of the one-hot vector (vocabulary size)
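
As a small illustration of these three dimensions, the sketch below builds a batch of one-hot encoded sentences as nested Python lists and reports its shape. The toy sentences, the vocabulary and the `shape_of` helper are assumptions made for this example.

```python
# Toy batch: 2 sentences (batch dimension), 3 words each (time dimension)
sentences = [["I", "like", "cats"], ["We", "like", "dogs"]]
vocab = sorted({w for sent in sentences for w in sent})
word2index = {w: i for i, w in enumerate(vocab)}

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

# batch[b][t] is the one-hot vector of word t in sentence b
batch = [[one_hot(word2index[w], len(vocab)) for w in sent] for sent in sentences]

def shape_of(nested):
    """Hypothetical helper: shape of a regular (non-ragged) nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

print(shape_of(batch))  # (batch dimension, time dimension, input dimension)
```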
    
**Implementing GRUs with `Keras`**  

Defining `Keras` layers:  

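    inp = keras.layers.Input( batch_shape=( batchdim, timedim, inputdim ) )
    # for a model that takes an arbitrary number of samples, leave out batchdim:
    # inp = keras.layers.Input( shape=( timedim, inputdim ) )
    # return_state=True also returns the final hidden state
    gru_out, gru_state = keras.layers.GRU( 10, return_state=True )(inp)
    # alternatively, return_sequences=True returns the output at every time step:
    gru_out = keras.layers.GRU( 10, return_sequences=True )(inp)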
    
Defining a `Keras` model:  

    model = keras.models.Model( inputs=inp, outputs=gru_out )

Predicting with the `Keras` model:  

    x = np.random.normal( size = ( batchdim, timedim, inputdim ) )
    y = model.predict( x )
    print( "shape (y) =", y.shape, "\ny =\n", y )
    


In [121]:
#implement a simple model that has an input layer and a GRU layer. 
#You will then use the model to produce output values for a random input array.

import tensorflow.keras as keras
import numpy as np
# Define an input layer
inp = keras.layers.Input(batch_shape=(2,3,4))
# Define a GRU layer that takes in the input
gru_out = keras.layers.GRU(10)(inp)

# Define a model that outputs the GRU output
model = keras.models.Model(inputs=inp, outputs=gru_out)

x = np.random.normal(size=(2,3,4))
# Get the output of the model and print the result
y = model.predict(x)
print("shape (y) =", y.shape, "\ny = \n", y)

shape (y) = (2, 10) 
y = 
 [[ 0.2810379   0.03663985 -0.02324818 -0.0489272   0.04935639 -0.04420675
   0.02311645  0.20025104  0.00613629  0.00435218]
 [-0.19966927 -0.2918625   0.21411544 -0.04881026 -0.19061796  0.10403307
  -0.21819672 -0.131522    0.06279384  0.41709882]]


In [122]:
#see how you can use Keras models to accept arbitrary sized batches of inputs

# Define an input layer
inp = keras.layers.Input(shape=(3,4))
# Define a GRU layer that takes in the input
gru_out = keras.layers.GRU(10)(inp)
# Define a model that outputs the GRU output
model = keras.models.Model(inputs=inp, outputs=gru_out)

x1 = np.random.normal(size=(2,3,4))
x2 = np.random.normal(size=(5,3,4))

# Get the output of the model and print the result
y1 = model.predict(x1)
y2 = model.predict(x2)
print("shape (y1) = ", y1.shape, " shape (y2) = ", y2.shape)

shape (y1) =  (2, 10)  shape (y2) =  (5, 10)


In [123]:
# Define the Input layer
inp = keras.layers.Input(batch_shape=(3,25,5))
# Define a GRU layer that takes in inp as the input
gru_out1 = keras.layers.GRU(10)(inp)
print("gru_out1.shape = ", gru_out1.shape)

# Define the second GRU and print the shape of the outputs
gru_out2, gru_state = keras.layers.GRU(10, return_state=True)(inp)
print("gru_out2.shape = ", gru_out2.shape)
print("gru_state.shape = ", gru_state.shape)

# Define the third GRU layer which will return all the outputs
gru_out3 = keras.layers.GRU(10, return_sequences=True)(inp)
print("gru_out3.shape = ", gru_out3.shape)

gru_out1.shape =  (3, 10)
gru_out2.shape =  (3, 10)
gru_state.shape =  (3, 10)
gru_out3.shape =  (3, 25, 10)


<br>

## Implementing the Encoder/Decoder Model with `Keras`

### Implementing the Encoder

Understanding the Data:  

In [4]:
with open( 'vocab_fr.txt' ) as f:
    fr_text = f.readlines()
    
with open( 'vocab_en.txt' ) as f:
    en_text = f.readlines()

In [125]:
for en_sent, fr_sent in zip( en_text[:3], fr_text[:3]):
    print( 'English: ', en_sent )
    print( 'French: ', fr_sent )

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .

French:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .

English:  the united states is usually chilly during july , and it is usually freezing in november .

French:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

English:  california is usually quiet during march , and it is usually hot in june .

French:  california est généralement calme en mars , et il est généralement chaud en juin .



<br>

### Tokenizing the Sentences

Now to look at some of the attributes of the dataset.  
**Tokenization** - the process of breaking a sentence/phrase to individual tokens  

In [126]:
first_sent = en_text[0]
print( 'first sentence: ', first_sent )
first_words = first_sent.split(' ')
print( '\tWords: ', first_words )

first sentence:  new jersey is sometimes quiet during autumn , and it is snowy in april .

	Words:  ['new', 'jersey', 'is', 'sometimes', 'quiet', 'during', 'autumn', ',', 'and', 'it', 'is', 'snowy', 'in', 'april', '.\n']
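
Note the final token `'.\n'` in the output above: lines read with `readlines()` keep their trailing newline, so splitting on a single space leaves it attached to the last token. A minimal sketch of one way to avoid this, assuming plain whitespace tokenization is acceptable, is to call `split()` without an argument:

```python
# A line as returned by readlines(): note the trailing newline
first_sent = "new jersey is sometimes quiet during autumn , and it is snowy in april .\n"

# Splitting on ' ' keeps the newline glued to the last token
naive_tokens = first_sent.split(' ')

# split() with no argument splits on any run of whitespace and drops the newline
clean_tokens = first_sent.split()

print(naive_tokens[-1] == '.\n', clean_tokens[-1] == '.')
```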


<br>

**Computing the average length of sentences**

In [127]:
sent_length = [len(text.split(' ')) for text in en_text]
mean_en_length = np.mean( sent_length )
print( 'ENGLISH mean sentence length = ', mean_en_length)

sent_length = [len(text.split(' ')) for text in fr_text]
mean_fr_length = np.mean( sent_length )
print( 'FRENCH mean sentence length = ', mean_fr_length)

ENGLISH mean sentence length =  13.225678224285508
FRENCH mean sentence length =  14.226737269693892


In [128]:
all_words = []
[all_words.extend( sent.split(' ')) for sent in en_text]
en_vocab_size = len( set( all_words ) )
print( 'ENGLISH vocab size = ', en_vocab_size )

all_words = []
[all_words.extend( sent.split(' ')) for sent in fr_text]
fr_vocab_size = len( set( all_words ) )
print( 'FRENCH vocab size = ', fr_vocab_size )

ENGLISH vocab size =  228
FRENCH vocab size =  357


<br>

**Implementing the Encoder with `Keras`**  

Input Layer:  

    en_inputs = Input( shape=(en_len, en_vocab))
    
GRU Layer:  

    en_gru = GRU( hsize, return_state=True )
    en_out, en_state = en_gru( en_inputs )
    
`Keras` Model:  

    encoder = Model( inputs=en_inputs, outputs=en_state )
    encoder.summary()  # summary() prints the table itself and returns None
    
<br>

In [129]:
# defining the Encoder

import tensorflow.keras as keras

en_len = 15
en_vocab = 228
hsize = 48

# Define an input layer
en_inputs = keras.layers.Input(shape=(en_len, en_vocab))
# Define a GRU layer which returns the state
en_gru = keras.layers.GRU(hsize, return_state = True)
# Get the output and state from the GRU
en_out, en_state = en_gru(en_inputs)
# Define and print the model summary
encoder = keras.models.Model(inputs=en_inputs, outputs=en_state)
print(encoder.summary() )

Model: "model_26"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_25 (InputLayer)        [(None, 15, 228)]         0         
_________________________________________________________________
gru_26 (GRU)                 [(None, 48), (None, 48)]  40032     
Total params: 40,032
Trainable params: 40,032
Non-trainable params: 0
_________________________________________________________________
None
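
The `Param #` value in the summary can be checked by hand. The sketch below assumes the layer uses the TensorFlow 2 default `reset_after=True`, under which each of the three gate blocks (update, reset, candidate) has an input kernel, a recurrent kernel and two bias vectors. `gru_param_count` is a hypothetical helper, not a `Keras` function.

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Number of trainable weights in a GRU layer (illustrative helper)."""
    kernel = input_dim * units                # input-to-hidden weights, per gate block
    recurrent = units * units                 # hidden-to-hidden weights, per gate block
    bias = (2 if reset_after else 1) * units  # bias vector(s), per gate block
    return 3 * (kernel + recurrent + bias)    # update, reset and candidate blocks

# Encoder from the cell above: 228-dimensional one-hot inputs, 48 hidden units
print(gru_param_count(input_dim=228, units=48))
```

With `input_dim=228` and `units=48` this gives 3 × (10944 + 2304 + 96) = 40032, which matches the summary.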


<br>

### Implementing the Decoder

**Encoder-Decoder Model**  

* Encoder consumes the English words one-by-one
* Finally produces the context vector
* Decoder takes the context vector as the initial state
* Decoder produces French words one-by-one
* Decoder is implemented using a `Keras` GRU layer. A GRU requires two inputs:
    1. a time series input
    2. a hidden state

How to produce the time series input for the GRU layer?  

1. repeat the context vector from the encoder N times
    * e.g., to produce a French sentence of 10 words, you repeat the context vector 10 times. 
    
Understanding the `RepeatVector` layer:  

* takes one argument which defines the sequence length of the required output
* takes in an input of (batch_size, input_size)
* output data will have the shape ( batch_size, sequence_length, input_size )
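
The shape transformation described above can be imitated in plain Python. This is only a sketch of the behavior; `repeat_vector` is a hypothetical stand-in for the `RepeatVector` layer, operating on nested lists instead of tensors.

```python
def repeat_vector(batch, n):
    """Repeat each vector in `batch` n times along a new time axis."""
    return [[list(vec) for _ in range(n)] for vec in batch]

x = [[0, 1, 2], [3, 4, 5]]   # shape (batch_size=2, input_size=3)
y = repeat_vector(x, 5)      # shape (batch_size=2, sequence_length=5, input_size=3)

print(len(y), len(y[0]), len(y[0][0]))
```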

**Defining a `RepeatVector` layer**  

    from tensorflow.keras.layers import Input, RepeatVector
    from tensorflow.keras.models import Model
    rep = RepeatVector( 5 )
    
    r_inp = Input( shape=( 3, ) )
    r_out = rep( r_inp )
    
    repeat_model = Model( inputs=r_inp, outputs=r_out )
    
**Predicting with the Model**  

    x = np.array( [ [0,1,2], [3,4,5] ] )
    y = repeat_model.predict( x )
    print( 'x.shape = ', x.shape, '\ny.shape = ', y.shape )

**Implementing the Decoder**  

    de_inputs = RepeatVector( fr_len )( en_state )
    decoder_gru = GRU( hsize, return_sequences=True )
    gru_outputs = decoder_gru( de_inputs, initial_state=en_state )

**Defining the Model**  

    enc_dec = Model( inputs= en_inputs, outputs = gru_outputs )
    
<br>

In [130]:
# explore how the RepeatVector layer works
from tensorflow.keras.layers import Input, RepeatVector
from tensorflow.keras.models import Model
import numpy as np

inp = Input(shape=(2,))
# Define a RepeatVector that repeats the input 6 times
rep = RepeatVector(6)(inp)
# Define a model
model = Model(inputs=inp, outputs=rep)
# Define input x
x = np.array([[0,1], [2,3]])
# Get model prediction y
y = model.predict( x )
print('x.shape = ',x.shape,'\ny.shape = ',y.shape)


x.shape =  (2, 2) 
y.shape =  (2, 6, 2)


In [131]:
# implement the decoder and define an end-to-end model going from encoder inputs to the decoder GRU outputs. 

hsize = 48
fr_len = 15
# Define a RepeatVector layer
de_inputs = RepeatVector(fr_len)(en_state)
# Define a GRU model that returns all outputs
decoder_gru = keras.layers.GRU(hsize, return_sequences=True)
# Get the outputs of the decoder
gru_outputs = decoder_gru(de_inputs, initial_state=en_state)
# Define a model with the correct inputs and outputs
enc_dec = Model(inputs=en_inputs, outputs=gru_outputs)
enc_dec.summary()

Model: "model_28"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_25 (InputLayer)           [(None, 15, 228)]    0                                            
__________________________________________________________________________________________________
gru_26 (GRU)                    [(None, 48), (None,  40032       input_25[0][0]                   
__________________________________________________________________________________________________
repeat_vector_7 (RepeatVector)  (None, 15, 48)       0           gru_26[0][1]                     
__________________________________________________________________________________________________
gru_27 (GRU)                    (None, 15, 48)       14112       repeat_vector_7[0][0]            
                                                                 gru_26[0][1]              

<br>

### Dense and TimeDistributed Layers 

Introduction to the **Dense Layer** - a dense layer can be used to implement a fully-connected layer of a neural network.  

* Dense Layer takes an input vector and converts it to a probabilistic prediction
    * y = Weights.x + Bias (with a softmax activation applied to turn the result into probabilities)  
    
Defining a Dense layer with `Keras`:

    dense = keras.layers.Dense( vocab_size, activation = 'softmax' )
    inp = Input( shape=( vocab_size, ) )
    pred = dense( inp )
    model = Model( inputs=inp, outputs=pred )
    
Defining a Dense layer with custom initialization:

    from tensorflow.keras.initializers import RandomNormal
    init = RandomNormal( mean = 0.0, stddev = 0.05, seed = 6000 )
    dense = Dense( vocab_size, activation='softmax', kernel_initializer=init, bias_initializer=init )
    
Inputs and outputs of a Dense Layer:  

* Dense softmax layer
    * takes a (batch_size, input_size) array
    * produces a ( batch_size, num_classes ) array
    * output for each sample is a probability distribution over the classes which sums to 1
    * you can get the class of each sample using `np.argmax(y, axis=-1)`
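
To see what the softmax output looks like, here is a minimal plain-Python sketch of a dense softmax layer applied to a single sample. The weights and bias are arbitrary illustrative values, and `dense_softmax` is a hypothetical helper rather than the `Keras` implementation.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(x, weights, bias):
    """y = softmax(W . x + b) for one input vector x."""
    scores = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
              for row, b_i in zip(weights, bias)]
    return softmax(scores)

# Arbitrary weights: 3 classes (rows) x 3 input features (columns)
weights = [[0.2, -0.1, 0.0],
           [0.0, 0.3, 0.1],
           [-0.2, 0.1, 0.4]]
bias = [0.0, 0.1, -0.1]

probs = dense_softmax([5, 0, 1], weights, bias)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # same role as np.argmax
print(probs, predicted_class)
```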
    
Use a `TimeDistributed` layer as a wrapper for a `Dense` layer  

    dense_time = TimeDistributed( Dense( vocab_size, activation='softmax' ) )
    inp = Input( shape = ( sequence_len, input_size ) )
    pred = dense_time( inp )
    model = Model( inputs=inp, outputs=pred )
    
A `TimeDistributed` layer takes a (batch_size, sequence_len, input_size) array $\longrightarrow$ and produces a ( batch_size, sequence_len, num_classes ) array  

You can get the class of each sample at each time step using `np.argmax( y, axis=-1 )`
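
Conceptually, a `TimeDistributed` wrapper applies the same layer, with the same weights, to every time step independently. A rough plain-Python sketch of that idea follows. The `time_distributed` helper and the toy `scale_to_probs` function standing in for the wrapped layer are assumptions for illustration only.

```python
def time_distributed(layer_fn, batch):
    """Apply `layer_fn` to every time step of every sequence in the batch."""
    return [[layer_fn(step) for step in sequence] for sequence in batch]

def scale_to_probs(vec):
    """Toy stand-in for a softmax Dense layer: normalise positive scores to sum to 1."""
    total = sum(vec)
    return [v / total for v in vec]

# Shape (batch_size=2, sequence_len=2, input_size=3)
x = [[[5, 1, 1], [1, 1, 2]],
     [[1, 3, 1], [1, 4, 1]]]
y = time_distributed(scale_to_probs, x)

# Most probable class per sample and time step (same role as np.argmax(y, axis=-1))
classes = [[max(range(len(p)), key=p.__getitem__) for p in seq] for seq in y]
print(classes)
```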

Iterating through time-distributed data:

    classes = np.argmax( y, axis=-1 )
    for t in range( sequence_len ):
        for prob, c in zip( y[:,t,:], classes[:,t] ):
            print( "Prob: ", prob, ", Class: ", c )
            
<br>

In [132]:
init = keras.initializers.RandomNormal( mean = 0.0, stddev = 0.05, seed = 6000 )
# Define an input layer with batch size 3 and input size 3
inp = Input(batch_shape = (3,3))
# Get the output of the 3 node Dense layer
pred = keras.layers.Dense(3, activation='softmax', kernel_initializer=init, bias_initializer=init)(inp)
model = Model(inputs=inp, outputs=pred)

names = ["Mark", "John", "Kelly"]
prizes = ["Gift voucher", "Car", "Nothing"]
x = np.array([[5, 0, 1], [0, 3, 1], [2, 2, 1]])
# Compute the model prediction for x
y = model.predict(x)
# Get the most probable class for each sample
classes = np.argmax(y, axis=-1)
print("\n".join(["{} has probabilities {} and wins {}".format(n,p,prizes[c]) \
                 for n,p,c in zip(names, y, classes)]))

Mark has probabilities [0.3929537  0.37995604 0.22709025] and wins Gift voucher
John has probabilities [0.33233336 0.34169823 0.32596847] and wins Car
Kelly has probabilities [0.35587627 0.35802534 0.28609842] and wins Car


In [133]:
names = [['Mark', 'John', 'Kelly'], ['Jenny', 'Shan', 'Sarah']]
x = np.array([[[5, 0, 1],[1, 1, 0]],
           [[0, 3, 1],[0, 4, 0]],
           [[2, 2, 1],[6, 0, 1]]])
# Print names and x
print('names=\n',names, '\nx=\n',x, '\nx.shape=', x.shape)
inp = Input(shape=(2, 3))
# Create the TimeDistributed layer (the output of the Dense layer)
dense_time = keras.layers.TimeDistributed(keras.layers.Dense(3, activation='softmax', kernel_initializer=init, bias_initializer=init))
pred = dense_time(inp)
model = Model(inputs=inp, outputs=pred)

y = model.predict(x)
# Get the most probable class for each sample
classes = np.argmax(y, axis=-1)
for t in range(2):
  # Get the t-th time-dimension slice of y and classes
  for n, p, c in zip(names[t], y[:, t, :], classes[:, t]):
    print("Game {}: {} has probs {} and wins {}\n".format(t+1,n,p,prizes[c]))

names=
 [['Mark', 'John', 'Kelly'], ['Jenny', 'Shan', 'Sarah']] 
x=
 [[[5 0 1]
  [1 1 0]]

 [[0 3 1]
  [0 4 0]]

 [[2 2 1]
  [6 0 1]]] 
x.shape= (3, 2, 3)
Game 1: Mark has probs [0.3929537  0.37995604 0.22709025] and wins Gift voucher

Game 1: John has probs [0.33233336 0.34169823 0.32596847] and wins Car

Game 1: Kelly has probs [0.35587627 0.35802534 0.28609842] and wins Car

Game 2: Jenny has probs [0.34050465 0.3426381  0.31685725] and wins Car

Game 2: Shan has probs [0.3069249  0.32335538 0.36971974] and wins Nothing

Game 2: Sarah has probs [0.3994818  0.38477215 0.21574609] and wins Gift voucher



<br>

### Implementing the Full Encoder/Decoder Model

The decoder still needs a top part that turns the GRU outputs into word predictions.  
This is implemented with a `Dense` layer wrapped in a `TimeDistributed` layer.

![](encoder_decoder.png)  

Implementing the full model:  

    # The softmax prediction layer
    de_dense = keras.layers.Dense( fr_vocab_size, activation='softmax' )
    de_dense_time = keras.layers.TimeDistributed( de_dense )
    de_pred = de_dense_time( gru_outputs )
    
    # Defining the full model
    nmt = keras.models.Model( inputs = en_inputs, outputs = de_pred )
    
    # Compiling the model
    nmt.compile( optimizer='adam', loss='categorical_crossentropy', metrics=['acc'] )
    
<br>

In [134]:
fr_vocab_size = 228
# Import Dense and TimeDistributed layers
from tensorflow.keras.layers import Dense, TimeDistributed
# Define a softmax dense layer that has fr_vocab outputs
de_dense = Dense(fr_vocab_size, activation='softmax')
# Wrap the dense layer in a TimeDistributed layer
de_dense_time = TimeDistributed(de_dense)
# Get the final prediction of the model
de_pred = de_dense_time(gru_outputs)
print("Prediction shape: ", de_pred.shape)

Prediction shape:  (None, 15, 228)


In [135]:
from tensorflow.keras.models import Model
# Define a model with encoder input and decoder output
nmt = Model(inputs=en_inputs, outputs=de_pred)

# Compile the model with an optimizer and a loss
nmt.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

# View the summary of the model 
nmt.summary()

Model: "model_31"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_25 (InputLayer)           [(None, 15, 228)]    0                                            
__________________________________________________________________________________________________
gru_26 (GRU)                    [(None, 48), (None,  40032       input_25[0][0]                   
__________________________________________________________________________________________________
repeat_vector_7 (RepeatVector)  (None, 15, 48)       0           gru_26[0][1]                     
__________________________________________________________________________________________________
gru_27 (GRU)                    (None, 15, 48)       14112       repeat_vector_7[0][0]            
                                                                 gru_26[0][1]              

<br>

## Training and Generating Translations

### Preprocessing Data

Another look at the data:

In [136]:
for en_sent, fr_sent in zip( en_text[:3], fr_text[:3]):
    print( 'English sent: ', en_sent )
    print( 'French sent: ', fr_sent )

English sent:  new jersey is sometimes quiet during autumn , and it is snowy in april .

French sent:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .

English sent:  the united states is usually chilly during july , and it is usually freezing in november .

French sent:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

English sent:  california is usually quiet during march , and it is usually hot in june .

French sent:  california est généralement calme en mars , et il est généralement chaud en juin .



<br>

**1st Step: Word Tokenization**  - the process of breaking a sentence/phrase to individual words/characters  

Using the `Tokenizer()` object in `Keras`:  

* learns the mapping from word to word ID using a given corpus
* Can be used to convert a given string to a sequence of IDs  

Instantiating a `Tokenizer()`:  

    from tensorflow.keras.preprocessing.text import Tokenizer
    en_tok = Tokenizer()
    
Fitting the Tokenizer on data

    en_tok = Tokenizer()
    en_tok.fit_on_texts( en_text )
    
    # getting the word to ID mapping
    id = en_tok.word_index[ "january" ]
    
    # getting the ID to word mapping
    w = en_tok.index_word[ 51 ]
    
    # Transforming sentences to sequences:  
    seq = en_tok.texts_to_sequences( [ 'she likes grapefruit, peaches, and lemons .' ] )  
    
Limiting the size of the vocabulary - you should not leave the tokenizer to do everything automatically. Left at its defaults, it learns every word in the dataset, including many rare words that occur too seldom to improve the model. Setting `num_words` keeps only the most frequent words. **Out-of-vocabulary (OOV) words** - words that are rare or not present in the training set - are dropped by the tokenizer unless an `oov_token` is defined, in which case they are replaced by that token.

    tok = Tokenizer( num_words = 50 )
    
    # defining OOV tokens
    tok = Tokenizer( num_words=50, oov_token='UNK' )
    
<br>    

In [35]:
# tokenizing sentences with Keras

from tensorflow.keras.preprocessing.text import Tokenizer

# Define a Keras Tokenizer
en_tok = Tokenizer()
fr_tok = Tokenizer()

# Fit the tokenizer on some text
en_tok.fit_on_texts( en_text )
fr_tok.fit_on_texts( fr_text )

for w in ["january", "apples", "summer"]:
  # Get the word ID of word w
  id = en_tok.word_index[w]
  # Print the word and the word ID
  print(w, " has id: ", id)

january  has id:  36
apples  has id:  75
summer  has id:  46


In [36]:
# controlling the vocabulary with the Tokenizer
# convert an arbitrary sentence to a sequence using a trained Tokenizer

# Convert the sentence to a word ID sequence
seq = en_tok.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence: ', seq)

# Define a tokenizer with vocabulary size 100 and oov_token 'UNK'
en_tok_new = Tokenizer(num_words=100, oov_token='UNK')

# Fit the tokenizer on en_text
en_tok_new.fit_on_texts(en_text)

# Convert the sentence to a word ID sequence
seq_new = en_tok_new.texts_to_sequences(['she likes grapefruit , peaches , and lemons .'])
print('Word ID sequence (with UNK): ', seq_new)
print('The ID 1 represents the word: ', en_tok_new.index_word[1])

Word ID sequence:  [[27, 70, 28, 76, 7, 72]]
Word ID sequence (with UNK):  [[28, 71, 29, 77, 8, 73]]
The ID 1 represents the word:  UNK
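
The shift by one between the two ID sequences above is consistent with the OOV token being assigned ID 1, which pushes every other word's ID up by one. The sketch below imitates this numbering in plain Python under simplifying assumptions (whitespace tokenization, ties broken by first appearance, ID 0 reserved for padding). `build_word_index` and `to_sequence` are hypothetical helpers, not the `Keras` `Tokenizer` itself.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Assign IDs by descending frequency; ID 0 is left free for padding."""
    counts = Counter(w for text in texts for w in text.lower().split())
    words = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        words = [oov_token] + words      # the OOV token takes ID 1
    return {w: i + 1 for i, w in enumerate(words)}

def to_sequence(text, word_index, num_words=None, oov_token=None):
    """Map words to IDs; words outside the kept vocabulary become OOV or are dropped."""
    seq = []
    for w in text.lower().split():
        idx = word_index.get(w)
        if idx is not None and (num_words is None or idx < num_words):
            seq.append(idx)
        elif oov_token is not None:
            seq.append(word_index[oov_token])
    return seq

corpus = ["she likes apples", "she likes lemons", "he likes apples"]
plain = build_word_index(corpus)
with_unk = build_word_index(corpus, oov_token="UNK")

# Without an OOV token, the unknown word "grapes" is silently dropped
print(to_sequence("she likes grapes", plain))
# With an OOV token, it is replaced by the OOV ID instead
print(to_sequence("she likes grapes", with_unk, num_words=4, oov_token="UNK"))
```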


<br>

### Processing the Text

1. Adding special starting/ending tokens to target sentences
2. Padding the sentences such that they all have the same length
3. Reversing sentences - helps to make a stronger connection between the encoder & decoder
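
The three steps can be sketched on plain Python strings and word-ID lists. The helper names (`add_sos_eos`, `pad_post`, `reverse`), the toy sentence and the use of 0 as the padding ID are assumptions for this illustration; the notebook itself relies on `pad_sequences` from `Keras`.

```python
def add_sos_eos(sentence, sos="sos", eos="eos"):
    """Step 1: wrap a target sentence in start/end tokens."""
    return " ".join([sos, sentence.strip(), eos])

def pad_post(seq, maxlen, pad_id=0):
    """Step 2: truncate to maxlen, then pad at the end with pad_id."""
    seq = list(seq)[:maxlen]
    return seq + [pad_id] * (maxlen - len(seq))

def reverse(seq):
    """Step 3: reverse the order of a (padded) source sequence."""
    return list(seq)[::-1]

print(add_sos_eos("il aime les chats .\n"))
padded = pad_post([17, 23, 1, 8], maxlen=6)
print(padded)
print(reverse(padded))
```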

An example of sentence padding:   

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
sentences = [
    'new jersey is sometimes quiet during autumn .',
    'california is never rainy during july , but it is sometimes beautiful in february .'
]

seqs = en_tok.texts_to_sequences( sentences )
preproc_text = pad_sequences( seqs, padding='post', truncating= 'post', maxlen = 12 )
for orig, padded in zip( seqs, preproc_text ):
    print( orig, ' => ', padded )

[17, 23, 1, 8, 67, 4, 39]  =>  [17 23  1  8 67  4 39  0  0  0  0  0]
[22, 1, 10, 63, 4, 43, 6, 3, 1, 8, 53, 2, 48]  =>  [22  1 10 63  4 43  6  3  1  8 53  2]


An example of sentence reversal:

In [140]:
pad_seq = list( preproc_text[1] )
pad_seq = pad_seq[::-1]
rev_sent = [ en_tok.index_word[wid] for wid in pad_seq[-6:]]
print( 'Sentences: ', sentences[1] )
print( '\tReversed: ',' '.join( rev_sent ) )

Sentences:  california is never rainy during july , but it is sometimes beautiful in february .
	Reversed:  july during rainy never is california


In [141]:
fr_text_new = []

# Loop through all sentences in fr_text
for sent in fr_text:  
  # Add sos and eos tokens using string.join
  sent_new = " ".join(['sos', sent, 'eos'])
  # Append the modified sentence to fr_text_new
  fr_text_new.append(sent_new)

    
# Print sentence after adding tokens
print("After adding tokens: ", sent_new, '\n')

After adding tokens:  sos l'orange est son fruit préféré , mais la banane est votre favori .
 eos 



In [142]:
# function to transform data conveniently to the format accepted 
#by the neural machine translation (NMT) model.
en_len = 15
en_vocab = 228

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def sents2seqs(input_type, sentences, onehot=False, pad_type='post'):
    # Convert sentences to sequences      
    encoded_text = en_tok.texts_to_sequences(sentences)
    # Pad sentences to en_len
    preproc_text = pad_sequences(encoded_text, padding=pad_type, truncating='post', maxlen=en_len)
    if onehot:
        # Convert the word IDs to onehot vectors
        preproc_text = to_categorical(preproc_text, num_classes=en_vocab)
    return preproc_text
sentence = 'she likes grapefruit , peaches , and lemons .'  
# Convert a sentence to sequence by pre-padding the sentence
pad_seq = sents2seqs('source', [sentence], pad_type='pre')
print( pad_seq )

[[ 0  0  0  0  0  0  0  0  0 27 70 28 76  7 72]]


In [11]:
# reverse sentences for the encoder model
# modify the sents2seqs function to reverse sentences
sentences = ["california is never rainy during july ."]

# Add new keyword parameter reverse which defaults to False
def sents2seqs(input_type, sentences, onehot=False, pad_type='post', reverse=False):     
    encoded_text = en_tok.texts_to_sequences(sentences)
    preproc_text = pad_sequences(encoded_text, padding=pad_type, truncating='post', maxlen=en_len)
    if reverse:
      # Reverse the text using numpy axis reversing
      preproc_text = preproc_text[:, ::-1]
    if onehot:
        preproc_text = to_categorical(preproc_text, num_classes=en_vocab)
    return preproc_text


# Call sents2seqs to get the padded and reversed sequence of IDs
pad_seq = sents2seqs('source', sentences, reverse=True)
rev_sent = [en_tok.index_word[wid] for wid in pad_seq[0][-6:]] 
print('\tReversed: ',' '.join(rev_sent))

	Reversed:  july during rainy never is california


<br>

### Training the NMT Model

Revisiting the model architecture:  

* Encoder GRU
    * consumes English words
    * outputs a context vector
* Decoder GRU
    * consumes the context vector
    * outputs a sequence of GRU outputs
* Decoder prediction layer
     * consumes the sequence of GRU outputs
     * outputs prediction probabilities for French words
     
Optimizing Model Parameters:  

* Model parameters are often represented as `W` (weights) and `b` (bias) - these are initialized with random values
* They are responsible for transforming a given input to a useful output
* They are changed over time by an optimizer to minimize a given loss
    * **Loss** - computed as the difference between:
        * the predictions (French words generated by the model)
        * the actual outputs (actual French words)
* The optimizer and the loss are passed to the model during model compilation
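
For a single predicted word, the categorical cross-entropy loss used in this notebook compares the one-hot target with the predicted probability distribution. A minimal sketch, with made-up probabilities and a hypothetical helper (the `eps` clipping value is also an assumption), is:

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(true_i * log(pred_i))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]                 # one-hot vector of the actual word
good_pred = [0.1, 0.8, 0.1]        # most probability on the correct word
bad_pred = [0.7, 0.2, 0.1]         # most probability on a wrong word

good_loss = categorical_crossentropy(y_true, good_pred)
bad_loss = categorical_crossentropy(y_true, bad_pred)
print(good_loss, bad_loss)
```

The loss is small when the model puts high probability on the correct word and grows as that probability shrinks, which gives the optimizer a signal to follow.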

Model compilation:

    nmt.compile( optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['acc'] )
    
Training the Model:  
Training the model involves iterating through the data in batches.  
**batch** - the subset of the data processed in a single iteration  
**epoch** - a single pass through the entire dataset  
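
The relationship between batches and epochs can be sketched as follows. `iter_batches` and the toy numbers are illustrative assumptions.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data`, each with at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(10))   # stand-in for 10 sentences
n_epochs, batch_size = 2, 4

for epoch in range(n_epochs):             # one epoch = one pass over all the data
    sizes = [len(batch) for batch in iter_batches(data, batch_size)]
    print("epoch", epoch + 1, "batch sizes:", sizes)
```

Note that the last batch of an epoch can be smaller than `batch_size` when the dataset size is not a multiple of it.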

Training Iterations:  

    for ei in range( n_epochs ): # single traverse through the dataset
        for i in range( 0, data_size, batch_size ): # process a single batch
        
            # obtain a batch of training data
            en_x = sents2seqs( 'source', en_text[ i:i+batch_size ], onehot=True, reverse=True )
            de_y = sents2seqs( 'target', fr_text[ i:i+batch_size ], onehot=True )
        
            # train on a single batch of data
            nmt.train_on_batch( en_x, de_y )
        
            # evaluate the model on the batch it was just trained on
            res = nmt.evaluate( en_x, de_y, batch_size=batch_size, verbose=0 )
            print( "Epoch {} => Train Loss:{}, Train Acc: {}".format( ei+1, res[0], res[1]*100.0 ) )
        
Avoiding Overfitting  

* Break the dataset into two parts:
    - Training
    - Validation
* When the validation accuracy stops increasing, stop the training
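
The stopping rule above can be sketched in plain Python. The helper `should_stop`, its `patience` parameter and the accuracy values are hypothetical and only illustrate the idea; they are not part of the notebook's training code.

```python
def should_stop(val_accs, patience=2):
    """Return True when the last `patience` epochs did not beat the earlier best accuracy."""
    if len(val_accs) <= patience:
        return False
    best_before = max(val_accs[:-patience])
    return max(val_accs[-patience:]) <= best_before

# Made-up validation accuracies, one per epoch
history = []
stopped_at = None
for epoch, acc in enumerate([0.61, 0.72, 0.78, 0.78, 0.77, 0.79], start=1):
    history.append(acc)
    if should_stop(history, patience=2):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at)
```

A larger `patience` tolerates more flat epochs before stopping, at the cost of extra training time.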

Splitting the Dataset:  

    # define the train and validation datasets
    train_size, valid_size = 800, 200
    inds = np.arange( len( en_text ) )
    np.random.shuffle( inds ) 
    
    # get the train & validation indices
    train_inds = inds[ :train_size ]
    valid_inds = inds[ train_size : train_size+valid_size ] 
    
    # splitting the dataset
    tr_en = [ en_text[ ti ] for ti in train_inds ]
    tr_fr = [ fr_text[ ti ] for ti in train_inds ]
    v_en = [ en_text[ vi ] for vi in valid_inds ]
    v_fr = [ fr_text[ vi ] for vi in valid_inds ]
    
Training the Model with Validation:

    for ei in range( n_epochs ): # single traverse through the dataset
        for i in range( 0, train_size, batch_size ): # process a single batch
        
            # obtain a batch of training data
            en_x = sents2seqs( 'source', tr_en[ i:i+batch_size ], onehot=True, reverse=True )
            de_y = sents2seqs( 'target', tr_fr[ i:i+batch_size ], onehot=True )
        
            # train on a single batch of data
            nmt.train_on_batch( en_x, de_y )
        
        # convert the validation data, reversing the source as for the training data
        v_en_x = sents2seqs( 'source', v_en, onehot=True, reverse=True )
        v_de_y = sents2seqs( 'target', v_fr, onehot=True )
        
        # evaluate the model on the validation data at the end of each epoch
        res = nmt.evaluate( v_en_x, v_de_y, batch_size=batch_size, verbose=0 )
        print( "Epoch {} => Val Loss:{}, Val Acc: {}".format( ei+1, res[0], res[1]*100.0 ) )
    
<br>

In [164]:
# splitting the data into training and validation sets
train_size, valid_size = 80000, 20000
# Define a sequence of indices from 0 to len(en_text)
inds = np.arange(len(en_text))
np.random.shuffle(inds)
train_inds = inds[:train_size]
# Define valid_inds: the valid_size indices that follow the training indices
valid_inds = inds[train_size: train_size+valid_size]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])

Training (EN):
 ['china is sometimes cold during spring , and it is never chilly in may .\n', 'california is busy during winter , and it is usually warm in january .\n', 'california is never rainy during june , but it is cold in july .\n'] 
Training (FR):
 ['la chine est parfois froid au printemps , et il est jamais froid en mai .\n', "californie est occupé pendant l' hiver , et il est habituellement chaud en janvier .\n", 'california est jamais pluvieux en juin , mais il fait froid en juillet .\n']

Valid (EN):
 ['california is usually mild during october , and it is usually relaxing in february .\n', 'they dislike limes , oranges , and bananas.\n', 'california is never quiet during summer , but it is never pleasant in july .\n'] 
Valid (FR):
 ['californie est généralement doux en octobre , et il est relaxant habituellement en février .\n', "ils n'aiment pas , les oranges , citrons verts et les bananes .\n", "california est jamais calme pendant l' été , mais il est jamais agréable en 

In [165]:
# Convert validation data to onehot
v_en_x = sents2seqs('source', v_en, onehot=True, reverse=True)
v_de_y = sents2seqs('target', v_fr, onehot=True)

n_epochs, bsize = 3, 250
for ei in range(n_epochs):
  for i in range(0,train_size,bsize):
    # Get a single batch of inputs and outputs
    en_x = sents2seqs('source', tr_en[i:i+bsize], onehot=True, reverse=True)
    de_y = sents2seqs('target', tr_fr[i:i+bsize], onehot=True)
    # Train the model on a single batch of data
    nmt.train_on_batch(en_x, de_y)    
  # Evaluate the trained model on the validation data
  res = nmt.evaluate(v_en_x, v_de_y, batch_size=valid_size, verbose=0)
  print("{} => Loss:{}, Val Acc: {}".format(ei+1,res[0], res[1]*100.0))

1 => Loss:0.12522836029529572, Val Acc: 96.0669994354248
2 => Loss:0.06793607771396637, Val Acc: 98.09200167655945
3 => Loss:0.029000338166952133, Val Acc: 99.44133162498474


In [160]:
nmt.summary()

Model: "model_31"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_25 (InputLayer)           [(None, 15, 228)]    0                                            
__________________________________________________________________________________________________
gru_26 (GRU)                    [(None, 48), (None,  40032       input_25[0][0]                   
__________________________________________________________________________________________________
repeat_vector_7 (RepeatVector)  (None, 15, 48)       0           gru_26[0][1]                     
__________________________________________________________________________________________________
gru_27 (GRU)                    (None, 15, 48)       14112       repeat_vector_7[0][0]            
                                                                 gru_26[0][1]              

<br>

### Generating Translation with the NMT

Motivation: we have a trained NMT model, but how can we use it to generate translations? 

Try it with an example sentence:

In [166]:
en_st = ['the united states is sometimes chilly during december , but it is sometimes freezing in june .']
en_seq = sents2seqs( 'source', en_st, onehot=True, reverse=True )
print( np.argmax( en_seq, axis=-1 ))

[[34  2 51  8  1  3  6 47  4 62  8  1 21 20  5]]


In [167]:
# obtain the french translation
fr_pred = nmt.predict( en_seq )
print( fr_pred.shape )
fr_seq = np.argmax( fr_pred, axis=-1)[0]
print( fr_seq )

(1, 15, 228)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [168]:
fr_sentence = ' '.join([fr_tok.index_word[i] for i in fr_seq if i != 0 ] )
fr_sentence

''
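
The decoding step used above (argmax over the vocabulary axis, then dropping the padding ID 0) can be illustrated with a toy example. The vocabulary `index_word` and the probabilities are made up for illustration. When the argmax is 0 at every position, as in the output above, the joined sentence is empty.

```python
import numpy as np

# Hypothetical toy vocabulary; ID 0 is reserved for padding
index_word = {1: "les", 2: "chats", 3: "dorment"}

# Made-up prediction of shape (batch=1, seq_len=4, vocab=4)
probs = np.array([[
    [0.1, 0.7, 0.1, 0.1],   # argmax -> 1
    [0.1, 0.1, 0.6, 0.2],   # argmax -> 2
    [0.0, 0.1, 0.2, 0.7],   # argmax -> 3
    [0.9, 0.0, 0.1, 0.0],   # argmax -> 0 (padding)
]])

# Most probable word ID at each position of the first (only) sentence
word_ids = np.argmax(probs, axis=-1)[0]

# Map IDs back to words, skipping the padding ID
sentence = " ".join(index_word[i] for i in word_ids if i != 0)
print(word_ids, repr(sentence))
```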

<br>

## Teacher Forcing and Word Embeddings

### Introduction to Teacher Forcing

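**Teacher Forcing** - during training, the decoder is given the correct previous target word as input at each step (instead of its own prediction); the "teacher" guides the translation at each step, which helps the model learn faster  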

Implementing the model with Teacher Forcing:

    # Encoder
    en_inputs  = layers.Input( shape=(en_len, en_vocab))
    en_gru = layers.GRU( hsize, return_state=True )
    en_out, en_state = en_gru( en_inputs )
    
    # Decoder
    de_inputs = layers.Input( shape= (fr_len-1, fr_vocab) )
    de_gru = layers.GRU( hsize, return_sequences=True )
    de_out = de_gru( de_inputs, initial_state=en_state )
    
    # Decoder Prediction
    de_dense = layers.TimeDistributed( layers.Dense( fr_vocab, activation='softmax' ) )
    de_pred = de_dense( de_out )
    
    # Compiling the Model
    nmt_tf = Model( inputs=[en_inputs, de_inputs], outputs = de_pred )
    nmt_tf.compile( optimizer='adam', loss='categorical_crossentropy', metrics=['acc'] )
    
Preprocessing the Data:  

    # Encoder
    # Inputs - all English words (onehot encoded)
    en_x = sents2seqs( 'source', en_text, onehot=True, reverse=True )
    
    # Decoder
    de_xy = sents2seqs( 'target', fr_text, onehot=True )
    # Inputs - all French words except the last (onehot encoded)
    de_x = de_xy[:,:-1,:]
    # Outputs/Targets - all French words except the first (onehot encoded)
    de_y = de_xy[:,1:,:]
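
The one-position shift between decoder inputs and targets is easier to see on word IDs than on one-hot arrays. The sketch below uses made-up IDs and assumes `sos=1`, `eos=2` and `0` for padding; the real IDs depend on the fitted tokenizer.

```python
import numpy as np

# Made-up padded target IDs for one sentence (assumed: sos=1, eos=2, 0=padding)
de_xy = np.array([[1, 5, 7, 9, 2, 0]])

de_x = de_xy[:, :-1]   # decoder inputs: every position except the last
de_y = de_xy[:, 1:]    # decoder targets: every position except the first

# At each time step the decoder sees one word and must predict the next one
for x, y in zip(de_x[0], de_y[0]):
    print(f"input {x} -> target {y}")
```

With one-hot encoded data the slicing is the same, with one extra trailing axis for the vocabulary (`de_xy[:,:-1,:]` and `de_xy[:,1:,:]`).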
    
<br>

In [46]:
def word2onehot(tokenizer, word, vocab_size):
    de_seq = tokenizer.texts_to_sequences([[word]])
    de_onehot = to_categorical(de_seq, num_classes=vocab_size)
    de_onehot = np.expand_dims(de_onehot, axis=1)    
    return de_onehot

def probs2word(probs, tok):
    wid = np.argmax(probs[0,:], axis=-1)
    w = tok.index_word[wid]
    return w

def sents2seqs(input_type, sentences, onehot=False, pad_type='post', reverse=False):
    assert input_type in ["source", "target"]
    if input_type == 'source':
      tokenizer = en_tok
      pad_length = en_len
      vocab_size = en_vocab
    elif input_type == 'target':
      tokenizer = fr_tok
      pad_length = fr_len
      vocab_size = fr_vocab
    
    encoded_text = tokenizer.texts_to_sequences(sentences)
    preproc_text = pad_sequences(encoded_text, padding=pad_type, maxlen=pad_length)
    if reverse:
      preproc_text = preproc_text[:,::-1]
      
    if onehot:
        assert vocab_size is not None, "Cannot do to_categorical without num_classes for safety"
        preproc_text = to_categorical(preproc_text, num_classes=vocab_size)
    return preproc_text

In [52]:
# defining the Teacher Forcing model layers

en_len = 20
fr_len = 25
en_vocab = 200
fr_vocab = 350 #357
hsize = 64


# Import the layers submodule from keras
import tensorflow.keras.layers as layers

en_inputs = layers.Input(shape=(en_len, en_vocab))
en_gru = layers.GRU(hsize, return_state=True)
# Get the encoder output and state
en_out, en_state = en_gru(en_inputs)

# Define the decoder input layer
de_inputs = layers.Input(shape=(fr_len-1, fr_vocab))
de_gru = layers.GRU(hsize, return_sequences=True)
de_out = de_gru(de_inputs, initial_state=en_state)
# Define a TimeDistributed Dense softmax layer with fr_vocab nodes
de_dense = layers.TimeDistributed(layers.Dense(fr_vocab, activation='softmax'))
de_pred = de_dense(de_out)

In [53]:
# define the Teacher Forcing model

# Import the Keras Model object
from tensorflow.keras.models import Model

# Define a model
nmt_tf = Model(inputs=[en_inputs, de_inputs], outputs=de_pred)
# Compile the model with optimizer and loss
nmt_tf.compile(optimizer='adam', loss='categorical_crossentropy', metrics=["acc"])
# Print the summary of the model
nmt_tf.summary()

Model: "model_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_15 (InputLayer)           [(None, 20, 200)]    0                                            
__________________________________________________________________________________________________
input_16 (InputLayer)           [(None, 24, 350)]    0                                            
__________________________________________________________________________________________________
gru_14 (GRU)                    [(None, 64), (None,  51072       input_15[0][0]                   
__________________________________________________________________________________________________
gru_15 (GRU)                    (None, 24, 64)       79872       input_16[0][0]                   
                                                                 gru_14[0][1]               
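
The `Param #` values of the two GRU layers in the summary can be checked by hand. The sketch below assumes the default Keras GRU variant (`reset_after=True`), which keeps separate input and recurrent bias vectors; the helper name `gru_param_count` is hypothetical.

```python
def gru_param_count(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel and two bias vectors (assumes reset_after=True)
    return 3 * (units * (input_dim + units) + 2 * units)

# Encoder GRU: inputs are onehot vectors of size en_vocab=200, hsize=64
encoder_params = gru_param_count(input_dim=200, units=64)
# Decoder GRU: inputs are onehot vectors of size fr_vocab=350, hsize=64
decoder_params = gru_param_count(input_dim=350, units=64)

print(encoder_params, decoder_params)
```

Both values agree with the `gru_14` and `gru_15` rows of the summary (51072 and 79872).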

In [54]:
# Preprocessing the Data

#Define a tokenizer with vocabulary size 50 and oov_token 'UNK'
#en_tok_new = Tokenizer(num_words=en_vocab, oov_token='UNK')
#fr_tok_new = Tokenizer(num_words=fr_vocab, oov_token='UNK')


bsize = 100
for i in range(0, len(en_text), bsize):
  # Get the encoder inputs using the sents2seqs() function
  en_x = sents2seqs('source', en_text[i:i+bsize], onehot=True, reverse=True)
  # Get the decoder inputs/outputs using the sents2seqs() function
  de_xy = sents2seqs('target', fr_text[i:i+bsize], onehot=True)
  # Separate the decoder inputs from de_xy
  de_x = de_xy[:,:-1,:]
  # Separate the decoder outputs from de_xy
  de_y = de_xy[:,1:,:]
  
  #print("Data from ", i, " to ", i+bsize)
  #print("\tnp.argmax() => en_x[0]: ", np.argmax(en_x[0], axis=-1))
  #print("\tnp.argmax() => de_x[0]: ", np.argmax(de_x[0], axis=-1))
  #print("\tnp.argmax() => de_y[0]: ", np.argmax(de_y[0], axis=-1))

<br>

### Training the Model with Teacher Forcing

Training requires:

* A loss function (`'categorical_crossentropy'`). To compute the loss, the following are required:
    * probabilistic predictions generated from the inputs (`[batch_size, seq_len, vocab_size]`)
    * the actual onehot encoded French words (`[batch_size, seq_len, vocab_size]`)
    * the cross entropy measures the difference between the targets and the predicted word probabilities
* An optimizer (`'adam'`) - the loss is passed to an optimizer such as `'adam'`, which changes the model parameters to reduce the loss each time `train_on_batch()` is called, gradually improving the model parameters.
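
As a rough illustration of what the loss measures, here is a NumPy sketch of categorical cross-entropy on arrays of shape `[batch_size, seq_len, vocab_size]`. It is a simplified stand-in written for this note, not the Keras implementation, and the prediction values are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then average -sum(target * log(prediction))
    # over all sentences and time steps
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

# Shape (batch_size=1, seq_len=2, vocab_size=3)
y_true = np.array([[[0., 1., 0.], [0., 0., 1.]]])
good = np.array([[[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]])   # confident and correct
bad = np.array([[[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]])    # confident and wrong

loss_good = categorical_crossentropy(y_true, good)
loss_bad = categorical_crossentropy(y_true, bad)
print(loss_good, loss_bad)
```

The loss is small when the probability assigned to the correct word is close to 1 and grows as that probability shrinks, which is the signal the optimizer uses to update the parameters.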

Training the Model:  

    n_epochs, bsize = 3, 250
    
    for ei in range( n_epochs ):
        for i in range( 0, data_size, bsize ):
            # Encoder inputs, decoder inputs and outputs
            en_x = sents2seqs( 'source', en_text[ i:i+bsize ], onehot=True, reverse=True )
            de_xy = sents2seqs( 'target', fr_text[ i:i+bsize ], onehot=True )
            # Separating decoder inputs and outputs
            de_x = de_xy[:,:-1,:]
            de_y = de_xy[:,1:,:]
            # Training and evaluating on a single batch
            nmt_tf.train_on_batch( [en_x, de_x], de_y )
            res = nmt_tf.evaluate( [en_x, de_x], de_y, batch_size=bsize, verbose=0 )
            print( "{} => Train Loss:{}, Train Acc:{}".format( ei+1, res[0], res[1]*100.0))
            
Creating Training and Validation Data:

    # define the train and validation datasets
    train_size, valid_size = 800, 200
    inds = np.arange( len( en_text ) )
    np.random.shuffle( inds ) 

    # get the train & validation indices
    train_inds = inds[ :train_size ]
    valid_inds = inds[ train_size : train_size+valid_size ] 

    # splitting the dataset
    tr_en = [ en_text[ ti ] for ti in train_inds ]
    tr_fr = [ fr_text[ ti ] for ti in train_inds ]
    v_en = [ en_text[ vi ] for vi in valid_inds ]
    v_fr = [ fr_text[ vi ] for vi in valid_inds ]
    
Training with Validation

    n_epochs, bsize = 3, 250
    
    for ei in range( n_epochs ):
        for i in range( 0, train_size, bsize ):
            # Encoder inputs, decoder inputs and outputs (training split only)
            en_x = sents2seqs( 'source', tr_en[ i:i+bsize ], onehot=True, reverse=True )
            de_xy = sents2seqs( 'target', tr_fr[ i:i+bsize ], onehot=True )
            # Separating decoder inputs and outputs
            de_x = de_xy[:,:-1,:]
            de_y = de_xy[:,1:,:]
            # Training on a single batch
            nmt_tf.train_on_batch( [en_x, de_x], de_y )
        v_en_x = sents2seqs( 'source', v_en, onehot=True, reverse=True )
        v_de_xy = sents2seqs( 'target', v_fr, onehot=True )
        v_de_x, v_de_y = v_de_xy[:,:-1,:], v_de_xy[:,1:,:]
        res = nmt_tf.evaluate( [v_en_x, v_de_x], v_de_y, batch_size=bsize, verbose=0 )
        print( "{} => Loss:{}, Validation Acc:{}".format( ei+1, res[0], res[1]*100.0))
 

In [55]:
import numpy as np
# splitting training and validation sets

train_size, valid_size = 800, 200
# Define a sequence of indices from 0 to size of en_text
inds = np.arange(len(en_text))
np.random.shuffle(inds)
# Define train_inds as first train_size indices
train_inds = inds[:train_size]
valid_inds = inds[train_size:train_size+valid_size]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])

Training (EN):
 ['he likes a big blue truck .\n', 'new jersey is usually chilly during february , but it is quiet in fall .\n', 'france is sometimes rainy during autumn , and it is usually snowy in march .\n'] 
Training (FR):
 ['il aime un gros camion bleu .\n', "new jersey est généralement froid en février , mais il est calme à l' automne .\n", 'la france est parfois pluvieux en automne , et il est généralement enneigée en mars .\n']

Valid (EN):
 ['the united states is never nice during april , but it is wet in october .\n', 'you dislike grapefruit , lemons , and pears .\n', 'india is nice during may , but it is freezing in february .\n'] 
Valid (FR):
 ['les états-unis est jamais agréable en avril , mais il est humide en octobre .\n', "vous n'aimez pas pamplemousses , les citrons et les poires .\n", "l' inde est agréable au mois de mai , mais il gèle en février .\n"]


In [56]:
# training the model with validation
n_epochs, bsize = 3, 250
for ei in range(n_epochs):
    for i in range(0,train_size,bsize):
        en_x = sents2seqs('source', tr_en[i:i+bsize], onehot=True, reverse=True)
        de_xy = sents2seqs('target', tr_fr[i:i+bsize], onehot=True)
        # Create a single batch of decoder inputs and outputs
        de_x, de_y = de_xy[:,:-1,:], de_xy[:,1:,:]
        # Train the model on a single batch of data
        nmt_tf.train_on_batch([en_x,de_x], de_y)      
    v_en_x = sents2seqs('source', v_en, onehot=True, reverse=True)
    # Create a single batch of validation decoder inputs and outputs
    v_de_xy = sents2seqs('target', v_fr, onehot=True)
    v_de_x, v_de_y = v_de_xy[:,:-1,:], v_de_xy[:,1:,:]
    # Evaluate the trained model on the validation data
    res = nmt_tf.evaluate([v_en_x,v_de_x], v_de_y, batch_size=valid_size, verbose=0)
    print("{} => Loss:{}, Val Acc: {}".format(ei+1,res[0], res[1]*100.0))

1 => Loss:5.795658111572266, Val Acc: 48.95833432674408
2 => Loss:5.742404937744141, Val Acc: 54.83333468437195
3 => Loss:5.6771697998046875, Val Acc: 54.6875


<br>

### Generating Translations from the Model

Decoder of the Inference Model:  

* Takes in:
    - a onehot encoded word
    - a state input
* Produces:
    - a new state
    - a prediction

Recursively feed the predicted word and the new state back to the decoder as inputs:

* `sos` marks the beginning of the translation and is fed to the decoder as the first word
* `eos` marks the end of the translation
* as a safety measure, use a maximum length that the model can predict for
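
The recursion can be sketched without a trained model. In the example below, `toy_decoder_step` is a hypothetical stand-in for the real decoder: it maps a word and a state to the next word and a new state using a fixed lookup table.

```python
# Hypothetical stand-in for the decoder: (word, state) -> (next word, new state)
def toy_decoder_step(word, state):
    table = {"sos": "il", "il": "fait", "fait": "froid", "froid": "eos"}
    return table[word], state + 1

def greedy_decode(step_fn, init_state, max_len=10):
    word, state, words = "sos", init_state, []
    for _ in range(max_len):          # safety limit on the translation length
        # Feed the previous word and state back in to get the next ones
        word, state = step_fn(word, state)
        if word == "eos":             # end-of-sentence token stops generation
            break
        words.append(word)
    return " ".join(words)

translation = greedy_decode(toy_decoder_step, init_state=0)
print(translation)
```

With the real model, `step_fn` would wrap the decoder's prediction call plus the conversion between words and onehot vectors.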

Defining the Generator Encoder:  

    # import keras layers and model
    import tensorflow.keras.layers as layers
    from tensorflow.keras.models import Model
    
    # Defining model layers
    en_inputs = layers.Input( shape=( en_len, en_vocab ) )
    en_gru = layers.GRU( hsize, return_state=True )
    en_out, en_state = en_gru( en_inputs )
    
    # Defining Model object
    encoder = Model( inputs= en_inputs, outputs= en_state )
    
Defining the Generator Decoder:  

    # defining decoder Input layers
    de_inputs = layers.Input( shape=(1,fr_vocab) )
    de_state_in = layers.Input( shape=(hsize,) )
    
    # defining the decoders interim layers
    de_gru = layers.GRU( hsize, return_state=True )
    de_out, de_state_out = de_gru( de_inputs, initial_state=de_state_in )
    de_dense = layers.Dense( fr_vocab, activation='softmax' )
    de_pred = de_dense( de_out )
    
    # defining decoder model
    decoder = Model( inputs=[ de_inputs, de_state_in ], outputs=[de_pred, de_state_out ] )
    
Important: the weights of the Encoder GRU, Decoder GRU and Decoder Dense layers must be copied from the trained model to the inference model.  

Generating Translation:  

    en_st = ['the united states is sometimes chilly during december , but it is sometimes freezing in june .']
    
    # convert to a sequence
    en_seq = sents2seqs( 'source', en_st, onehot=True, reverse=True )
    
    # get the context vector
    de_s_t = encoder.predict( en_seq )
    
    # converting 'sos' to a sequence
    de_seq = word2onehot( fr_tok, 'sos', fr_vocab )
    
    #generating translation
    fr_sent = ''
    for _ in range( fr_len ):
        de_prob, de_s_t = decoder.predict( [de_seq,de_s_t] )
        de_w = probs2word( de_prob, fr_tok )
        de_seq = word2onehot( fr_tok, de_w, fr_vocab )
        if de_w == 'eos': break
        fr_sent += de_w + ' '
        
<br>

In [None]:
import tensorflow.keras.layers as layers
from tensorflow.keras.models import Model
# Define an input layer that accepts a single onehot encoded word
de_inputs = layers.Input(shape=(1, fr_vocab))
# Define an input to accept the t-1 state
de_state_in = layers.Input(shape=(hsize,))
de_gru = layers.GRU(hsize, return_state=True)
# Get the output and state from the GRU layer
de_out, de_state_out = de_gru(de_inputs, initial_state=de_state_in)
de_dense = layers.Dense(fr_vocab, activation='softmax')
de_pred = de_dense(de_out)

# Define a model
decoder = Model(inputs=[de_inputs, de_state_in], outputs=[de_pred, de_state_out])
print(decoder.summary())

In [None]:
# an example of linking the trained model weights with the inference model

# Load the weights to the encoder GRU from the trained model
en_gru_w = tr_en_gru.get_weights()
# Set the weights of the encoder GRU of the inference model
en_gru.set_weights(en_gru_w)
# Load and set the weights to the decoder GRU
de_gru.set_weights(tr_de_gru.get_weights())
# Load and set the weights to the decoder Dense
de_dense.set_weights(tr_de_dense.get_weights())

In [None]:
en_sent = ['the united states is sometimes chilly during december , but it is sometimes freezing in june .']
print('English: {}'.format(en_sent))
en_seq = sents2seqs('source', en_sent, onehot=True, reverse=True)
# Predict the initial decoder state with the encoder
de_s_t = encoder.predict(en_seq)
de_seq = word2onehot(fr_tok, 'sos', fr_vocab)
fr_sent = ''
for i in range(fr_len):    
  # Predict from the decoder and recursively assign the new state to de_s_t
  de_prob, de_s_t = decoder.predict([de_seq,de_s_t])
  # Get the word from the probability output using probs2word
  de_w = probs2word(de_prob, fr_tok)
  # Convert the word to a onehot sequence using word2onehot
  de_seq = word2onehot(fr_tok, de_w, fr_vocab)
  if de_w == 'eos': break
  fr_sent += de_w + ' '
print("French (Ours): {}".format(fr_sent))
print("French (Google Translate): les etats-unis sont parfois froids en décembre, mais parfois gelés en juin")

<br>

### Using Word Embeddings for Machine Translation

Finding the cosine similarity between word vectors:  

    from sklearn.metrics.pairwise import cosine_similarity
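
For intuition, cosine similarity can also be computed directly with NumPy. The three word vectors below are made up for illustration; real embeddings are learned and have many more dimensions.

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional word vectors, for illustration only
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

sim_cat_dog = cosine_sim(cat, dog)
sim_cat_car = cosine_sim(cat, car)
print(sim_cat_dog, sim_cat_car)
```

Vectors pointing in similar directions score close to 1. Note that `sklearn`'s `cosine_similarity` expects 2-D arrays (one row per vector), so single vectors need to be reshaped before being passed to it.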
    
Implementing embeddings for the encoder:  

    en_inputs = Input( shape=( en_len, ) )
    en_emb = Embedding( en_vocab, embedding_size, input_length=en_len )( en_inputs )
    en_out, en_state = GRU( hsize, return_state=True )( en_emb )
    
Implementing the decoder with Embedding:  

    de_inputs = Input( shape=( fr_len-1, ) )
    de_emb = Embedding( fr_vocab, embedding_size, input_length=fr_len-1 )( de_inputs )
    de_out, _ = GRU( hsize, return_sequences=True, return_state=True )( de_emb, initial_state=en_state )
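
An embedding layer takes word IDs rather than onehot vectors because looking up row `i` of the embedding matrix gives the same result as multiplying a onehot vector for `i` by that matrix. The NumPy sketch below demonstrates this with a random matrix standing in for learned weights; the sizes are illustrative.

```python
import numpy as np

vocab_size, embedding_size = 5, 3          # illustrative sizes
rng = np.random.default_rng(0)
# Random matrix standing in for the learned embedding weights
emb_matrix = rng.normal(size=(vocab_size, embedding_size))

word_ids = np.array([2, 0, 4])             # a sequence of word IDs

# Embedding lookup: pick one row per word ID
looked_up = emb_matrix[word_ids]

# Equivalent but wasteful route: onehot encode, then matrix-multiply
onehot = np.eye(vocab_size)[word_ids]
via_onehot = onehot @ emb_matrix

print(looked_up.shape, np.allclose(looked_up, via_onehot))
```

This is why the inputs of the embedding model have shape `(en_len,)` and `(fr_len-1,)` with no vocabulary axis.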
    
Training the model:  

    n_epochs, bsize = 3, 250

    for ei in range( n_epochs ):
        for i in range( 0, train_size, bsize ):
            # Encoder and decoder inputs are word IDs (no onehot), as the Embedding layers expect
            en_x = sents2seqs( 'source', tr_en[ i:i+bsize ], onehot=False, reverse=True )
            de_xy = sents2seqs( 'target', tr_fr[ i:i+bsize ], onehot=False )
            # Separating decoder inputs (word IDs) from the onehot encoded outputs
            de_x = de_xy[:,:-1]
            de_xy_oh = sents2seqs( 'target', tr_fr[i:i+bsize], onehot=True)
            de_y = de_xy_oh[:,1:,:]
            # Training and evaluating on a single batch
            nmt_emb.train_on_batch( [en_x, de_x], de_y )
            res = nmt_emb.evaluate( [en_x, de_x], de_y, batch_size=bsize, verbose=0 )
            print( "{} => Train Loss:{}, Train Acc:{}".format( ei+1, res[0], res[1]*100.0))

In [60]:
# defining an embedding model

# Define an input layer which accepts a sequence of word IDs
en_inputs = layers.Input(shape=(en_len,))
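# Define an Embedding layer which accepts en_inputs
# (en_vocab would be the natural vocabulary size here; this run used fr_vocab, which is
#  why the summary reports 350 * 96 = 33600 parameters for the encoder embedding)
en_emb = layers.Embedding(fr_vocab, 96, input_length=en_len)(en_inputs)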
en_out, en_state = layers.GRU(hsize, return_state=True)(en_emb)

de_inputs = layers.Input(shape=(fr_len-1,))
# Define an Embedding layer which accepts de_inputs
de_emb = layers.Embedding(fr_vocab, 96, input_length=fr_len-1)(de_inputs)
de_out, _ = layers.GRU(hsize, return_sequences=True, return_state=True)(de_emb, initial_state=en_state)
de_pred = layers.TimeDistributed(layers.Dense(fr_vocab, activation='softmax'))(de_out)

# Define the Model which accepts encoder/decoder inputs and outputs predictions 
nmt_emb = Model([en_inputs, de_inputs], de_pred)
nmt_emb.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

nmt_emb.summary()

Model: "model_8"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_20 (InputLayer)           [(None, 20)]         0                                            
__________________________________________________________________________________________________
input_21 (InputLayer)           [(None, 24)]         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 20, 96)       33600       input_20[0][0]                   
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 24, 96)       33600       input_21[0][0]                   
____________________________________________________________________________________________

In [62]:
# training the word embedding model

for ei in range(10):
  for i in range(0, train_size, bsize):    
    en_x = sents2seqs('source', tr_en[i:i+bsize], onehot=False, reverse=True)
    # Get a single batch of French sentences with no onehot encoding
    de_xy = sents2seqs('target', tr_fr[i:i+bsize], onehot=False)
    # Get all words except the last word in that batch
    de_x = de_xy[:,:-1]
    de_xy_oh = sents2seqs('target', tr_fr[i:i+bsize], onehot=True)
    # Get all words except the first from de_xy_oh
    de_y = de_xy_oh[:,1:,:]
    # Training the model on a single batch of data
    nmt_emb.train_on_batch([en_x,de_x], de_y)    
    res = nmt_emb.evaluate([en_x, de_x], de_y, batch_size=bsize, verbose=0)
    print("{} => Loss:{}, Train Acc: {}".format(ei+1,res[0], res[1]*100.0))

1 => Loss:5.59381628036499, Train Acc: 54.366666078567505
1 => Loss:5.555060386657715, Train Acc: 53.64999771118164
1 => Loss:5.50683069229126, Train Acc: 53.64999771118164
1 => Loss:5.423998832702637, Train Acc: 56.25
2 => Loss:5.372900485992432, Train Acc: 53.94999980926514
2 => Loss:5.296853065490723, Train Acc: 53.38333249092102
2 => Loss:5.19899320602417, Train Acc: 52.88333296775818
2 => Loss:5.029825210571289, Train Acc: 55.916666984558105
3 => Loss:4.914877414703369, Train Acc: 53.64999771118164
3 => Loss:4.7456254959106445, Train Acc: 52.99999713897705
3 => Loss:4.529751300811768, Train Acc: 52.86666750907898
3 => Loss:4.191209316253662, Train Acc: 55.75000047683716
4 => Loss:3.971393585205078, Train Acc: 52.666664123535156
4 => Loss:3.713937520980835, Train Acc: 52.016669511795044
4 => Loss:3.467141628265381, Train Acc: 51.866668462753296
4 => Loss:3.1483168601989746, Train Acc: 55.33333420753479
5 => Loss:3.072063684463501, Train Acc: 52.666664123535156
5 => Loss:2.987277269

<br>

## Summary

This tutorial worked through several Machine Translation approaches, ordered by increasing complexity:  

1. Model 1: NMT  
    - encoder consumes onehot encoded English words and returns a context vector
    - decoder consumes the context vector and outputs a translation
2. Model 2: NMT + Teacher Forcing
    - encoder consumes onehot encoded English words and returns a context vector
    - decoder consumes a given onehot encoded word of the translation and predicts the next word
3. Model 3: NMT + TF + Embedding
    - encoder uses word embeddings that capture the semantic relationships between words
    - decoder consumes the embedding of a given word of the translation and predicts the next word
    
Other developments:  

* BLEU score
* Word piece model
* Transformer models - use attention instead of sequential (recurrent) models