# Machine Translation in `Python`

(Source Language) $\longrightarrow$ (Target Language)  

One-hot encoded vectors:  

* a sparse vector of ones and zeros
    * 1: token is present
    * 0: token is not present
* vector length is determined by the size of the vocabulary
    * vocabulary = set of tokens in dataset

In [7]:
# mapping that contains words and their corresponding indices
word2index = { 'I':0, 'like':1, 'cats':2 }
# converting words to IDs or indices
words = [ 'I', 'like', 'cats' ]
word_ids = [ word2index[w] for w in words ]
print( word_ids )

[0, 1, 2]


In [11]:
# one-hot encoding with keras
from tensorflow.keras.utils import to_categorical

onehot_1 = to_categorical( word_ids, num_classes=5 )
print( [ (w,ohe.tolist()) for w,ohe in zip( words, onehot_1 )])

[('I', [1.0, 0.0, 0.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0, 0.0, 0.0]), ('cats', [0.0, 0.0, 1.0, 0.0, 0.0])]


In [12]:
# exploring the `to_categorical()` function
def compute_onehot_length(words, word2index):
  # Create word IDs for words
  word_ids = [word2index[w] for w in words]
  # Convert word IDs to onehot vectors
  onehot = to_categorical(word_ids)
  # Return the length of a single one-hot vector
  return onehot.shape[1]

word2index = {"He":0, "drank": 1, "milk": 2}
# Compute and print onehot length of a list of words
print(compute_onehot_length(['He','drank','milk'], word2index))

3


In [13]:
# without num_classes, the vector length is inferred from the largest word ID (max ID + 1);
# pass num_classes to fix the length of the vectors
word2index = {'He': 6,'I': 0,'We': 3,'cats': 2,'dogs': 5,'hates': 7,'like': 4,'rabbits': 8}
words_1 = ["I", "like", "cats", "We", "like", "dogs", "He", "hates", "rabbits"]
# Call compute_onehot_length on words_1
length_1 = compute_onehot_length(words_1, word2index)

words_2 = ["I", "like", "cats", "We", "like", "dogs", "We", "like", "cats"]
# Call compute_onehot_length on words_2
length_2 = compute_onehot_length(words_2, word2index)

# Print length_1 and length_2
print("length_1 =>", length_1, " and length_2 => ", length_2)

length_1 => 9  and length_2 =>  6
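The two lengths differ because, when `num_classes` is omitted, `to_categorical` infers the vector length from the largest ID it receives (max ID + 1). The sketch below mimics that rule with NumPy only; `onehot` is a hypothetical stand-in written for illustration, not a Keras function.

```python
import numpy as np

def onehot(word_ids, num_classes=None):
    """NumPy stand-in for to_categorical (hypothetical helper, for illustration)."""
    word_ids = np.asarray(word_ids)
    # Mirror the inference rule: without num_classes, use largest ID + 1
    if num_classes is None:
        num_classes = int(word_ids.max()) + 1
    # Row i of the identity matrix is the one-hot vector for ID i
    return np.eye(num_classes, dtype="float32")[word_ids]

word2index = {'He': 6, 'I': 0, 'We': 3, 'cats': 2, 'dogs': 5,
              'hates': 7, 'like': 4, 'rabbits': 8}
words_2 = ["I", "like", "cats", "We", "like", "dogs", "We", "like", "cats"]
ids_2 = [word2index[w] for w in words_2]

# Length inferred from the data: the largest ID in words_2 is 5, so length 6
inferred = onehot(ids_2)
# Length fixed from the mapping: the largest ID in word2index is 8, so length 9
fixed = onehot(ids_2, num_classes=max(word2index.values()) + 1)
print(inferred.shape, fixed.shape)  # (9, 6) (9, 9)
```

Fixing the length from the mapping keeps every sentence in the same vector space, regardless of which words it happens to contain.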


<br>

### Encoder Decoder Model

A machine translation model first consumes the words of the source language sequentially, and then predicts the corresponding words in the target language, also sequentially.  

Input $\longrightarrow$ **Encoder Model** $\longrightarrow$ Context Vector $\longrightarrow$  **Decoder Model** $\longrightarrow$ Output

**Writing the Encoder**  

    def words2onehot( word_list, word2index ):
        word_ids = [word2index[w] for w in word_list]
        onehot = to_categorical( word_ids, 3 )
        return onehot
        
    def encoder( onehot ):
        word_ids = np.argmax( onehot, axis=1 )
        return word_ids 
        
    onehot = words2onehot( ['I', 'like', 'cats'], word2index )
    context = encoder( onehot )
    print( context )
    
**Writing the Decoder**  

    def decoder( context_vector ):
        word_ids_rev = context_vector[::-1]
        onehot_rev = to_categorical( word_ids_rev, 3 )
        return onehot_rev
        
    def onehot2words( onehot, index2word ):
        ids = np.argmax( onehot, axis = 1 )
        return [index2word[id] for id in ids]
        
    onehot_rev = decoder( context )
    reversed_words = onehot2words( onehot_rev, index2word )
    print( reversed_words )
    
<br>

In [15]:
# The encoder

import numpy as np

word2index = {'I': 0, 'cats': 2, 'like': 1}

def words2onehot(word_list, word2index):
  # Convert words to word IDs
  word_ids = [word2index[w] for w in word_list]
  # Convert word IDs to onehot vectors and return the onehot array
  onehot = to_categorical(word_ids, num_classes=3)
  return onehot

words = ["I", "like", "cats"]
# Convert words to onehot vectors using words2onehot
onehot = words2onehot(words, word2index)
# Print the result as (<word>, <onehot>) tuples
print([(w,ohe.tolist()) for w,ohe in zip(words, onehot)])

[('I', [1.0, 0.0, 0.0]), ('like', [0.0, 1.0, 0.0]), ('cats', [0.0, 0.0, 1.0])]


In [16]:
# Encoder: Text reversing model
def encoder(onehot):
  # Get word IDs from onehot vectors and return the IDs
  word_ids = np.argmax(onehot, axis=1)
  return word_ids

# Define "We like dogs" as words
words = ['We','like','dogs']
# Define the word2index dict
word2index = {'We': 0, 'dogs': 2, 'like': 1}

# Convert words to onehot vectors using words2onehot
onehot = words2onehot(words, word2index)
# Get the context vector by using the encoder function
context = encoder(onehot)
print(context)

[0 1 2]


In [None]:
# Implementing the Decoder
# Define the onehot2words function that returns words for a set of onehot vectors
def onehot2words(onehot, index2word):
  ids = np.argmax(onehot, axis=1)
  res = [index2word[id] for id in ids]
  return res
# Define the decoder function that returns reversed onehot vectors
def decoder(context_vector):
  word_ids_rev = context_vector[::-1]
  onehot_rev = to_categorical(word_ids_rev, num_classes=3)
  return onehot_rev
# Convert context to reversed onehot vectors using decoder
onehot_rev = decoder(context)
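# Build the inverse mapping (index -> word) from word2index
index2word = {i: w for w, i in word2index.items()}
# Get the reversed words using the onehot2words function
reversed_words = onehot2words(onehot_rev, index2word)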
print(reversed_words)
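The decoder cell above is shown without output. As a rough check of the whole round trip, here is a minimal self-contained sketch that uses NumPy's identity matrix in place of `to_categorical`. That substitution, and building `index2word` by inverting `word2index`, are assumptions made for illustration.

```python
import numpy as np

word2index = {'We': 0, 'like': 1, 'dogs': 2}
# Assumption: index2word is simply the inverse of word2index
index2word = {i: w for w, i in word2index.items()}

def words2onehot(word_list, word2index):
    # Words -> IDs -> rows of the identity matrix (one-hot vectors)
    ids = [word2index[w] for w in word_list]
    return np.eye(len(word2index))[ids]

def encoder(onehot):
    # The "context vector" of this toy model is just the word IDs
    return np.argmax(onehot, axis=1)

def decoder(context_vector):
    # Reverse the IDs and convert them back to one-hot vectors
    return np.eye(len(word2index))[context_vector[::-1]]

def onehot2words(onehot, index2word):
    return [index2word[i] for i in np.argmax(onehot, axis=1)]

context = encoder(words2onehot(['We', 'like', 'dogs'], word2index))
reversed_words = onehot2words(decoder(context), index2word)
print(reversed_words)  # ['dogs', 'like', 'We']
```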

<br>

### Understanding Sequential Models

**Time Series inputs and Sequential Models**  

* sentences as time series input
    * the current word is affected by the previous words
* the encoder/decoder uses a **sequential model**: a machine learning model that can learn from time series inputs

**Gated Recurrent Unit (GRU)** - a chain of GRU units, each of which takes an input and passes a hidden state to the next unit until the whole sequence is processed. The hidden state at each unit represents the 'memory' of what the model has seen so far.  
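To make the "hidden state passed from unit to unit" idea concrete, here is a minimal NumPy sketch of one common GRU formulation. The weights are random rather than trained, the names (`gru_step`, `params`) are hypothetical, and library implementations differ in details such as bias layout, so treat it as an illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step: combine the current input with the previous hidden state."""
    z = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])   # update gate
    r = sigmoid(x_t @ p["Wr"] + h_prev @ p["Ur"] + p["br"])   # reset gate
    # Candidate state: the reset gate decides how much old memory to use
    h_cand = np.tanh(x_t @ p["Wh"] + (r * h_prev) @ p["Uh"] + p["bh"])
    # New state: a gated mix of the old memory and the candidate
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
input_dim, hidden = 4, 10
params = {}
for g in "zrh":
    params["W" + g] = rng.normal(scale=0.1, size=(input_dim, hidden))
    params["U" + g] = rng.normal(scale=0.1, size=(hidden, hidden))
    params["b" + g] = np.zeros(hidden)

sequence = rng.normal(size=(3, input_dim))   # 3 time steps, 4 features each
h = np.zeros(hidden)                         # initial hidden state
for x_t in sequence:
    h = gru_step(x_t, h, params)             # state is carried to the next step
print(h.shape)  # (10,)
```

The final `h` plays the role of the context vector: a fixed-size summary of everything the loop has consumed.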

**`Keras` (functional API) refresher**  

* `Keras` has two important objects: `Layer` and `Model` objects
* Input Layer
    * `inp = keras.layers.Input( shape = (...))`
* Hidden Layer
    * `layer = keras.layers.GRU(...)`
* Output
    * `out = layer( inp )`
* Model
    * `model = Model( inputs=inp, outputs=out )`
    
**Understanding the Shape of the Data**  
* Sequence data is 3-dimensional
    1. **batch dimension** - the number of sequences
    2. **time dimension** - the length of the sequences
    3. **input dimension** - length of the onehot vector (vocab length)
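As a small illustration of the three dimensions, the sketch below builds a batch of one-hot encoded sentences with NumPy. The vocabulary and sentences are made up for the example.

```python
import numpy as np

# Hypothetical 4-word vocabulary and a batch of two 3-word sentences
word2index = {'I': 0, 'We': 1, 'like': 2, 'cats': 3}
sentences = [['I', 'like', 'cats'], ['We', 'like', 'cats']]

# IDs have shape (batch, time)
ids = np.array([[word2index[w] for w in s] for s in sentences])
# Indexing the identity matrix adds the input (one-hot) dimension
batch = np.eye(len(word2index))[ids]
print(ids.shape, batch.shape)  # (2, 3) (2, 3, 4)
```

Here the batch dimension is 2 (sentences), the time dimension is 3 (words per sentence), and the input dimension is 4 (vocabulary size).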
    
**Implementing GRUs with `Keras`**  

Defining `Keras` layers:  

    inp = keras.layers.Input( batch_shape=( batchdim, timedim, inputdim ) )
    # for a model that accepts an arbitrary number of samples, leave out batchdim:
    # inp = keras.layers.Input( shape=( timedim, inputdim ) )
    gru_out, gru_state = keras.layers.GRU( 10, return_state =True )(inp)
    #alternatively:
    gru_out = keras.layers.GRU( 10, return_sequences=True )(inp)
    
Defining a `Keras` model:  

    model = keras.models.Model( inputs=inp, outputs=gru_out )

Predicting with the `Keras` model:  

    x = np.random.normal( size = ( batchdim, timedim, inputdim ) )
    y = model.predict( x )
    print( "shape (y) =", y.shape, "\ny =\n", y )
    


In [17]:
#implement a simple model that has an input layer and a GRU layer. 
#You will then use the model to produce output values for a random input array.

import tensorflow.keras as keras
import numpy as np
# Define an input layer
inp = keras.layers.Input(batch_shape=(2,3,4))
# Define a GRU layer that takes in the input
gru_out = keras.layers.GRU(10)(inp)

# Define a model that outputs the GRU output
model = keras.models.Model(inputs=inp, outputs=gru_out)

x = np.random.normal(size=(2,3,4))
# Get the output of the model and print the result
y = model.predict(x)
print("shape (y) =", y.shape, "\ny = \n", y)

shape (y) = (2, 10) 
y = 
 [[-0.54010355  0.3191152   0.03250751  0.16956952  0.2524814   0.0261744
   0.0873507  -0.18221648  0.40233934  0.38475484]
 [-0.47358763  0.4035396   0.04614807  0.29768118  0.3857712   0.06155618
   0.21121213 -0.32127964  0.34948695  0.2716644 ]]


In [18]:
#see how you can use Keras models to accept arbitrary sized batches of inputs

# Define an input layer
inp = keras.layers.Input(shape=(3,4))
# Define a GRU layer that takes in the input
gru_out = keras.layers.GRU(10)(inp)
# Define a model that outputs the GRU output
model = keras.models.Model(inputs=inp, outputs=gru_out)

x1 = np.random.normal(size=(2,3,4))
x2 = np.random.normal(size=(5,3,4))

# Get the output of the model and print the result
y1 = model.predict(x1)
y2 = model.predict(x2)
print("shape (y1) = ", y1.shape, " shape (y2) = ", y2.shape)

shape (y1) =  (2, 10)  shape (y2) =  (5, 10)


<br>

## Implementing the Encoder/Decoder Model with `Keras`

### Implementing the Encoder

Understanding the Data:  

In [29]:
with open( 'vocab_fr.txt' ) as f:
    fr_text = f.readlines()
    
with open( 'vocab_en.txt' ) as f:
    en_text = f.readlines()

In [30]:
for en_sent, fr_sent in zip( en_text[:3], fr_text[:3]):
    print( 'English: ', en_sent )
    print( 'French: ', fr_sent )

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .

French:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .

English:  the united states is usually chilly during july , and it is usually freezing in november .

French:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

English:  california is usually quiet during march , and it is usually hot in june .

French:  california est généralement calme en mars , et il est généralement chaud en juin .



<br>

### Tokenizing the Sentences

Now to look at some of the attributes of the dataset  
**Tokenization** - the process of breaking a sentence/phrase to individual tokens  

In [32]:
first_sent = en_text[0]
print( 'first sentence: ', first_sent )
first_words = first_sent.split(' ')
print( '\tWords: ', first_words )

first sentence:  new jersey is sometimes quiet during autumn , and it is snowy in april .

	Words:  ['new', 'jersey', 'is', 'sometimes', 'quiet', 'during', 'autumn', ',', 'and', 'it', 'is', 'snowy', 'in', 'april', '.\n']
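Note that the last token in the output above is `'.\n'`: `split(' ')` keeps the trailing newline that `readlines()` leaves on each line, so it would be treated as a different token from a plain `'.'`. A minimal sketch of one way to avoid this, using `strip()` before splitting:

```python
# First sentence as returned by readlines(), trailing newline included
first_sent = "new jersey is sometimes quiet during autumn , and it is snowy in april .\n"

naive = first_sent.split(' ')          # keeps the newline on the last token
clean = first_sent.strip().split()     # drops surrounding whitespace first

print(repr(naive[-1]), repr(clean[-1]))  # '.\n' '.'
print(len(naive), len(clean))            # 15 15
```

The token count is the same for this sentence, so only the identity of the last token changes.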


<br>

**Computing the average length of sentences**

In [34]:
sent_length = [len(text.split(' ')) for text in en_text]
mean_en_length = np.mean( sent_length )
print( 'ENGLISH mean sentence length = ', mean_en_length)

sent_length = [len(text.split(' ')) for text in fr_text]
mean_fr_length = np.mean( sent_length )
print( 'FRENCH mean sentence length = ', mean_fr_length)

ENGLISH mean sentence length =  13.225678224285508
FRENCH mean sentence length =  14.226737269693892


In [36]:
all_words = []
[all_words.extend( sent.split(' ')) for sent in en_text]
en_vocab_size = len( set( all_words ) )
print( 'ENGLISH vocab size = ', en_vocab_size )

all_words = []
[all_words.extend( sent.split(' ')) for sent in fr_text]
fr_vocab_size = len( set( all_words ) )
print( 'FRENCH vocab size = ', fr_vocab_size )

ENGLISH vocab size =  228
FRENCH vocab size =  357
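The vocabulary size is what fixes the length of the one-hot vectors, so the next natural step is to turn the vocabulary into a `word2index` mapping. The sketch below does this for the three English sentences shown earlier. Sorting the vocabulary is an assumption made here so that the IDs are reproducible; any consistent ordering would work.

```python
# The three sample English sentences shown earlier, as read by readlines()
en_sample = [
    "new jersey is sometimes quiet during autumn , and it is snowy in april .\n",
    "the united states is usually chilly during july , and it is usually freezing in november .\n",
    "california is usually quiet during march , and it is usually hot in june .\n",
]

# Collect the unique tokens, then assign each one a stable integer ID
vocab = sorted({w for sent in en_sample for w in sent.strip().split()})
word2index = {w: i for i, w in enumerate(vocab)}
# Inverse mapping, needed to turn predicted IDs back into words
index2word = {i: w for w, i in word2index.items()}

print("vocab size =", len(word2index))
print([word2index[w] for w in en_sample[0].strip().split()])
```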


<br>

**Implementing the Encoder with `Keras`**  

Input Layer:  

    en_inputs = Input( shape=(en_len, en_vocab))
    
GRU Layer:  

    en_gru = GRU( hsize, return_state=True )
    en_out, en_state = en_gru( en_inputs )
    
`Keras` Model:  

    encoder = Model( inputs=en_inputs, outputs=en_state )
    print( encoder.summary() )
    
<br>

In [37]:
# defining the Encoder

import tensorflow.keras as keras

en_len = 15
en_vocab = 150
hsize = 48

# Define an input layer
en_inputs = keras.layers.Input(shape=(en_len, en_vocab))
# Define a GRU layer which returns the state
en_gru = keras.layers.GRU(hsize, return_state = True)
# Get the output and state from the GRU
en_out, en_state = en_gru(en_inputs)
# Define and print the model summary
encoder = keras.models.Model(inputs=en_inputs, outputs=en_state)
print(encoder.summary() )

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 15, 150)]         0         
_________________________________________________________________
gru_5 (GRU)                  [(None, 48), (None, 48)]  28800     
Total params: 28,800
Trainable params: 28,800
Non-trainable params: 0
_________________________________________________________________
None


<br>

### Implementing the Decoder

**Encoder-Decoder Model**  

* Encoder consumes the English words one-by-one
* Finally produces the context vector
* Decoder takes the context vector as the initial state
* Decoder produces French words one-by-one
* Decoder is implemented using a `Keras` GRU layer. GRU requires two inputs:
    1. a time series input
    2. a hidden state

How to produce the time series input for the GRU layer?  

1. repeat the context vector from the encoder N times
    * e.g. to produce a French sentence of 10 words, you repeat the context vector 10 times. 
    
Understanding the `RepeatVector` layer:  

* takes one argument which defines the sequence length of the required output
* takes in an input of (batch_size, input_size)
* output data will have the shape ( batch_size, sequence_length, input_size )

**Defining a `RepeatVector` layer**  

    from tensorflow.keras.layers import RepeatVector
    rep = RepeatVector( 5 )
    
    r_inp = Input( shape=( 3, ) )
    r_out = rep( r_inp )
    
    repeat_model = Model( inputs= r_inp, outputs = r_out )
    
**Predicting with the Model**  

    x = np.array( [ [0,1,2], [3,4,5] ] )
    y = repeat_model.predict( x )
    print( 'x.shape = ', x.shape, '\ny.shape = ', y.shape )
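The snippet above is not executed here. As a rough guide to what it should produce, this NumPy analogue repeats each input row along a new time axis; `repeat_vector` is a hypothetical helper written for illustration, not the Keras layer itself.

```python
import numpy as np

def repeat_vector(x, n):
    # (batch_size, input_size) -> (batch_size, n, input_size)
    return np.repeat(x[:, np.newaxis, :], n, axis=1)

x = np.array([[0, 1, 2], [3, 4, 5]])
y = repeat_vector(x, 5)
print('x.shape = ', x.shape, '\ny.shape = ', y.shape)
# x.shape =  (2, 3)
# y.shape =  (2, 5, 3)
```

Every one of the 5 time steps holds an identical copy of the input row, which is how the decoder receives the context vector at each step.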

**Implementing the Decoder**  

    de_inputs = RepeatVector( fr_len )( en_state )
    decoder_gru = GRU( hsize, return_sequences=True )
    gru_outputs = decoder_gru( de_inputs, initial_state=en_state )

**Defining the Model**  

    enc_dec = Model( inputs= en_inputs, outputs = gru_outputs )
    
<br>

In [38]:
# explore how the RepeatVector layer works
from tensorflow.keras.layers import Input, RepeatVector
from tensorflow.keras.models import Model
import numpy as np

inp = Input(shape=(2,))
# Define a RepeatVector that repeats the input 6 times
rep = RepeatVector(6)(inp)
# Define a model
model = Model(inputs=inp, outputs=rep)
# Define input x
x = np.array([[0,1], [2,3]])
# Get model prediction y
y = model.predict( x )
print('x.shape = ',x.shape,'\ny.shape = ',y.shape)


x.shape =  (2, 2) 
y.shape =  (2, 6, 2)


In [40]:
# implement the decoder and define an end-to-end model going from encoder inputs to the decoder GRU outputs. 

hsize = 48
fr_len = 20
# Define a RepeatVector layer
de_inputs = RepeatVector(fr_len)(en_state)
# Define a GRU model that returns all outputs
decoder_gru = keras.layers.GRU(hsize, return_sequences=True)
# Get the outputs of the decoder
gru_outputs = decoder_gru(de_inputs, initial_state=en_state)
# Define a model with the correct inputs and outputs
enc_dec = Model(inputs=en_inputs, outputs=gru_outputs)
enc_dec.summary()

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 15, 150)]    0                                            
__________________________________________________________________________________________________
gru_5 (GRU)                     [(None, 48), (None,  28800       input_4[0][0]                    
__________________________________________________________________________________________________
repeat_vector_2 (RepeatVector)  (None, 20, 48)       0           gru_5[0][1]                      
__________________________________________________________________________________________________
gru_6 (GRU)                     (None, 20, 48)       14112       repeat_vector_2[0][0]            
                                                                 gru_5[0][1]                
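The parameter counts in the summaries (28,800 for the encoder GRU and 14,112 for the decoder GRU) can be reproduced by hand. The sketch below assumes the `tf.keras` default `reset_after=True`, under which each of the three gate blocks has an input kernel, a recurrent kernel, and two bias vectors. `gru_param_count` is a hypothetical helper.

```python
def gru_param_count(input_dim, units, reset_after=True):
    """Number of trainable weights in a GRU layer (3 gate blocks)."""
    # reset_after=True keeps separate input and recurrent biases
    n_bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + n_bias)

# Encoder: one-hot input of size 150, hidden size 48
print(gru_param_count(150, 48))  # 28800
# Decoder: its input is the repeated 48-dimensional context vector
print(gru_param_count(48, 48))   # 14112
```

The decoder is smaller only because its input is the 48-dimensional state rather than a 150-dimensional one-hot vector.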

<br>

### Dense and TimeDistributed Layers 

Introduction to the **Dense Layer** - a dense layer implements a fully-connected layer of a neural network.  

* A Dense layer takes an input vector and converts it to a probabilistic prediction
    * y = Weights · x + Bias, followed by a softmax activation that turns the result into probabilities  

Defining a Dense layer with `Keras`:

    dense = keras.layers.Dense( vocab_size, activation = 'softmax' )
    inp = Input( shape=( vocab_size, ) )
    pred = dense( inp )
    model = Model( inputs=inp, outputs=pred )
    
Defining a Dense layer with custom initialization:

    from tensorflow.keras.initializers import RandomNormal
    init = RandomNormal( mean = 0.0, stddev = 0.05, seed = 6000 )
    dense = Dense( vocab_size, activation='softmax', kernel_initializer=init, bias_initializer=init )
    
Inputs and outputs of a Dense Layer:  

* Dense softmax layer
    * takes a (batch_size, input_size) array
    * produces a ( batch_size, num_classes ) array
    * output for each sample is a probability distribution over the classes which sums to 1
    * you can get the class of each sample using `np.argmax(y, axis=-1)`
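The points above can be sketched in plain NumPy. The weights here are random draws, so the numbers will not match any Keras output in this document; only the shapes and the sum-to-one property are the point.

```python
import numpy as np

def softmax(a):
    # Subtract the row maximum for numerical stability
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6000)
input_size, num_classes = 3, 3
W = rng.normal(0.0, 0.05, size=(input_size, num_classes))  # weights
b = rng.normal(0.0, 0.05, size=num_classes)                # bias

x = np.array([[5, 0, 1], [0, 3, 1], [2, 2, 1]], dtype=float)  # (batch_size, input_size)
y = softmax(x @ W + b)                                         # (batch_size, num_classes)
classes = np.argmax(y, axis=-1)                                # most probable class per sample

print(y.shape, y.sum(axis=-1), classes.shape)
```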
    
Use a `TimeDistributed` layer as a wrapper for a `Dense` layer  

    dense_time = TimeDistributed( Dense( vocab_size, activation='softmax' ) )
    inp = Input( shape = ( sequence_len, input_size ) )
    pred = dense_time( inp )
    model = Model( inputs=inp, outputs=pred )
    
A `TimeDistributed` layer takes a (batch_size, sequence_len, input_size) array $\longrightarrow$ produces a ( batch_size, sequence_len, num_classes ) array  

You can get the class of each sample at each time step using `np.argmax( y, axis=-1 )`
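Conceptually, wrapping a `Dense` layer in `TimeDistributed` applies the same weights independently at every time step. The NumPy sketch below, with random weights and hypothetical sizes, checks that applying the layer step by step gives the same result as applying it to the whole 3-D array at once.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch_size, sequence_len, input_size, num_classes = 3, 2, 3, 3
W = rng.normal(0.0, 0.05, size=(input_size, num_classes))
b = np.zeros(num_classes)
x = rng.normal(size=(batch_size, sequence_len, input_size))

# All time steps at once: matmul broadcasts over the leading dimensions
y_all = softmax(x @ W + b)

# One time step at a time, reusing the same W and b
y_steps = np.stack(
    [softmax(x[:, t, :] @ W + b) for t in range(sequence_len)], axis=1
)

classes = np.argmax(y_all, axis=-1)   # shape (batch_size, sequence_len)
print(y_all.shape, classes.shape, np.allclose(y_all, y_steps))  # (3, 2, 3) (3, 2) True
```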

Iterating through time-distributed data:

    for t in range( sequence_len ):
        for prob, c in zip( y[:,t,:], classes[:,t]):
            print( "Prob: ", prob, ", Class: ", c )
            
<br>

In [45]:
init = keras.initializers.RandomNormal( mean = 0.0, stddev = 0.05, seed = 6000 )
# Define an input layer with batch size 3 and input size 3
inp = Input(batch_shape = (3,3))
# Get the output of the 3 node Dense layer
pred = keras.layers.Dense(3, activation='softmax', kernel_initializer=init, bias_initializer=init)(inp)
model = Model(inputs=inp, outputs=pred)

names = ["Mark", "John", "Kelly"]
prizes = ["Gift voucher", "Car", "Nothing"]
x = np.array([[5, 0, 1], [0, 3, 1], [2, 2, 1]])
# Compute the model prediction for x
y = model.predict(x)
# Get the most probable class for each sample
classes = np.argmax(y, axis=-1)
print("\n".join(["{} has probabilities {} and wins {}".format(n,p,prizes[c]) \
                 for n,p,c in zip(names, y, classes)]))

Mark has probabilities [0.3929537  0.37995604 0.22709025] and wins Gift voucher
John has probabilities [0.33233336 0.34169823 0.32596847] and wins Car
Kelly has probabilities [0.35587627 0.35802534 0.28609842] and wins Car


In [53]:
names = [['Mark', 'John', 'Kelly'], ['Jenny', 'Shan', 'Sarah']]
x = np.array([[[5, 0, 1],[1, 1, 0]],
           [[0, 3, 1],[0, 4, 0]],
           [[2, 2, 1],[6, 0, 1]]])
# Print names and x
print('names=\n',names, '\nx=\n',x, '\nx.shape=', x.shape)
inp = Input(shape=(2, 3))
# Create the TimeDistributed layer (the output of the Dense layer)
dense_time = keras.layers.TimeDistributed(keras.layers.Dense(3, activation='softmax', kernel_initializer=init, bias_initializer=init))
pred = dense_time(inp)
model = Model(inputs=inp, outputs=pred)

y = model.predict(x)
# Get the most probable class for each sample
classes = np.argmax(y, axis=-1)
for t in range(2):
  # Get the t-th time-dimension slice of y and classes
  for n, p, c in zip(names[t], y[:, t, :], classes[:, t]):
    print("Game {}: {} has probs {} and wins {}\n".format(t+1,n,p,prizes[c]))

names=
 [['Mark', 'John', 'Kelly'], ['Jenny', 'Shan', 'Sarah']] 
x=
 [[[5 0 1]
  [1 1 0]]

 [[0 3 1]
  [0 4 0]]

 [[2 2 1]
  [6 0 1]]] 
x.shape= (3, 2, 3)
Game 1: Mark has probs [0.3929537  0.37995604 0.22709025] and wins Gift voucher

Game 1: John has probs [0.33233336 0.34169823 0.32596847] and wins Car

Game 1: Kelly has probs [0.35587627 0.35802534 0.28609842] and wins Car

Game 2: Jenny has probs [0.34050465 0.3426381  0.31685725] and wins Car

Game 2: Shan has probs [0.3069249  0.32335538 0.36971974] and wins Nothing

Game 2: Sarah has probs [0.3994818  0.38477215 0.21574609] and wins Gift voucher



<br>

### Implementing the Full Encoder/Decoder Model

The decoder still needs a top part that turns each GRU output into a prediction over the French vocabulary.  
Implement this with a `Dense` layer wrapped in a `TimeDistributed` layer.

![](encoder_decoder.png)  

Implementing the full model:  

    # The softmax prediction layer
    de_dense = keras.layers.Dense( fr_vocab_size, activation='softmax' )
    de_dense_time = keras.layers.TimeDistributed( de_dense )
    de_pred = de_dense_time( gru_outputs )
    
    # Defining the full model
    nmt = keras.models.Model( inputs = en_inputs, outputs = de_pred )
    
    # Compiling the model
    nmt.compile( optimizer='adam', loss='categorical_crossentropy', metrics=['acc'] )
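To give a feel for what the `categorical_crossentropy` loss and the `acc` metric measure on sequence predictions, here is a minimal NumPy sketch on made-up values. It is an illustration of the definitions, not the Keras implementation, and the clipping constant `eps` is an assumption.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0), then take -log of the probability of the true class
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One sentence, two time steps, vocabulary of three words
y_true = np.array([[[1, 0, 0], [0, 1, 0]]], dtype=float)     # one-hot targets
y_pred = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]])      # softmax outputs

loss = categorical_crossentropy(y_true, y_pred)              # one value per time step
acc = np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))
print(loss.shape, loss.mean(), acc)
```

The loss shrinks as the probability assigned to the correct word approaches 1, while accuracy only checks whether the most probable word is the correct one.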
    
<br>

In [56]:
# Import Dense and TimeDistributed layers
from tensorflow.keras.layers import Dense, TimeDistributed
# Define a softmax dense layer that has fr_vocab outputs
de_dense = Dense(fr_vocab_size, activation='softmax')
# Wrap the dense layer in a TimeDistributed layer
de_dense_time = TimeDistributed(de_dense)
# Get the final prediction of the model
de_pred = de_dense_time(gru_outputs)
print("Prediction shape: ", de_pred.shape)

Prediction shape:  (None, 20, 357)


In [57]:
from tensorflow.keras.models import Model
# Define a model with encoder input and decoder output
nmt = Model(inputs=en_inputs, outputs=de_pred)

# Compile the model with an optimizer and a loss
nmt.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

# View the summary of the model 
nmt.summary()

Model: "model_10"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 15, 150)]    0                                            
__________________________________________________________________________________________________
gru_5 (GRU)                     [(None, 48), (None,  28800       input_4[0][0]                    
__________________________________________________________________________________________________
repeat_vector_2 (RepeatVector)  (None, 20, 48)       0           gru_5[0][1]                      
__________________________________________________________________________________________________
gru_6 (GRU)                     (None, 20, 48)       14112       repeat_vector_2[0][0]            
                                                                 gru_5[0][1]               