<a href="https://colab.research.google.com/github/HoangLong1907/DeepLearning-CS431/blob/main/Baitap/Neural_Machine_Translation(_Eng_Vie).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Machine Translation (NMT) - Translating English sentences to Vietnamese sentences

Neural machine translation refers to translating phrases across languages using deep learning, specifically recurrent neural networks ( RNNs ). Most real systems are complex, that is, they combine several algorithms. At its core, however, NMT uses sequence-to-sequence ( seq2seq ) RNN cells. Such models can work at the character level, but word-level models remain common.

![NMT system](https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s1600/nmt-model-fast.gif)

I recommend switching to a GPU runtime so that training is faster.

## What are we going to do?
We will build an encoder-decoder LSTM model using the [Keras Functional API](https://www.tensorflow.org/alpha/guide/keras/functional) ( with [TensorFlow](https://www.tensorflow.org/) ). The motivating example below is [Marathi](https://en.wikipedia.org/wiki/Marathi_language) ( a language native to India ), although the code cells in this notebook train on English-Vietnamese pairs. But, why Marathi?


*   It has special characters and is quite complex.
*   It uses a completely different script ( Devanagari ), with no pretrained word embeddings available yet.

Here's an example,

the cat sleeps among the dogs  ->  मांजर कुत्रींमध्ये झोपतात

So, let's get started.



## Preparing the Data

### 1) Importing the libraries

We import NumPy, Pandas, TensorFlow and Keras. From Keras, we import the modules that help us build NN layers, preprocess data and construct LSTM models.

In [None]:

%tensorflow_version 2.x

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers , activations , models , preprocessing , utils
import pandas as pd

tf.compat.v1.logging.set_verbosity( tf.compat.v1.logging.ERROR ) # Just to remove warnings!

print( tf.__version__ )


2.5.0


### 2) Reading the data


Our dataset comes from http://www.manythings.org/anki/, which hosts 50+ sets of bilingual sentence pairs. The English-Marathi set contains more than 30K pairs; the code below instead downloads the English-Vietnamese set ( `vie-eng.zip` ), which yields 7,547 pairs in this run. We unzip the archive and read it using [Pandas](https://pandas.pydata.org/).

In [None]:

!wget http://www.manythings.org/anki/vie-eng.zip -O vie-eng.zip
!unzip vie-eng.zip


--2021-07-05 09:29:33--  http://www.manythings.org/anki/vie-eng.zip
Resolving www.manythings.org (www.manythings.org)... 172.67.173.198, 104.21.55.222, 2606:4700:3031::6815:37de, ...
Connecting to www.manythings.org (www.manythings.org)|172.67.173.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 298429 (291K) [application/zip]
Saving to: ‘vie-eng.zip’


2021-07-05 09:29:34 (1.89 MB/s) - ‘vie-eng.zip’ saved [298429/298429]

Archive:  vie-eng.zip
replace _about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: _about.txt              
replace vie.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: vie.txt                 


In [None]:

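# The file appears to have three tab-separated columns; with only two names given,
# pandas uses the first column ( the English text ) as the index.
lines = pd.read_table( 'vie.txt' , names=[ 'eng' , 'vie' ] )
# Move the English text back into a regular column and relabel the columns.
lines.reset_index( level=0 , inplace=True )
lines.rename( columns={ 'index' : 'eng' , 'eng' : 'vie' , 'vie' : 'c' } , inplace=True )
# Drop the leftover third column ( the positional axis argument was removed in pandas 2.x ).
lines = lines.drop( columns='c' )
#lines = lines.iloc[ 10000 : 20000 ] 
lines.head()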


|   | eng | vie |
|---|---|---|
| 0 | Run! | Chạy! |
| 1 | Help! | Giúp tôi với! |
| 2 | Go on. | Tiếp tục đi. |
| 3 | Hello! | Chào bạn. |
| 4 | Hurry! | Nhanh lên nào! |
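
The index juggling in the cell above seems to be needed because the file carries a third tab-separated column in addition to the two sentence columns. As a rough alternative sketch, the columns can be named explicitly and only the sentence pair kept. The sample text is made up, and the column name `attribution` is an assumption for illustration, not something taken from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the assumed three-column, tab-separated layout.
sample = "Run!\tChạy!\tattribution text 1\nHelp!\tGiúp tôi với!\tattribution text 2\n"

# Name all three columns, then keep only the sentence pair.
pairs = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["eng", "vie", "attribution"],  # `attribution` is a hypothetical name
    usecols=["eng", "vie"],
)
print(pairs)
```

With the real file, `io.StringIO(sample)` would be replaced by the path `'vie.txt'`, provided the three-column assumption holds.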


### 3) Preparing input data for the Encoder ( `encoder_input_data` )
The Encoder model is fed preprocessed English sentences. The preprocessing is done as follows :


1.   Tokenizing the English sentences from `eng_lines`.
2.   Determining the maximum length of an English sentence, that is `max_input_length`.
3.   Padding the `tokenized_eng_lines` to `max_input_length`.
4.   Determining the vocabulary size ( `num_eng_tokens` ) for English words.





In [None]:
# Collect the English sentences as a plain Python list.
eng_lines = list( lines.eng )

# Fit a word-level tokenizer and map each sentence to a list of integer ids.
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( eng_lines ) 
tokenized_eng_lines = tokenizer.texts_to_sequences( eng_lines ) 

# The longest sentence determines the padded width.
max_input_length = max( len( token_seq ) for token_seq in tokenized_eng_lines )
print( 'English max length is {}'.format( max_input_length ))

# Pad with zeros on the right so that every row has the same length.
padded_eng_lines = preprocessing.sequence.pad_sequences( tokenized_eng_lines , maxlen=max_input_length , padding='post' )
encoder_input_data = np.array( padded_eng_lines )
print( 'Encoder input data shape -> {}'.format( encoder_input_data.shape ))

# Word ids start at 1; id 0 is reserved for padding, hence the +1.
eng_word_dict = tokenizer.word_index
num_eng_tokens = len( eng_word_dict )+1
print( 'Number of English tokens = {}'.format( num_eng_tokens))


English max length is 32
Encoder input data shape -> (7547, 32)
Number of English tokens = 3712
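
To make the four preprocessing steps concrete, here is a small pure-Python sketch of the same idea on two toy sentences. The helpers `fit_word_index` and `encode_and_pad` are hypothetical simplified stand-ins for the Keras `Tokenizer` and `pad_sequences`, and are not guaranteed to match their behaviour in every detail.

```python
from collections import Counter

def fit_word_index(sentences):
    """Map each word to an integer id, most frequent first; id 0 is reserved for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by count, so frequent words get small ids.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode_and_pad(sentences, word_index):
    """Convert sentences to id lists and right-pad them with zeros to a common length."""
    seqs = [[word_index[w] for w in s.lower().split()] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    # 'post' padding: zeros are appended on the right.
    padded = [seq + [0] * (max_len - len(seq)) for seq in seqs]
    return padded, max_len

toy_sentences = ["go on", "go home now"]
toy_index = fit_word_index(toy_sentences)
toy_padded, toy_max_len = encode_and_pad(toy_sentences, toy_index)
toy_vocab_size = len(toy_index) + 1  # +1 for the padding id 0
print(toy_index, toy_padded, toy_max_len, toy_vocab_size)
```

The `+1` in the vocabulary size mirrors `num_eng_tokens` above: the Embedding layer needs a row for the padding id as well.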


### 4) Preparing input data for the Decoder ( `decoder_input_data` )
The Decoder model is fed the preprocessed Vietnamese lines. The preprocessing steps are similar to the ones above, with one extra step carried out first :


*   Prepend the `<START>` tag to each Vietnamese sentence.
*   Append the `<END>` tag to each Vietnamese sentence.





In [None]:

# Wrap every Vietnamese sentence in start and end markers.
vie_lines = list()
for line in lines.vie:
    vie_lines.append( '<START> ' + line + ' <END>' )  

# A fresh tokenizer is fitted on the target language.
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( vie_lines ) 
tokenized_vie_lines = tokenizer.texts_to_sequences( vie_lines ) 

length_list = list()
for token_seq in tokenized_vie_lines:
    length_list.append( len( token_seq ))
max_output_length = np.array( length_list ).max()
print( 'Vietnamese max length is {}'.format( max_output_length ))

padded_vie_lines = preprocessing.sequence.pad_sequences( tokenized_vie_lines , maxlen=max_output_length, padding='post' )
decoder_input_data = np.array( padded_vie_lines )
print( 'Decoder input data shape -> {}'.format( decoder_input_data.shape ))

vie_word_dict = tokenizer.word_index
num_vie_tokens = len( vie_word_dict )+1
print( 'Number of Vietnamese tokens = {}'.format( num_vie_tokens))


Vietnamese max length is 43
Decoder input data shape -> (7547, 43)
Number of Vietnamese tokens = 2364
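
One detail worth noting: with its default settings, the Keras `Tokenizer` lowercases text and filters punctuation, including `<` and `>`, so the tags end up in the vocabulary as `start` and `end`. The sketch below mimics that behaviour in plain Python. The filter string is assumed to match the Keras default, and `simple_word_sequence` is a hypothetical helper, not a Keras function.

```python
# Assumed copy of the default `filters` argument of the Keras Tokenizer.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

tagged = "<START> Chào bạn. <END>"
toy_words = simple_word_sequence(tagged)
print(toy_words)
```

This is why the inference code later in the notebook looks up `'start'` and `'end'` rather than `'<START>'` and `'<END>'`.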


### 5) Preparing target data for the Decoder ( `decoder_target_data` )

We take a copy of `tokenized_vie_lines` and modify it like this.



1.   We remove the `<start>` tag which we prepended earlier, that is, the first token of every sequence. The target is therefore the decoder input shifted one step to the left.
2.   We pad the shifted sequences ( `padded_vie_lines` ) and convert them to one-hot vectors.

For example :

```
 [ '<start>' , 'hello' , 'world' , '<end>' ]

```

will become 

```
 [ 'hello' , 'world' , '<end>' ]

```


In [None]:

decoder_target_data = list()
for token_seq in tokenized_vie_lines:
    decoder_target_data.append( token_seq[ 1 : ] ) 
    
padded_vie_lines = preprocessing.sequence.pad_sequences( decoder_target_data , maxlen=max_output_length, padding='post' )
onehot_vie_lines = utils.to_categorical( padded_vie_lines , num_vie_tokens )
decoder_target_data = np.array( onehot_vie_lines )
print( 'Decoder target data shape -> {}'.format( decoder_target_data.shape ))


Decoder target data shape -> (7547, 43, 2364)
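
The shift-and-one-hot step can be sketched with NumPy alone on a single toy sequence. The ids and the vocabulary size here are made up for illustration, and `np.eye` indexing is used as a stand-in for `utils.to_categorical`.

```python
import numpy as np

# Toy padded decoder input: ids for [start, hello, world, end], then one pad (0).
toy_decoder_input = np.array([[1, 2, 3, 4, 0]])
toy_vocab_size = 5

# Drop the first token and pad on the right to keep the same length.
toy_target_ids = np.concatenate(
    [toy_decoder_input[:, 1:], np.zeros((1, 1), dtype=int)], axis=1
)

# One-hot encode: row i of the identity matrix is the one-hot vector for id i.
toy_target_onehot = np.eye(toy_vocab_size, dtype="float32")[toy_target_ids]

print(toy_target_ids)           # ids shifted left by one step
print(toy_target_onehot.shape)  # (batch, time, vocab)
```

At the sizes printed above, the one-hot target holds 7547 × 43 × 2364 ≈ 767 million values, which is roughly 3 GB if stored as 32-bit floats. If memory becomes a problem, one option is to keep integer targets and use the Keras `sparse_categorical_crossentropy` loss instead, though that is not what this notebook does.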


## Defining and Training the models

### 1) Defining the Encoder-Decoder model
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : For converting token ids to fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here, so that the padding id 0 is masked out )**
*   LSTM layer : Provides the Long Short-Term Memory cells.
*   Dense layer : Maps each decoder output to a softmax distribution over the target vocabulary.

Working : 

1.   The `encoder_input_data` goes into the Embedding layer (  `encoder_embedding` ). 
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together are the `encoder_states` ).
3.   These states are used as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through its own Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( which holds the encoder states ) to produce output sequences.
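
To see where the two state vectors `h` and `c` come from, here is a minimal NumPy sketch of a generic LSTM cell run over a toy input sequence. The weights are random, the sizes are tiny, and the gate layout is a textbook one; it is not claimed to match how Keras stores its weights internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=-1)               # input, forget, candidate, output
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # new cell state
    h = sigmoid(o) * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, units, steps = 4, 3, 5
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

# Encoder pass: run over a toy embedded sentence and keep only the final states.
toy_h = np.zeros((1, units))
toy_c = np.zeros((1, units))
for x_t in rng.normal(size=(steps, 1, input_dim)):
    toy_h, toy_c = lstm_step(x_t, toy_h, toy_c, W, U, b)

toy_states = [toy_h, toy_c]  # these would seed the decoder's LSTM
print(toy_h.shape, toy_c.shape)
```

In the model below, `return_state=True` is what makes the Keras LSTM layer hand back these final `h` and `c` values, and `initial_state=encoder_states` is what passes them to the decoder.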









In [None]:

encoder_inputs = tf.keras.layers.Input(shape=( None , ))
encoder_embedding = tf.keras.layers.Embedding( num_eng_tokens, 256 , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 128 , return_state=True  )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
decoder_embedding = tf.keras.layers.Embedding( num_vie_tokens, 256 , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 128 , return_state=True , return_sequences=True)
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( num_vie_tokens , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 256)    950272      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 256)    605184      input_2[0][0]                    
______________________________________________________________________________________________

### 2) Training the model
We train the model for 50 epochs with a batch size of 250, using the RMSprop optimizer and the categorical cross-entropy loss function.

In [None]:

model.fit([encoder_input_data , decoder_input_data], decoder_target_data, batch_size=250, epochs=50 ) 
model.save( 'model.h5' ) 


Epoch 1/50
...
Epoch 50/50
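
For reference, the categorical cross-entropy that the model minimizes can be worked out by hand on a tiny example. This NumPy sketch uses made-up predictions and ignores any masking of padded positions, so it only illustrates the formula and is not a reimplementation of the Keras loss.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over all positions of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

# One sentence, two time steps, vocabulary of three tokens.
toy_true = np.array([[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])
toy_pred = np.array([[[0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]])

toy_loss = categorical_crossentropy(toy_true, toy_pred)
print(toy_loss)
```

Because the target is one-hot, only the predicted probability of the correct word contributes at each time step: here the loss is the average of `-log(0.8)` and `-log(0.5)`.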


## Inferencing on the models

### 1) Defining inference models
We create inference models which help in predicting translations.

**Encoder inference model** : Takes the English sentence as input and outputs the LSTM states ( `h` and `c` ).

**Decoder inference model** : Takes 2 inputs: the LSTM states ( initially the output of the encoder model ) and the Vietnamese input sequence ( at first just the `start` token, afterwards the most recently predicted token ). It outputs a probability distribution over the next Vietnamese word for the English sentence which we fed to the encoder model, together with its updated state values.





In [None]:

def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    decoder_state_input_h = tf.keras.layers.Input(shape=( 128 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 128 ,))
    
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model


### 2) Making some translations


1.   First, we take an English sequence and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains only the `<start>` element.
4.   We input this sequence in the `dec_model`.
5.   We replace the `<start>` element with the element which was predicted by the `dec_model` and update the state values.
6.   We carry out the above steps iteratively till we hit the `<end>` tag or the maximum sequence length.
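
The loop described in these steps is usually called greedy decoding. Here is a small self-contained sketch of it, where `toy_step` and the score table are hypothetical stand-ins for `dec_model.predict`, and the token ids are made up.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_len):
    """Feed the previous prediction back in until `end_id` or `max_len` is reached."""
    token, state, output = start_id, initial_state, []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        # Greedy choice: take the highest-scoring token id.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output

# Hypothetical table of next-token scores, indexed by the previous token.
# ids: 0 = pad, 1 = start, 2 = end, 3 and 4 = ordinary words.
TOY_SCORES = {
    1: [0.0, 0.0, 0.1, 0.7, 0.2],    # after start, word 3 scores highest
    3: [0.0, 0.0, 0.2, 0.1, 0.7],    # after word 3, word 4 scores highest
    4: [0.0, 0.0, 0.9, 0.05, 0.05],  # after word 4, end scores highest
}

def toy_step(token, state):
    # The state is just a step counter here; a real decoder returns new h and c.
    return TOY_SCORES[token], state + 1

toy_translation = greedy_decode(toy_step, initial_state=0, start_id=1, end_id=2, max_len=10)
print(toy_translation)
```

The `max_len` guard plays the same role as `max_output_length` in the notebook: it stops the loop if the model never predicts the end token.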







In [None]:

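def str_to_tokens( sentence : str ):
    # Apply the same lowercasing and punctuation filtering as the Tokenizer defaults.
    words = preprocessing.text.text_to_word_sequence( sentence )
    tokens_list = list()
    for word in words:
        # Skip words that are not in the training vocabulary instead of raising a KeyError.
        if word in eng_word_dict:
            tokens_list.append( eng_word_dict[ word ] ) 
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=max_input_length , padding='post')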


In [None]:

enc_model , dec_model = make_inference_models()

# Reverse lookup table: token id -> Vietnamese word.
vie_index_word = { index : word for word , index in vie_word_dict.items() }

# As written, this loop prompts once per training sentence; interrupt the cell to stop early.
for _ in range( encoder_input_data.shape[0] ):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter eng sentence : ' ) ) )
    # Start decoding from the 'start' token ( the tokenizer strips the angle brackets ).
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = vie_word_dict['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        # Greedy decoding: pick the most probable next word.
        sampled_word_index = int( np.argmax( dec_outputs[0, -1, :] ) )
        sampled_word = vie_index_word.get( sampled_word_index )
        if sampled_word is not None :
            decoded_translation += ' {}'.format( sampled_word )
        
        if sampled_word == 'end' or len(decoded_translation.split()) > max_output_length:
            stop_condition = True
            
        # Feed the prediction and the updated states back into the decoder.
        empty_target_seq = np.zeros( ( 1 , 1 ) )  
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ] 

    print( decoded_translation )


 xin lỗi end
 bạn đang đi end
 tôi thích bạn end
 bạn thích cái này end
 tôi rất rất nhiều end
 hãy đóng cửa end
 hãy đi một nhà end
 đó là một người không end
