# Sequence to Sequence Models:

## Introduction

Sequence-to-sequence (Seq2Seq) is one of the important neural network architectures. It is used for tasks such as machine translation, text generation, and chatbots.

The architecture is called Seq2Seq because it converts one sequence into another.

Let's take machine translation as an example: we feed an Arabic sentence into the network and obtain an English sentence at the output. In other words, the Seq2Seq model produces an English sentence from an Arabic one, even though Arabic script is completely different from the English alphabet. So how does Seq2Seq know which English sentence corresponds to the Arabic one?

Good question! To answer it, we need to look at the Seq2Seq architecture.
    
## Seq2Seq Architecture

The Seq2Seq architecture consists of:
- Encoder, whose output is the context vector.
- Decoder.

Together, the encoder and decoder have the following shape:

![title](img/encoder_decoder_arch.png)



### Encoder



- Job:

This is the 1st component in the encoder-decoder architecture. It builds a representation of the input sequence and embeds the meaning of the input in the context vector.

- Architecture:

It can be built using an RNN, LSTM, GRU, or BiLSTM. The last hidden state of the encoder holds the context of the entire input sentence; in other words, the encoder tries to build an embedding for the entire input sentence.

- How does it work?

To picture how it works, look at the image below:

![title](img/encoder.png)


As shown in the image above, the embeddings of the input words are fed into the LSTM sequentially. Each cell passes its hidden state to the next one until we reach the last cell, whose hidden state gives us the context vector.

- Using Keras: 

To obtain the encoder's last hidden state in Keras, set `return_state=True` on the LSTM layer so that it returns its final hidden state (and cell state) along with its output.
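
As a rough illustration of the "last hidden state as context vector" idea, here is a minimal NumPy sketch of a plain tanh RNN encoder. It is not the Keras LSTM used later in this notebook; the weight names (`W_xh`, `W_hh`, `b_h`), the sizes, and the `encode` helper are assumptions made for this example only.

```python
import numpy as np

def encode(embedded_tokens, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over the sequence and return the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in embedded_tokens:          # one embedded token per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                             # plays the role of the context vector

rng = np.random.default_rng(0)
emb_dim, hidden_dim, seq_len = 4, 3, 5   # hypothetical sizes
W_xh = rng.normal(size=(hidden_dim, emb_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

sentence = rng.normal(size=(seq_len, emb_dim))  # stand-in for word embeddings
context_vector = encode(sentence, W_xh, W_hh, b_h)
print(context_vector.shape)  # (3,)
```

Whatever the sentence length, the result has a fixed size (`hidden_dim`), which is what lets the decoder start from a single vector.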

### Decoder

- Job:

In general, it decodes the context vector into the required output. In our example, it builds the English sentence from the Arabic words encoded in the context vector.

- Architecture:

It can also be built using an RNN, LSTM, or GRU.

- How does it work?

To understand the decoder well, we should realize that it works in two different ways depending on the stage. Any machine learning model goes through two stages: training, and inference (prediction). The decoder works a little differently in each.

#### Decoder in Training


![title](img/decoder_Training.png)


As shown in the image above, the decoder receives the context vector from the encoder together with the start token, and begins by predicting the 1st word from that context vector. The start token is the input to the 1st cell in the decoder. The 2nd LSTM cell, however, receives the correct token from the training data rather than the previous cell's output, which is what happens in inference (as we will see later); this technique is known as teacher forcing. The ground-truth token, together with the previous hidden state and cell state, helps the 2nd cell predict the 2nd word correctly. The 2nd cell then outputs its prediction and passes its hidden and cell state to the 3rd cell. This scenario is repeated until the decoder sees the end token or reaches the maximum length.
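
The training setup described above boils down to shifting the target sentence by one position. The sketch below shows that shift on a toy sentence; the helper name `teacher_forcing_pairs` is hypothetical, and the token strings are placeholders for whatever start and end tokens are used.

```python
START, END = "START_", "_END"

def teacher_forcing_pairs(target_words):
    """Return (decoder_inputs, decoder_targets) for one training sentence."""
    full = [START] + target_words + [END]
    decoder_inputs = full[:-1]   # what each decoder step is fed
    decoder_targets = full[1:]   # what each decoder step should predict
    return decoder_inputs, decoder_targets

inputs, targets = teacher_forcing_pairs(["how", "are", "you"])
for step, (fed, expected) in enumerate(zip(inputs, targets)):
    print(step, fed, "->", expected)
```

Each step is fed the correct previous word, not the model's own guess, so an early mistake does not derail the rest of the sentence during training.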


#### Decoder in inference or prediction

![title](img/decoder_inference.png)

As shown in the image above, the decoder receives the context vector from the encoder. Once the 1st cell sees the start token at its input, it predicts the 1st word from the received context vector. This word is then fed to the next LSTM cell as input, and together with the hidden state from the previous cell it helps decode the 2nd word. The decoder continues this way until it reaches max_length or produces the end token.
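
The inference loop can be sketched independently of any neural network. In the example below, `greedy_decode` and `toy_step` are hypothetical names: `toy_step` stands in for a trained decoder step and simply walks through a fixed reply, so only the control flow (feed the prediction back in, stop on the end token or at the maximum length) is meaningful.

```python
def greedy_decode(step_fn, context, start_tok="START_", end_tok="_END", max_length=10):
    """Feed each predicted token back in until end_tok or max_length is reached."""
    token, state, output = start_tok, context, []
    for _ in range(max_length):
        token, state = step_fn(token, state)
        if token == end_tok:
            break
        output.append(token)
    return output

# Toy stand-in for a trained decoder step: it just walks through a fixed reply.
def toy_step(prev_token, state):
    reply = ["how", "are", "you", "_END"]
    return reply[state], state + 1

print(greedy_decode(toy_step, context=0))  # ['how', 'are', 'you']
```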


## Machine Translation using Seq2Seq Modeling:

In [55]:
#pip install camel-tools



In [119]:
import pandas as pd
import numpy as np
import tensorflow as tf
from keras.models import Model
from keras.layers import Dense,LSTM,Embedding,Input
import io
from string import digits
import string
import tkseem as tk
# CAMeL Tools: word tokenizer, Maximum Likelihood Disambiguator and morphological tokenizer
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer


Obtaining the dataset for Arabic and English

In [2]:
def read_data(file):
    Arabic_data = []
    english_data = []
    # Read the file lines
    with open(file, 'r', encoding='utf-8') as f:
        file_data = f.readlines()
    # Each line has three tab-separated fields: English sentence, Arabic sentence,
    # and a third field that is not used here
    for line in file_data:
        english_sent, arabic_sent, _ = line.split('\t')
        Arabic_data.append(arabic_sent)
        english_data.append(english_sent)

    return english_data, Arabic_data


In [3]:
file_path = 'datasets/ara-eng/ara.txt'
english_data,  Arabic_data= read_data(file_path)

In [4]:
english_data[0:10]

['Hi.',
 'Run!',
 'Duck!',
 'Duck!',
 'Duck!',
 'Help!',
 'Jump!',
 'Stop!',
 'Stop!',
 'Wait!']

In [5]:
Arabic_data[50:100]

['إلى اللقاء',
 'إنتظر',
 'لقد أتى.',
 'هو يجري',
 'ساعدني!',
 'النجدة! ساعدني!',
 'ساعدوني',
 'انتظر.',
 'أنا موافق',
 'أنا حزين.',
 'أنا أيضاً.',
 'اخرس!',
 'اصمت!',
 'اسكت!',
 'أغلق فمك!',
 'أوقفه',
 'خذه',
 'أخبرني',
 'توم فاز.',
 'لقد ربح توم.',
 'استيقظ!',
 'أهلاً و سهلاً!',
 'مرحباً بك!',
 'اهلا وسهلا',
 'مرحبا!',
 'من فاز؟',
 'من الذي ربح؟',
 'لم لا؟',
 'لما لا؟',
 'لا فكرة لدي',
 'استمتع بوقتك.',
 'أسرعا.',
 'لقد نسيت.',
 'فهمتُهُ.',
 'فهمتُها.',
 'فَهمتُ ذلك.',
 'أستخدمه.',
 'سأدفع أنا.',
 'أنا مشغول.',
 'إنني مشغول.',
 'أشعر بالبرد.',
 'أنا حُرّ.',
 'أنا هنا',
 'لقد عدت إلى البيت',
 'أنا فقير.',
 'أنا ثري.',
 'هذا مؤلم',
 'انها جافه',
 'الجو حار',
 'إنه جديد']

In [6]:
len(Arabic_data)

12158

In [7]:
len(english_data)

12158

Both lists contain 12158 lines.

In [8]:
dataset = pd.DataFrame({'Arabic_input':Arabic_data, 'English_target':english_data})

In [9]:
dataset.head()

| | Arabic_input | English_target |
|---|---|---|
| 0 | مرحبًا. | Hi. |
| 1 | اركض! | Run! |
| 2 | اخفض رأسك! | Duck! |
| 3 | اخفضي رأسك! | Duck! |
| 4 | اخفضوا رؤوسكم! | Duck! |


### Text Preprocessing

#### Arabic Normalization
Arabic needs special preprocessing. Why?

Arabic has different characteristics, such as:
- A single Arabic word can correspond to a complete sentence in other languages, so it requires a special segmentation step that takes the rules of the Arabic language into account.
- Arabic has diacritics, which should be normalized.
- Some Arabic letters have more than one form, so they should be unified.

If you would like to know more about the characteristics of Arabic, please read the section "Arabic Challenges in the Context of NER" in the following paper:

https://thescipub.com/pdf/jcssp.2020.117.125.pdf
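
To illustrate what "normalizing diacritics" means at the character level, here is a small standard-library sketch. It is an alternative to the regex-based approach used in the next cell, not the notebook's method: it drops every character whose Unicode category is `Mn` (nonspacing mark), which covers the Arabic short-vowel and tanwin marks. The helper name `strip_combining_marks` is hypothetical.

```python
import unicodedata

def strip_combining_marks(text):
    """Drop every character whose Unicode category is Mn (nonspacing mark)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "marhaban" written with tanwin fath (U+064B) after the fourth letter
with_diacritics = "\u0645\u0631\u062d\u0628\u064b\u0627"
without_diacritics = strip_combining_marks(with_diacritics)
print(len(with_diacritics), len(without_diacritics))  # 6 5
```

Note that tatweel (U+0640) is not a combining mark, so it would still need an explicit rule, and precomposed letters such as أ are left untouched.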


In [10]:
# from the following repo: https://github.com/motazsaad/process-arabic-text/blob/master/clean_arabic_text.py
import re
import string
import sys
import argparse

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations

arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)


def normalize_arabic(text):
    #text = re.sub("[إأآا]", "ا", text)

    #text = re.sub("ى", "ي", text)
    #text = re.sub("ؤ", "ء", text)
    #text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

def remove_digits(text):
    text = re.sub(r"[1234567890١٢٣٤٥٦٧٨٩٠]+", "", text)
    return text

def remove_english_characters(text):
    text = re.sub(r'[a-zA-Z]+','',text)
    return text
    

def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text


def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)


In [11]:
def Arabic_normalization(Arabic_sentence_list):
    Arabic_data_list = []
    for item in Arabic_sentence_list:
        text = remove_english_characters(item)
        text = remove_digits(text)
        text = normalize_arabic(text)
        text = remove_diacritics(text)
        text = remove_punctuations(text)
        Arabic_data_list.append(text)
        
    return Arabic_data_list    
 

In [12]:
dataset['Arabic_input'] = Arabic_normalization(dataset.Arabic_input)

In [13]:
dataset['Arabic_input'][:10]

0            مرحبا
1             اركض
2        اخفض رأسك
3       اخفضي رأسك
4    اخفضوا رؤوسكم
5           النجده
6             اقفز
7               قف
8            توقف 
9            إنتظر
Name: Arabic_input, dtype: object

#### English Normalization:


In [14]:
def English_normalization(English_target_ls):
    English_data_list = []
    # Since we work on word level, if we normalize the text to lower case, this will reduce the vocabulary. 
    #It's easy to recover the case later. 
    English_data_list = English_target_ls.apply(lambda x: x.lower())

    # Clean up punctuations and digits. Such special chars are common to both domains, and can just be copied with no error.
    exclude = set(string.punctuation)
    English_data_list = English_data_list.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

    remove_digits = str.maketrans('', '', digits)
    English_data_list = English_data_list.apply(lambda x: x.translate(remove_digits))
    
    return English_data_list


In [15]:
dataset['English_target'] = English_normalization(dataset.English_target)

In [16]:
dataset.head()

| | Arabic_input | English_target |
|---|---|---|
| 0 | مرحبا | hi |
| 1 | اركض | run |
| 2 | اخفض رأسك | duck |
| 3 | اخفضي رأسك | duck |
| 4 | اخفضوا رؤوسكم | duck |


### Data preparation:
In this step we start converting the data for training.
- Adding the start and end tokens to the target language. Since English is our target, we add the start and end tokens to it. Let's see how to do this.

In [17]:
st_tok = 'START_'
end_tok = '_END'
def data_prep():
    dataset.English_target = dataset.English_target.apply(lambda x : st_tok + ' ' + x + ' ' + end_tok)

In [18]:
data_prep()

In [19]:
dataset.head()

| | Arabic_input | English_target |
|---|---|---|
| 0 | مرحبا | START_ hi _END |
| 1 | اركض | START_ run _END |
| 2 | اخفض رأسك | START_ duck _END |
| 3 | اخفضي رأسك | START_ duck _END |
| 4 | اخفضوا رؤوسكم | START_ duck _END |


- Tokenization
Arabic tokenization is completely different from English tokenization. English tokenization can rely on spaces, but this does not work for Arabic, since a single Arabic token can express what is a complete sentence in other languages.

Note: I used camel_tools here, which is not

#### Arabic Tokenization

In [20]:
from camel_tools.tokenizers.word import simple_word_tokenize

def tokenize_Arabic():
    # The tokenizer expects pre-tokenized text
    Arabic_input_ls = dataset.Arabic_input.apply(simple_word_tokenize)

    # Load a pretrained disambiguator to use with a tokenizer
    mle = MLEDisambiguator.pretrained('calima-msa-r13')

    # By specifying `split=True`, the morphological tokens are output as separate
    # strings.
    tokenizer = MorphologicalTokenizer(mle,scheme='d3tok', split=True)
    Arabic_input_ls = Arabic_input_ls.apply(tokenizer.tokenize)
    
    return Arabic_input_ls



In [21]:
def remove_plus(tokens):
    sentence_ls = []
    for token in tokens:
        if '+' in token:
            token_without_plus = token.replace('+','')
            sentence_ls.append(token_without_plus) 
        else:
            sentence_ls.append(token) 

            
    return sentence_ls
#Arabic_input_ls = Arabic_input_ls.apply(remove_plus)

In [54]:
tokenized_ds_copy = dataset.copy()
tokenized_ds_copy.head()

| | Arabic_input | English_target |
|---|---|---|
| 0 | مرحبا | START_ hi _END |
| 1 | اركض | START_ run _END |
| 2 | اخفض رأسك | START_ duck _END |
| 3 | اخفضي رأسك | START_ duck _END |
| 4 | اخفضوا رؤوسكم | START_ duck _END |


Note: The CAMeL tokenizer restored the hamza letter, but in a unified way for all words; that is, the same word cannot appear in two different spellings.

#### English tokenization
English is tokenized according to spaces.

In [63]:
def tok_split_word2word(data):
    return data.split(' ')
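
One detail of `split(' ')` is worth keeping in mind: it splits on every single space, so consecutive spaces (which may appear after punctuation or digits are stripped) produce empty-string tokens that end up in the vocabulary. The sketch below uses a made-up sentence to show the difference from `split()` with no argument.

```python
sentence = "START_ i am  here _END"   # note the double space

tokens_single_space = sentence.split(' ')
tokens_whitespace = sentence.split()

print(tokens_single_space)  # ['START_', 'i', 'am', '', 'here', '_END']
print(tokens_whitespace)    # ['START_', 'i', 'am', 'here', '_END']
```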


#### Tokenization

In [55]:

def data_stats(tokenized_ds_copy):
    #Obtain the tokenized words in Arabic
    tokenized_ds_copy['Arabic_input'] = tokenize_Arabic()
    # The tokenization output has + in the separated token, which should be removed
    tokenized_ds_copy['Arabic_input'] = tokenized_ds_copy.Arabic_input.apply(remove_plus)
    
    #create a set to hold all Arabic words uniquely.
    input_tokens=set()
    for item in tokenized_ds_copy.Arabic_input:
        for tok in item:
            input_tokens.add(tok)
    
    #Obtain the tokenized words in English dataset
    tokenized_ds_copy['English_target'] = tokenized_ds_copy.English_target.apply(tok_split_word2word)
    
    #create a set to hold all English words uniquely.
    target_tokens=set()
    for item in tokenized_ds_copy.English_target:
        for tok in item:
            target_tokens.add(tok)
        
    input_tokens = sorted(list(input_tokens))
    target_tokens = sorted(list(target_tokens))


    
    num_encoder_tokens = len(input_tokens)
    num_decoder_tokens = len(target_tokens)
    
    #To obtain the maximum number of tokens per sentence in the Arabic and English datasets.
    max_encoder_seq_length = np.max([len(l) for l in tokenized_ds_copy.Arabic_input])
    max_decoder_seq_length = np.max([len(l) for l in tokenized_ds_copy.English_target])

    return input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length



In [57]:
input_tokens, target_tokens, num_encoder_tokens, num_decoder_tokens, max_encoder_seq_length, max_decoder_seq_length  = data_stats(tokenized_ds_copy)

In [62]:
print('Number of samples:', len(dataset))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 12158
Number of unique input tokens: 7205
Number of unique output tokens: 4298
Max sequence length for inputs: 52
Max sequence length for outputs: 36


### Vectorization

In this step we build our vocab2int table, which maps words to their indices. Machine learning models cannot work directly with words, since a computer does not understand them, so the words must be converted into numbers.

Note that the pad and separator tokens should be taken into account when building vocab2int.
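
As a minimal sketch of the idea, the helper below (hypothetical name `build_vocab`) reserves the lowest ids for special tokens and assigns the remaining ids in sorted order. The word list is made up for illustration.

```python
def build_vocab(tokens, special_tokens=("PAD",)):
    """Map each token to an integer id, reserving the lowest ids for special tokens."""
    vocab2int = {tok: i for i, tok in enumerate(special_tokens)}
    for tok in sorted(set(tokens) - set(special_tokens)):
        vocab2int[tok] = len(vocab2int)
    int2vocab = {i: tok for tok, i in vocab2int.items()}
    return vocab2int, int2vocab

vocab2int, int2vocab = build_vocab(["run", "hi", "duck", "hi"])
ids = [vocab2int[t] for t in ["hi", "duck"]]
print(ids)                          # [2, 1]
print([int2vocab[i] for i in ids])  # ['hi', 'duck']
```

One design choice in this sketch: a special token that also occurs in the data keeps its reserved id, because it is excluded from the second loop.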

In [67]:
pad_tok = 'PAD'
sep_tok = ' '
special_tokens = [pad_tok, sep_tok, st_tok, end_tok] 

#Increase the number of tokens by the number of special tokens.
num_encoder_tokens += len(special_tokens)
num_decoder_tokens += len(special_tokens)


In [68]:
def vocab(input_tokens, target_tokens):
    input_token_index = {}
    target_token_index = {}
    for i,tok in enumerate(special_tokens):
        input_token_index[tok] = i
        target_token_index[tok] = i 

    offset = len(special_tokens)
    for i, tok in enumerate(input_tokens):
        input_token_index[tok] = i+offset

    for i, tok in enumerate(target_tokens):
        target_token_index[tok] = i+offset
   
    # Reverse-lookup token index to decode sequences back to something readable.
    reverse_input_tok_index = dict(
        (i, tok) for tok, i in input_token_index.items())
    reverse_target_tok_index = dict(
        (i, tok) for tok, i in target_token_index.items())
    return input_token_index, target_token_index, reverse_input_tok_index, reverse_target_tok_index

In [69]:
input_token_index, target_token_index, reverse_input_tok_index, reverse_target_tok_index = vocab(input_tokens, target_tokens)

In [87]:
input_token_index

{'PAD': 0,
 ' ': 1,
 'START_': 2,
 '_END': 3,
 'NOAN': 4,
 'آب': 5,
 'آباء': 6,
 'آبد': 7,
 'آبقو': 8,
 'آت': 9,
 'آتون': 10,
 'آتي': 11,
 'آثار': 12,
 'آخذ': 13,
 'آخر': 14,
 'آخرة': 15,
 'آخرعلي': 16,
 'آخرون': 17,
 'آخرين': 18,
 'آدم': 19,
 'آذار': 20,
 'آذان': 21,
 'آذيتم': 22,
 'آراء': 23,
 'آرية': 24,
 'آسفون': 25,
 'آسيا': 26,
 'آفاق': 27,
 'آكل': 28,
 'آلاف': 29,
 'آلام': 30,
 'آلة': 31,
 'آلن': 32,
 'آلي': 33,
 'آليا': 34,
 'آمال': 35,
 'آمل': 36,
 'آمن': 37,
 'آمنة': 38,
 'آن': 39,
 'آنا': 40,
 'آنذاك': 41,
 'آنس': 42,
 'آني': 43,
 'آية': 44,
 'أ': 45,
 'أأريتها': 46,
 'أأشتري': 47,
 'أأنت': 48,
 'أؤجر': 49,
 'أؤذي': 50,
 'أؤكد': 51,
 'أؤلف': 52,
 'أؤمن': 53,
 'أإلى': 54,
 'أاصيب': 55,
 'أب': 56,
 'أبإمكانك': 57,
 'أبا': 58,
 'أبتاع': 59,
 'أبتز': 60,
 'أبتسم': 61,
 'أبتل': 62,
 'أبحار': 63,
 'أبحث': 64,
 'أبد': 65,
 'أبدأ': 66,
 'أبدا': 67,
 'أبدو': 68,
 'أبدوا': 69,
 'أبدين': 70,
 'أبذل': 71,
 'أبر': 72,
 'أبرد': 73,
 'أبريل': 74,
 'أبشع': 75,
 'أبعد': 76,
 'أبق': 77,
 'أبق

I will set the maximum sequence length for both the encoder and the decoder to 64, since the maximum number of tokens is 52 in Arabic and 36 in English.

In [81]:
max_encoder_seq_length = 64
max_decoder_seq_length = 64
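
Fixing the length means every sentence is padded with the pad id up to that length. A minimal sketch of the padding step, assuming a pad id of 0 and a hypothetical helper name `pad_to_length`:

```python
def pad_to_length(ids, max_len, pad_id=0):
    """Right-pad (or truncate) a list of token ids to exactly max_len entries."""
    return (ids + [pad_id] * max_len)[:max_len]

print(pad_to_length([5, 1802, 6], max_len=8))  # [5, 1802, 6, 0, 0, 0, 0, 0]
```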

In [82]:
dataset.head()

| | Arabic_input | English_target |
|---|---|---|
| 0 | مرحبا | START_ hi _END |
| 1 | اركض | START_ run _END |
| 2 | اخفض رأسك | START_ duck _END |
| 3 | اخفضي رأسك | START_ duck _END |
| 4 | اخفضوا رؤوسكم | START_ duck _END |


In [84]:
def init_input_target(dataset, max_encoder_seq_length, max_decoder_seq_length, num_decoder_tokens):
    # Encoder input: one row of token ids per sentence, padded to length 64.
    encoder_input_data = np.zeros( (len(dataset.Arabic_input), max_encoder_seq_length),dtype='float32')
    # Decoder input: one row of token ids per sentence, padded to length 64.
    decoder_input_data = np.zeros((len(dataset.English_target), max_decoder_seq_length), dtype='float32')
    # Decoder target: 64 x num_decoder_tokens per sentence (one-hot), since the decoder applies a softmax over all target tokens at each timestep.
    decoder_target_data = np.zeros((len(dataset.English_target), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
    
    return encoder_input_data, decoder_input_data, decoder_target_data

In [85]:
encoder_input_data, decoder_input_data, decoder_target_data = init_input_target(dataset, max_encoder_seq_length, max_decoder_seq_length, num_decoder_tokens)

In [163]:
def vectorize(tokenized_ds_copy, max_encoder_seq_length, max_decoder_seq_length, num_decoder_tokens):

    for i, (input_text_ls, target_text_ls) in enumerate(zip(tokenized_ds_copy.Arabic_input, tokenized_ds_copy.English_target)):
        # preparing the encoder inputs
        for t, tok in enumerate(input_text_ls):
            #To obtain the ids of encoder sentence's tokens from input_token_index
            encoder_input_data[i, t] = input_token_index[tok]
            
        encoder_input_data[i, t+1:] = input_token_index[pad_tok]
        
        # This loop is used to prepare the input and output of the decoder
        for t, tok in enumerate(target_text_ls):
            #1- prepare the decoder input
            #To obtain the ids of decoder sentence's tokens from target_token_index
            decoder_input_data[i, t] = target_token_index[tok]    
            
            # To obtain the decoder output
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                #We put 1 in the place of the expected word
                decoder_target_data[i, t - 1, target_token_index[tok]] = 1.
        decoder_input_data[i, t+1:] = target_token_index[pad_tok] 
        decoder_target_data[i, t:, target_token_index[pad_tok]] = 1.          
              
    return encoder_input_data, decoder_input_data, decoder_target_data              

In [164]:
encoder_input_data, decoder_input_data, decoder_target_data = vectorize(tokenized_ds_copy, max_encoder_seq_length, max_decoder_seq_length, num_decoder_tokens)

In [165]:
encoder_input_data[0]

array([5490.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.], dtype=float32)

In [166]:
input_token_index['مرحبا']

5490

Note: 

The 1st sentence in the Arabic_input column is 'مرحبا', and it contains no other word. If we print the 1st row of encoder_input_data, we find a single non-zero entry, and its value is the id of 'مرحبا'.

In [167]:
decoder_input_data[0]

array([   5., 1802.,    6.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
          0.], dtype=float32)

In [168]:
target_token_index['hi']

1802

In [169]:
target_token_index['START_']

5

In [170]:
target_token_index['_END']

6

Note:

The 1st sentence in the decoder input is START_ hi _END, so there are only three tokens, each with its own id. If we look up these ids in target_token_index, we find that row 0 of decoder_input_data holds the same numbers.
    

In [171]:
# The _END location
decoder_target_data[0][1][0:10]

array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)

In [172]:
# The hi location
decoder_target_data[0][0][1800:1810]

array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

Note:

The expected decoder output in this case is 'hi _END'. We find hi at timestep 0, at index 1802 (the 3rd position of the slice 1800:1810), and _END at timestep 1, at index 6 (the 7th position of the slice 0:10).
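
To make the one-step shift between decoder input and decoder target concrete, here is a small NumPy sketch with a toy four-token vocabulary. The ids are assumptions chosen for readability and differ from the real ids printed above.

```python
import numpy as np

token_ids = {"PAD": 0, "START_": 1, "hi": 2, "_END": 3}   # toy vocabulary (assumption)
sentence = ["START_", "hi", "_END"]
max_len, vocab_size = 5, len(token_ids)

decoder_input = np.zeros(max_len, dtype="int64")
decoder_target = np.zeros((max_len, vocab_size), dtype="float32")

for t, tok in enumerate(sentence):
    decoder_input[t] = token_ids[tok]
    if t > 0:
        # the target at step t-1 is the token fed at step t
        decoder_target[t - 1, token_ids[tok]] = 1.0
# every remaining target step is marked as padding
decoder_target[len(sentence) - 1:, token_ids["PAD"]] = 1.0

print(decoder_input)                  # [1 2 3 0 0]
print(decoder_target.argmax(axis=1))  # [2 3 0 0 0]
```

The target row is the input row shifted left by one position, with the start token dropped and padding filling the rest.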

### Modeling

Here I implement the encoder separately from the decoder to keep things simple.

In [192]:
def build_training_encoder(num_encoder_tokens, emb_sz, lstm_sz, mask_zero):
    # 1- Define the input to the encoder
    encoder_inputs = Input(shape=(None,))
    
    # 2- Define the embedding layer, applied to the encoder input defined above.
    # It needs two parameters: 1- the number of possible encoder tokens, 2- the embedding size of each word.
    # The vector the embedding layer outputs for each word captures that word's characteristics.
    en_x=  Embedding(num_encoder_tokens, emb_sz,mask_zero=mask_zero)(encoder_inputs)
    
    # 3- Define the 1st LSTM layer.
    # Its parameters are:
    # 1- return_state: makes the layer return its hidden state and cell state
    # 2- lstm_sz: the number of hidden units inside the LSTM layer.
    encoder = LSTM(lstm_sz, return_state=True)
    
    # 4- Apply the encoder LSTM to the embedding layer output
    # and keep its states, which are used to build the context vector
    encoder_outputs, state_h, state_c = encoder(en_x)
    
    # We discard `encoder_outputs` and only keep the states.
    encoder_states = [state_h, state_c]
  
    # Encoder model
    encoder_model = Model(encoder_inputs, encoder_states)
    print('\n The encoder model \n')
    encoder_model.summary()
  
    return encoder_model, encoder_states, encoder_inputs

In [193]:
# Set up the decoder, using `encoder_states` as initial state.
    
def build_training_decoder(num_decoder_tokens, emb_sz, lstm_sz, encoder_states, encoder_inputs, mask_zero):
        
    #1- define the input layer to the decoder.
    decoder_inputs = Input(shape=(None,))

    # 2- define the embedding layer for the decoder. In training it takes the expected tokens from the dataset.
    # The embedding layer parameters are:
    # 1- The number of possible decoder tokens.
    # 2- the embedding size
    # Note: mask_zero is hard-coded to True here; the mask_zero argument of this function is not used.
    decoder_embedding=  Embedding(num_decoder_tokens, emb_sz,mask_zero=True)

    # 3- put the layer of embedding on the layer of decoder inputs. 
    embedding_output= decoder_embedding(decoder_inputs)

    
    #4- Define the LSTM, which should return the hidden state and cell state as well as the output of each cell.
    # The hidden state and cell state are returned by enabling return_state.
    # The output for every token of the input sequence is returned by enabling return_sequences.
    decoder_lstm = LSTM(lstm_sz, return_sequences=True, return_state=True)

    
    # 5- Put the LSTM on top of the embedding layer and feed it the context vector 'encoder_states'
    decoder_outputs, _, _ = decoder_lstm(embedding_output, initial_state=encoder_states)

    #6- Define the fully connected layer which predicts the output token.
    # This layer outputs a vector with a probability for each possible output token,
    # using the 'softmax' activation function, so that the most likely word can be picked at each timestep.
    decoder_dense = Dense(num_decoder_tokens, activation='softmax')

    # 6-Put the dense/fully connected layer on top of the LSTM
    #and feed it the LSTM output vectors for all input tokens.
    decoder_outputs = decoder_dense(decoder_outputs)
    
    # 7-Build the combined model, which takes the training tokens for the encoder and decoder and outputs the decoder output.
    # This decoder output comes after the softmax of the dense layer
    combined_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    
    print('\n The combined model \n')
    combined_model.summary()
    
    return combined_model, decoder_inputs,embedding_output, decoder_lstm, decoder_dense



In [194]:
# This function reuses the decoder layers built for the training phase, so it needs the following:
# 1- decoder_inputs
# 2- embedding_output (the decoder embedding output)
# 3- decoder_lstm
# 4- decoder_dense
# Why reuse them? Because inference uses the same trained cells, just driven by the inference procedure described above.
def build_inference_decoder(num_decoder_tokens, lstm_sz, emb_sz, embedding_output, decoder_inputs, decoder_lstm, decoder_dense):
    
    # Decoder model: Re-build based on explicit state inputs. Needed for step-by-step inference:
    
    # define the hidden state of the context vector which will come from the encoder in the prediction
    decoder_state_input_h = Input(shape=(lstm_sz,))
    # define the cell state of the context vector which will come from the encoder in the prediction
    decoder_state_input_c = Input(shape=(lstm_sz,))
    
    #define the context vector that initializes the decoder. Its values are supplied at prediction time.
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    # feed the decoder LSTM with the embedding output, and initialize its state with decoder_states_inputs,
    # which will later be fed with the encoder context vector.
    decoder_outputs2, state_h2, state_c2 = decoder_lstm(embedding_output, initial_state=decoder_states_inputs)
    
    # Keep the updated hidden and cell states so they can be fed back in at the next step
    decoder_states2 = [state_h2, state_c2]
    
    # Feed the LSTM output to the dense layer, which predicts the output token.
    decoder_outputs2 = decoder_dense(decoder_outputs2)
    #define the decoder_model, which takes the decoder inputs and the initial state for the decoder.
    # this model outputs the decoder output token probabilities in addition to the hidden and cell states.
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2) 
    
    return decoder_model



In [176]:
   
emb_sz = 50
lstm_sz = 64
 
def model_seq_to_seq(batch_size, epochs, mask_zero):
    
    encoder_model, encoder_states, encoder_inputs = build_training_encoder(num_encoder_tokens, emb_sz, lstm_sz, mask_zero)
    combined_model, decoder_inputs,embedding_output, decoder_lstm, decoder_dense = build_training_decoder(num_decoder_tokens,
                                                                                                          emb_sz, lstm_sz, 
                                                                                                          encoder_states, 
                                                                                                          encoder_inputs,mask_zero)
    
    decoder_model = build_inference_decoder(num_decoder_tokens, lstm_sz, emb_sz, embedding_output, decoder_inputs,
                                            decoder_lstm, decoder_dense)

    
    # 8- compile the combined model in training phase
    combined_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

    combined_model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size,
                       epochs=epochs, validation_split=0.05)
    
    return combined_model, encoder_model, decoder_model

In [137]:
combined_model, encoder_model, decoder_model = model_seq_to_seq(batch_size=64, epochs=30, mask_zero=False)
combined_model.save_weights("translatemodel.h5")



 The encoder model 

Model: "model_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_12 (Embedding)     (None, None, 50)          360450    
_________________________________________________________________
lstm_12 (LSTM)               [(None, 64), (None, 64),  29440     
Total params: 389,890
Trainable params: 389,890
Non-trainable params: 0
_________________________________________________________________

 The combined model 

Model: "model_13"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_13 (InputLayer)           [(None, None)]       0                                            
______________________
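
The parameter counts in the encoder summary above can be checked by hand. The sketch below assumes the usual parameter formulas for an embedding layer and for an LSTM layer with biases; the helper names are hypothetical.

```python
def embedding_params(vocab_size, emb_sz):
    # one emb_sz-dimensional vector per token
    return vocab_size * emb_sz

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

vocab_size = 7205 + 4   # unique input tokens plus the four special tokens
print(embedding_params(vocab_size, 50))  # 360450
print(lstm_params(50, 64))               # 29440
```

The two numbers add up to 389,890, which matches the total reported in the encoder summary.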

### Inference

In [177]:
def decode_sequence(input_seq, sep = ' '):
    # 1- Encode the input to obtain the context vector from the encoder.
    states_value = encoder_model.predict(input_seq)
    
    # Generate empty target sequence of length 1 like this [[0]]
    target_seq = np.zeros((1,1))
    
    # Populate the first position of the target sequence with the start token index, e.g. [[5]]
    target_seq[0, 0] = target_token_index[st_tok]

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    # Loop until the end token is produced or the length limit is reached
    while not stop_condition:
        # Feed predict() the decoder input (target_seq) together with
        # the context vector from the encoder (states_value).
        # output_tokens holds the decoder's output scores over the target vocabulary
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        

        # Sample a token greedily: take the index of the highest-scoring entry
        # at the last time step of the first (and only) sequence in the batch
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        # Map the index back to its token
        sampled_tok = reverse_target_tok_index[sampled_token_index]
        # Append the separator and the token to the sentence
        decoded_sentence += sep + sampled_tok

        # Exit condition: either the decoded string exceeds 64 characters
        # or the end token is produced.
        if (sampled_tok == end_tok or len(decoded_sentence) > 64):
            stop_condition = True

        # Update the target sequence (of length 1) with the index of the output token.
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update the states to be fed to the next prediction step
        states_value = [h, c]
 
    return decoded_sentence
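
The loop in `decode_sequence` is a greedy decoder: at each step it feeds the previous token and the current states to the decoder, takes the most probable token, and stops at the end token or at a length limit. The sketch below isolates that control flow from Keras so it can be run on its own. `greedy_decode`, `toy_step` and the tiny vocabulary are hypothetical stand-ins, not part of the notebook; `toy_step` plays the role of `decoder_model.predict`. Unlike the notebook's function, this version limits the number of generated tokens rather than characters and does not append the end token to the result.

```python
def greedy_decode(step_fn, state, start_id, end_id, id_to_token, max_tokens=20):
    """Generate tokens one at a time, always picking the most probable one."""
    token_id = start_id
    tokens = []
    for _ in range(max_tokens):
        # step_fn stands in for decoder_model.predict: it returns a list of
        # probabilities over the vocabulary and the updated decoder state.
        probs, state = step_fn(token_id, state)
        token_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token_id == end_id:
            break
        tokens.append(id_to_token[token_id])
    return " ".join(tokens)


# Hypothetical toy "decoder": the state is a step counter and the model
# deterministically emits a fixed sentence followed by the end token.
id_to_token = {0: "_START", 1: "_END", 2: "i", 3: "am", 4: "cold"}
script = [2, 3, 4, 1]


def toy_step(token_id, state):
    probs = [0.0] * len(id_to_token)
    probs[script[state]] = 1.0
    return probs, state + 1


print(greedy_decode(toy_step, 0, start_id=0, end_id=1, id_to_token=id_to_token))
# prints: i am cold
```

Counting tokens instead of characters keeps the length limit independent of word length, which may be preferable when the target vocabulary contains long words.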

    

In [179]:
for seq_index in range(100): #[14077,20122,40035,40064, 40056, 40068, 40090, 40095, 40100, 40119, 40131, 40136, 40150, 40153]:
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', dataset.Arabic_input[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: 0    مرحبا
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of him _END
-
Input sentence: 1    اركض
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot _END
-
Input sentence: 2    اخفض رأسك
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of the table _END
-
Input sentence: 3    اخفضي رأسك
Name: Arabic_input, dtype: object
Decoded sentence:  this is a beautiful country _END
-
Input sentence: 4    اخفضوا رؤوسكم
Name: Arabic_input, dtype: object
Decoded sentence:  this is a good car _END
-
Input sentence: 5    النجده
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot _END
-
Input sentence: 6    اقفز
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of the table _END
-
Input sentence: 7    قف
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of him _END
-
Input sentence: 8    توقف 
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot _END
-
Input sentence: 9    إنتظر
Name

-
Input sentence: 75    من فاز
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of him _END
-
Input sentence: 76    من الذي ربح
Name: Arabic_input, dtype: object
Decoded sentence:  he is a lot of him _END
-
Input sentence: 77    لم لا
Name: Arabic_input, dtype: object
Decoded sentence:  why do you want _END
-
Input sentence: 78    لما لا
Name: Arabic_input, dtype: object
Decoded sentence:  i dont have _END
-
Input sentence: 79    لا فكره لدي
Name: Arabic_input, dtype: object
Decoded sentence:  i dont have a doctor _END
-
Input sentence: 80    استمتع بوقتك
Name: Arabic_input, dtype: object
Decoded sentence:  i am going to the hospital _END
-
Input sentence: 81    أسرعا
Name: Arabic_input, dtype: object
Decoded sentence:  this house is very _END
-
Input sentence: 82    لقد نسيت
Name: Arabic_input, dtype: object
Decoded sentence:  i am not going _END
-
Input sentence: 83    فهمته
Name: Arabic_input, dtype: object
Decoded sentence:  i am very hungry _END
-
Input sentence: 8

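### Changing the number of epochs and observing its effect on the decoder output

In this run the number of epochs is raised from 30 to 100. Note that other settings differ from the previous run as well: `lstm_sz` is 256 (the earlier encoder summary shows 64 units), `mask_zero` is `True` instead of `False`, and `validation_split` is 0.2 instead of 0.05, so the improvement cannot be attributed to the epoch count alone.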

In [197]:
emb_sz = 50
lstm_sz = 256
 
def model_seq_to_seq(batch_size, epochs,mask_zero):
    
    encoder_model, encoder_states, encoder_inputs = build_training_encoder(num_encoder_tokens, emb_sz, lstm_sz,mask_zero)
    combined_model, decoder_inputs,embedding_output, decoder_lstm, decoder_dense = build_training_decoder(num_decoder_tokens,
                                                                                                          emb_sz, lstm_sz, 
                                                                                                          encoder_states, 
                                                                                                          encoder_inputs, 
                                                                                                          mask_zero)
    
    decoder_model = build_inference_decoder(num_decoder_tokens, lstm_sz, emb_sz, embedding_output, decoder_inputs,
                                            decoder_lstm, decoder_dense)

    
    # 8- compile the combined model in training phase
    combined_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

    combined_model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)
    
    return combined_model, encoder_model, decoder_model

In [198]:
combined_model, encoder_model, decoder_model = model_seq_to_seq(batch_size=64, epochs=100, mask_zero=True)


 The encoder model 

Model: "model_27"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_33 (InputLayer)        [(None, None)]            0         
_________________________________________________________________
embedding_22 (Embedding)     (None, None, 50)          360450    
_________________________________________________________________
lstm_22 (LSTM)               [(None, 256), (None, 256) 314368    
Total params: 674,818
Trainable params: 674,818
Non-trainable params: 0
_________________________________________________________________

 The combined model 

Model: "model_28"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_33 (InputLayer)           [(None, None)]       0                                            
______________________

Epoch 41/100
...
Epoch 100/100


In [199]:
for seq_index in range(100): #[14077,20122,40035,40064, 40056, 40068, 40090, 40095, 40100, 40119, 40131, 40136, 40150, 40153]:
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', dataset.Arabic_input[seq_index: seq_index + 1])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: 0    مرحبا
Name: Arabic_input, dtype: object
Decoded sentence:  welcome _END
-
Input sentence: 1    اركض
Name: Arabic_input, dtype: object
Decoded sentence:  run _END
-
Input sentence: 2    اخفض رأسك
Name: Arabic_input, dtype: object
Decoded sentence:  duck _END
-
Input sentence: 3    اخفضي رأسك
Name: Arabic_input, dtype: object
Decoded sentence:  duck _END
-
Input sentence: 4    اخفضوا رؤوسكم
Name: Arabic_input, dtype: object
Decoded sentence:  duck _END
-
Input sentence: 5    النجده
Name: Arabic_input, dtype: object
Decoded sentence:  help _END
-
Input sentence: 6    اقفز
Name: Arabic_input, dtype: object
Decoded sentence:  jump _END
-
Input sentence: 7    قف
Name: Arabic_input, dtype: object
Decoded sentence:  stand up _END
-
Input sentence: 8    توقف 
Name: Arabic_input, dtype: object
Decoded sentence:  stop _END
-
Input sentence: 9    إنتظر
Name: Arabic_input, dtype: object
Decoded sentence:  wait _END
-
Input sentence: 10    داوم
Name: Arabic_input, dtype: objec

-
Input sentence: 85    فهمت ذلك
Name: Arabic_input, dtype: object
Decoded sentence:  i got it _END
-
Input sentence: 86    أستخدمه
Name: Arabic_input, dtype: object
Decoded sentence:  im using it _END
-
Input sentence: 87    سأدفع أنا
Name: Arabic_input, dtype: object
Decoded sentence:  ill pay _END
-
Input sentence: 88    أنا مشغول
Name: Arabic_input, dtype: object
Decoded sentence:  im not free _END
-
Input sentence: 89    إنني مشغول
Name: Arabic_input, dtype: object
Decoded sentence:  im busy _END
-
Input sentence: 90    أشعر بالبرد
Name: Arabic_input, dtype: object
Decoded sentence:  i am cold _END
-
Input sentence: 91    أنا حر
Name: Arabic_input, dtype: object
Decoded sentence:  im free _END
-
Input sentence: 92    أنا هنا
Name: Arabic_input, dtype: object
Decoded sentence:  im here _END
-
Input sentence: 93    لقد عدت إلى البيت
Name: Arabic_input, dtype: object
Decoded sentence:  im home _END
-
Input sentence: 94    أنا فقير
Name: Arabic_input, dtype: object
Decoded sentence:  

Note: Although the reported accuracy is low, the translations above are largely correct. Token-level accuracy is therefore not a reliable metric for sequence-to-sequence problems.
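
One way to see why accuracy can mislead here: position-wise accuracy compares tokens slot by slot, so a translation that is right but worded or ordered differently scores poorly. The sketch below contrasts it with a simple clipped unigram precision (the building block of overlap metrics such as BLEU, not BLEU itself). Both functions and the example sentences are illustrative assumptions; they are not the metric Keras computed during training.

```python
from collections import Counter


def token_accuracy(reference, hypothesis):
    """Fraction of reference positions where the hypothesis has the same token."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), 1)


def unigram_precision(reference, hypothesis):
    """Fraction of hypothesis tokens that also occur in the reference (clipped)."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    overlap = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return overlap / max(sum(hyp_counts.values()), 1)


# Hypothetical sentences: same words, different order.
reference = "i am very busy today"
hypothesis = "today i am very busy"

print(token_accuracy(reference, hypothesis))     # 0.0
print(unigram_precision(reference, hypothesis))  # 1.0
```

The hypothesis shares every word with the reference, yet no position matches, so position-wise accuracy reports zero.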

## References:

- https://github.com/motazsaad/process-arabic-text/blob/master/clean_arabic_text.py
- https://thescipub.com/pdf/jcssp.2020.117.125.pdf
- Aman Kedia and Mayank Rasu, Hands-On Python Natural Language Processing (book)
- https://colab.research.google.com/drive/1dhlc3Nt_LvZcxY5tUd-XLU1fvGt4pPuh?usp=sharing
- https://github.com/CAMeL-Lab/camel_tools#installing-data
- https://colab.research.google.com/drive/1Y3qCbD6Gw1KEw-lixQx1rI6WlyWnrnDS?usp=sharing#scrollTo=9knGLLGg7cnm
