<a href="https://colab.research.google.com/github/ArazShilabin/english-to-persian-translator/blob/main/en_fa_translator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### mount google drive

In [None]:
""" 
Paste this JavaScript into the browser console (Inspect > Console) so you won't need to click the page every 15 minutes:

########################
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);
########################

"""

from google.colab import drive
drive.mount('/content/drive')
%pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


'/content/drive/My Drive/projects/persian_english_translator'

### change current path to where the working project folder is at

In [None]:
%cd drive/MyDrive/projects/persian_english_translator/

[Errno 2] No such file or directory: 'drive/MyDrive/projects/persian_english_translator/'
/content/drive/MyDrive/projects/persian_english_translator


# Step 0: Get The Data

### upload the data to our current path and unzip it (uncomment and run this only once)

In [None]:
# # persian data
# %mkdir -p data  # make dir if doesn't exist
# %cd data
# !wget https://github.com/omidkashefi/Mizan/raw/master/mizan.zip
# !unzip mizan.zip
# %cd ..
# %pwd


### Get non breaking prefixes

In [None]:
# Get the non-breaking prefix files from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes,
# rename them to "nonbreaking_prefix.en" and "nonbreaking_prefix.de", and put them in your data folder so that the
# dot in 'mr.jackson' is not treated as the end of a sentence

# Step 1: Importing Dependencies

In [None]:
import numpy as np
import math
import re
import time # to see how long it takes in training

In [None]:
%tensorflow_version 2.x

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds # tools for the tokenizer

# Step 2: Data Preprocessing

## read files

In [None]:
with open("data/mizan/mizan_en.txt", mode='r', encoding="utf-8") as f:
    text_en = f.read()

with open("data/mizan/mizan_fa.txt", mode='r', encoding="utf-8") as f:
    text_fa = f.read()


print(text_en[:50])
print(text_fa[:50])

The story which follows was first written out in P
داستانی که از نظر شما می‌گذرد، ابتدا ضمن کنفرانس ص


## Cleaning

### English: <br> 1. All lowercase <br> 2. Split by "\\n" into a list of sentences <br> 3. Replace every non-letter character with a space and collapse repeated spaces



In [None]:
text_en = text_en.lower()
text_en = text_en.split("\n")
text_en = [re.sub(r"[^a-zA-Z\n]"," ", i) for i in text_en]
text_en = [re.sub(' +', ' ', i).strip() for i in text_en]

print(len(text_en))
print(text_en[2])


1021596
afterwards in the autumn of this first draft and some of the notes were lost


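### Farsi: <br> 1. Turn half-spaces (zero-width non-joiners, U+200C) into full spaces <br> 2. Split by "\\n" into a list of sentences <br> 3. Replace every character outside the Persian letter range with a space and collapse repeated spaces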


In [None]:
text_fa = text_fa.replace("\u200c", " ")
text_fa = text_fa.split("\n")
text_fa = [re.sub(r"[^آ-ی]"," ", i) for i in text_fa]
text_fa = [re.sub(' +', ' ', i).strip() for i in text_fa]

print(len(text_fa))
print(text_fa[2])

1021596
بعدا در پائیز سال این نوشته اولیه و بعضی از یادداشت ها مفقود شدند
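The two cleaning cells above can be tried on single lines with the standalone sketch below. It assumes the Persian character class in the notebook runs from U+0622 to U+06CC, and it uses `[^a-z]` for English because each line is already lowercased and split; the function names and sample strings are made up for illustration.

```python
import re

def clean_english(line):
    # Lowercase, replace every non-letter with a space, collapse repeated spaces.
    line = line.lower()
    line = re.sub(r"[^a-z]", " ", line)
    return re.sub(" +", " ", line).strip()

def clean_persian(line):
    # Replace the zero-width non-joiner (U+200C, the "half-space") with a space,
    # then blank out everything outside the assumed range U+0622..U+06CC.
    line = line.replace("\u200c", " ")
    line = re.sub("[^\u0622-\u06cc]", " ", line)
    return re.sub(" +", " ", line).strip()

sample_en = "Mr. Jackson's 2 dogs!"
# A Persian word containing a half-space, followed by a Persian comma (U+060C).
sample_fa = "\u0645\u06cc\u200c\u06af\u0630\u0631\u062f\u060c"

cleaned_en = clean_english(sample_en)
cleaned_fa = clean_persian(sample_fa)
print(cleaned_en)
print(cleaned_fa)
```

Note that digits and punctuation disappear entirely, so the model never sees numbers or sentence-final marks.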


### Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

text_en, text_en_test, text_fa, text_fa_test = train_test_split(text_en, text_fa, test_size=0.2, random_state=42)



### Tokenizing

In [None]:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    text_en, target_vocab_size=8000)
print("done EN")

tokenizer_fa = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    text_fa, target_vocab_size=8000)
print("done FA")

done EN
done FA


In [None]:
VOCAB_SIZE_EN = tokenizer_en.vocab_size + 2
VOCAB_SIZE_FA = tokenizer_fa.vocab_size + 2

# the ids produced by the tokenizer are all below tokenizer.vocab_size, so we
# use the next two ids, vocab_size and vocab_size+1 (i.e. VOCAB_SIZE-2 and
# VOCAB_SIZE-1), as the start and end tokens
# tokenizer_en.encode(sentence) returns a list, so list + list + list concatenates them
inputs = [[VOCAB_SIZE_EN-2] + tokenizer_en.encode(sentence) + [VOCAB_SIZE_EN-1]
          for sentence in text_en]
outputs = [[VOCAB_SIZE_FA-2] + tokenizer_fa.encode(sentence) + [VOCAB_SIZE_FA-1]
          for sentence in text_fa]
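To make the id arithmetic concrete, here is a minimal sketch with a hypothetical subword vocabulary size of 8000 (the real value comes from `tokenizer_en.vocab_size` and may differ from the target size). The helper name and the sample ids are made up.

```python
# Hypothetical subword vocabulary size; the real tokenizer may report another value.
subword_vocab_size = 8000

start_id = subword_vocab_size        # first id not used by the tokenizer
end_id = subword_vocab_size + 1      # second unused id
total_vocab_size = subword_vocab_size + 2

def add_special_tokens(token_ids):
    # Wrap an already-encoded sentence with the start and end ids.
    return [start_id] + token_ids + [end_id]

wrapped = add_special_tokens([17, 3, 256])  # made-up subword ids
print(wrapped)
```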


### Remove too long sentences
Why?<br>
(1) Padding makes every sentence as long as the longest one, which costs a huge amount of RAM. For example, sentences of lengths 1, 100 and 2 all become length 100 after padding; we would rather lose the one long sentence than pad everything to 100.<br>
(2) Long sentences take too much time to train on.

In [None]:
MAX_LENGTH = 20 # we will still have a lot of data with max len of 20

# collect the indices of the sentences to drop first, then delete them from
# both lists so that inputs[i] and outputs[i] stay aligned:
idx_to_remove = [count for count, sent in enumerate(inputs)
                 if len(sent) > MAX_LENGTH]
# we delete in reverse order because deleting from the beginning would shift the later indices
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

# same for outputs longer than MAX_LENGTH
idx_to_remove = [count for count, sent in enumerate(outputs)
                 if len(sent) > MAX_LENGTH]
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]
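An alternative way to express the same filter, shown here only as a sketch on toy data, is to build new lists from the pairs that fit. This avoids index shifting altogether; `filter_pairs` is a hypothetical helper, not part of the notebook.

```python
MAX_LENGTH = 20

def filter_pairs(inputs, outputs, max_length=MAX_LENGTH):
    # Keep a pair only when both sides fit, so the two lists stay aligned.
    kept = [(src, tgt) for src, tgt in zip(inputs, outputs)
            if len(src) <= max_length and len(tgt) <= max_length]
    new_inputs = [src for src, _ in kept]
    new_outputs = [tgt for _, tgt in kept]
    return new_inputs, new_outputs

toy_inputs = [[1] * 5, [1] * 25, [1] * 10]
toy_outputs = [[2] * 6, [2] * 4, [2] * 30]
short_inputs, short_outputs = filter_pairs(toy_inputs, toy_outputs)
print(len(short_inputs), len(short_outputs))
```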

### input/output creation
1) padding  
2) batching

In [None]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=MAX_LENGTH)
outputs = tf.keras.preprocessing.sequence.pad_sequences(outputs,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=MAX_LENGTH)
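What `padding='post'` does can be sketched in plain Python as follows. This is only an illustration that assumes no sequence is longer than `maxlen` (the length filter earlier in the notebook ensures that); `pad_post` is a made-up name.

```python
def pad_post(sequences, maxlen, value=0):
    # Append `value` to each sequence until it has length `maxlen`.
    # Assumes len(seq) <= maxlen; longer sequences would be returned unchanged.
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]

padded = pad_post([[8000, 5, 8001], [8000, 7, 9, 8001]], maxlen=6)
print(padded)
```

Because 0 is used as the padding value, the masks and the loss later treat id 0 as "no token".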

In [None]:
BATCH_SIZE = 64
BUFFER_SIZE = 20000 # size of the shuffle buffer

# turn our data into a tf.data dataset
dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))

# cache the dataset after the first pass; this speeds up access to the data,
# which in turn speeds up training:
dataset = dataset.cache()

# shuffle the examples (using a buffer of BUFFER_SIZE), then group them into batches
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# prefetch upcoming batches while the current one is being processed:
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
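The role of `BUFFER_SIZE` is easier to see in a small conceptual model of buffered shuffling followed by batching. The sketch below is plain Python and only imitates the idea; it is not how `tf.data` is implemented, and both function names are hypothetical.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=0):
    # Fill a buffer; once it is full, emit a random slot and refill that
    # slot with the next incoming item. Drain the buffer at the end.
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    # Group consecutive items into lists of at most batch_size elements.
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

batches = list(batch(shuffle_with_buffer(range(10), buffer_size=4), batch_size=3))
print(batches)
```

With a buffer smaller than the dataset the shuffle is only local: the first emitted element always comes from the first `buffer_size` items.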


# Step 3: Model Building

## A - Positional Encoding (look at the formula in the paper)

In [None]:
class PositionalEncoding(layers.Layer):
    
    def __init__(self):
        # PositionalEncoding is a subclass of layers.Layer, so it has all
        # the properties that a Keras layer has
        super(PositionalEncoding, self).__init__()

    def get_angles(self, pos, i, d_model):
        """
        :pos: (seq_len, 1) index of each word in the sentence, e.g. [0 to 19]
        :i: (1, d_model) index of each embedding dimension, e.g. [0 to 199]
            for an embedding of size 200
        :d_model: the size (dimension) of the embedding (e.g. 200)
        :return: (seq_len, d_model) the angle for every position paired with
                every dimension of the embedding
        """
        angles = 1 / np.power(10000., (2*(i//2))/np.float32(d_model))
        return pos * angles # dim: (seq_len, d_model)

    def call(self, inputs):
        # inputs.shape = [batch_size, seq_len, d_model]
        # keep in mind we DONT transform the input values here: we only read
        # the dims from the input, compute the positional encoding
        # separately, and add it to the input at the end
        seq_length = inputs.shape.as_list()[-2] # number of positions
        d_model = inputs.shape.as_list()[-1] # embedding size
        angles = self.get_angles(np.arange(seq_length)[:, np.newaxis],
                                 np.arange(d_model)[np.newaxis, :],
                                 d_model)
        angles[:, 0::2] = np.sin(angles[:, 0::2])
        angles[:, 1::2] = np.cos(angles[:, 1::2])
        # the inputs have a [batch] dimension at the beginning, so we add a
        # leading axis of size 1 to get [1, seq_len, d_model]; broadcasting
        # then reuses the same encoding for every example in the batch
        # (nothing is filled with 0's)
        pos_encoding = angles[np.newaxis, ...]
        # pos_encoding is a NumPy array, so we cast it to a tensor before
        # adding it to the inputs
        return inputs + tf.cast(pos_encoding, tf.float32)
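For reference, the formula from the paper is PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model)). The plain-Python sketch below computes the same table as the layer above for a tiny, made-up shape so the values can be checked by hand.

```python
import math

def positional_encoding(seq_len, d_model):
    # Even dimensions use sin, odd dimensions use cos; dimensions 2k and
    # 2k+1 share the same frequency 1 / 10000^(2k / d_model).
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000.0 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=3, d_model=4)
for row in pe:
    print([round(x, 4) for x in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.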

## B - Attention

### Attention computation (see the formula in the paper)

In [None]:
def scaled_dot_product_attention(queries, keys, values, mask):
    # Q*K^T is [output_len, d_proj] * [d_proj, input_len]; both lengths are 20
    # here, for both English and Persian
    # the transpose_b=True makes keys turn to keys.T
    # each of q, k, v has this dim: [batch_size, nb_proj, seq_len, d_proj]
    # so the product has this dim: [batch_size, nb_proj, seq_len, seq_len]
    product = tf.matmul(queries, keys, transpose_b=True)
    keys_dim = tf.cast(tf.shape(keys)[-1], tf.float32) # makes the dim_num float
    scaled_product = product / tf.math.sqrt(keys_dim) # scales it (formula stuff)

    # the mask is optional. It hides the padding and, in the decoder, stops a
    # position from seeing the future (the positions after it). To do this we
    # add -1e9 to the masked scores, so after the softmax their
    # probabilities become ~0
    if mask is not None:
        scaled_product += (mask * -1e9)
    
    # we apply the softmax along the last axis because we want each row to sum to 1
    # scaled_product = [output_len, input_len] -> softmax over input_len, so
    # for every query (output) position we get a probability distribution
    # over the key (input) positions
    # (e.g. for the ith query, the probs [0.3,0.7] over two keys)
    probs = tf.nn.softmax(scaled_product, axis=-1)

    # attention = [output_len, input_len] * [input_len, d_proj] = [output_len, d_proj]
    # so each output position now holds a weighted average of the value
    # vectors, which the later layers turn into a prediction
    attention = tf.matmul(probs, values)

    return attention
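The same computation, softmax(Q K^T / sqrt(d)) V with an additive mask, can be followed by hand on a tiny example. The sketch below is plain Python for a single head without a batch dimension; the vectors are made up, and the mask convention (1 means hidden) follows the notebook.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, mask=None):
    # queries: [q_len][d], keys: [k_len][d], values: [k_len][d_v]
    # mask: [q_len][k_len] with 1 where a key must be hidden, or None.
    d = len(keys[0])
    outputs, weights = [], []
    for qi, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if mask is not None:
            scores = [s + m * -1e9 for s, m in zip(scores, mask[qi])]
        probs = softmax(scores)
        weights.append(probs)
        outputs.append([sum(p * v[j] for p, v in zip(probs, values))
                        for j in range(len(values[0]))])
    return outputs, weights

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(q, k, v, mask=[[0, 0, 1]])  # hide the third key
print(w, out)
```

The masked key ends up with zero weight, and the remaining weights still sum to 1.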

In [None]:
# import numpy as np
# # this is just a test for you to see what happens in this line of code to the
# # dims (which we realized from the 4 dims, it only transposed the last two and
# # did mult only on those last two because matmul considers the other dims as
# # batch size and other stuff) (tf.matmul(a, b, transpose_b=True))
# a = np.arange(24).reshape(1,2,3,4)
# a = tf.convert_to_tensor(a, np.float32)
# b = np.arange(24).reshape(1,2,3,4)
# b = tf.convert_to_tensor(b, np.float32)
# product = tf.matmul(a, b, transpose_b=True)
# print(product.shape)


### Multi-Head attention sublayer

In [None]:
class MultiHeadAttention(layers.Layer):
    def __init__(self, nb_proj):
        """
        :nb_proj: the number of projections for the multihead
        """
        super(MultiHeadAttention, self).__init__()
        self.nb_proj = nb_proj
    
    # build() is like __init__, but it runs the first time we USE the layer
    # (when the input shape is known); __init__ runs when we CREATE the object
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        # d_model must be divisible by the number of projections
        assert self.d_model % self.nb_proj == 0
        # integer division (//) gives the size of each projection
        self.d_proj = self.d_model // self.nb_proj

        self.query_lin = layers.Dense(self.d_model)
        self.key_lin = layers.Dense(self.d_model)
        self.value_lin = layers.Dense(self.d_model)
        self.final_lin = layers.Dense(self.d_model)
    
    def split_proj(self, inputs, batch_size):
        """
        :inputs: [batch_size, seq_len(20), d_model(prev layer dim)]

        :return: 
            dims = [batch_size, nb_proj, seq_len, d_proj]
            nb_proj here is like channels in cnn
            we basically split the d_model to nb_proj*d_proj so d_proj is
            found by doing d_model/nb_proj
        """
        new_shape = (batch_size, -1, self.nb_proj, self.d_proj)
        # here we will get: [batch_sz, seq_len, nb_proj, d_proj]

        splited_inputs = tf.reshape(inputs, shape=new_shape)

        # so we need to transpose it to: [batch_size, nb_proj, seq_len, d_proj]
        return tf.transpose(splited_inputs, perm=[0, 2, 1, 3])


    def call(self, queries, keys, values, mask):
        batch_size = tf.shape(queries)[0]

        queries = self.query_lin(queries)
        keys = self.key_lin(keys)
        values = self.value_lin(values)

        # now we split each of them to make projs
        queries = self.split_proj(queries, batch_size)
        keys = self.split_proj(keys, batch_size)
        values = self.split_proj(values, batch_size)

        # each of the q,k,v are [batch_size, nb_proj, seq_len, d_proj]
        attention = scaled_dot_product_attention(queries, keys, values, mask)

        # now we reverse the split we did above: transpose, then reshape
        attention = tf.transpose(attention, perm=[0,2,1,3])
        # we have [batch_size, seq_len, nb_proj, d_proj], so the reshape merges dims 2 and 3
        concat_attention = tf.reshape(attention, shape=(batch_size, -1, self.d_model))
        outputs = self.final_lin(concat_attention)
        return outputs
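The reshape and transpose in `split_proj`, and their inverse at the end of `call`, amount to cutting each token vector into `nb_proj` contiguous slices and later gluing them back. A plain-Python sketch for one sentence (no batch dimension, made-up numbers, hypothetical helper names):

```python
def split_heads(sequence, nb_proj):
    # sequence: [seq_len][d_model] -> [nb_proj][seq_len][d_proj]
    d_model = len(sequence[0])
    assert d_model % nb_proj == 0, "d_model must be divisible by nb_proj"
    d_proj = d_model // nb_proj
    return [[token[h * d_proj:(h + 1) * d_proj] for token in sequence]
            for h in range(nb_proj)]

def merge_heads(heads):
    # Inverse of split_heads: concatenate the per-head slices of each token.
    seq_len = len(heads[0])
    return [[x for head in heads for x in head[t]] for t in range(seq_len)]

tokens = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # seq_len=3, d_model=4
heads = split_heads(tokens, nb_proj=2)
print(heads)
print(merge_heads(heads))
```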

## C - Encoder

In [None]:
class EncoderLayer(layers.Layer):

    def __init__(self, FFN_units, nb_proj, dropout):
        """
        :FFN_units:
            feed forward networks units: the number of units for the
            feed forward which you can see in the encoder part of the
            paper (right after the attention there is a feed forward...)
        :nb_proj: 
            the number of projections we have (8)
        :dropout:
            the dropout rate e.g. 0.3
        """
        super(EncoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout = dropout


    # we use 'build' because some of the values we need (such as d_model) are
    # not known when we create the EncoderLayer; they only become available
    # the first time the layer is called
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        # we first build the object for the multi-head-attention
        self.multi_head_attention = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)

        self.dense_1 = layers.Dense(units=self.FFN_units, activation="relu")
        # the second FFN layer is linear in the paper (no activation), as in the decoder
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_2 = layers.Dropout(rate=self.dropout)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)


    def call(self, inputs, mask, training):
        """
        :mask: which we will apply in the multi-head attention
        :training: 
            a bool: dropout is applied while training (True) to keep the
            model from overfitting, and skipped when we are just
            testing (False)
        """
        # if you look at the architecture you see that in the encoder
        # all of the query/key/val are the same array which is the input we
        # got from the previous layer
        attention = self.multi_head_attention(inputs, inputs, inputs, mask)

        # dropout + normalization after the attention
        attention = self.dropout_1(attention, training=training)
        # we do + inputs here because the architecture has a residual
        # connection: the inputs are summed with the attention output (not
        # concatenated), then we normalize the sum
        attention = self.norm_1(attention + inputs)
        
        # now we do the dense in our FFN:
        outputs = self.dense_1(attention)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_2(outputs, training=training)
        outputs = self.norm_2(outputs + attention)

        return outputs
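The "add and normalize" step used after each sublayer can be sketched for a single token vector as below. This is a simplified illustration: the learned scale and offset of `LayerNormalization` are left out, and the function names and numbers are made up.

```python
import math

def layer_norm(vector, epsilon=1e-6):
    # Normalize one token vector to zero mean and (almost) unit variance.
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + epsilon) for x in vector]

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection: element-wise sum (not concatenation), then normalize.
    summed = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    return layer_norm(summed)

normed = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
print([round(x, 4) for x in normed])
```

The output keeps the same length as the input, which is why every sublayer must produce vectors of size `d_model`.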


In [None]:
class Encoder(layers.Layer):

    def __init__(self,
                 nb_encoding_layers,
                 FFN_units,
                 nb_proj,
                 dropout,
                 vocab_size,
                 d_model,
                 name="encoder"):
        # we put name=name here because the name is something that belongs
        # to the layers class, so we tell it to use name="encoder"
        super(Encoder, self).__init__(name=name)
        self.nb_encoding_layers = nb_encoding_layers # the number of encoder layers in a row
        self.d_model = d_model # the embedding size (e.g. 200)

        # we give it the vocab size so it knows the largest token id in the vocab
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout)
        self.enc_layers = [EncoderLayer(FFN_units, nb_proj, dropout)
                        for _ in range(self.nb_encoding_layers)]


    def call(self, inputs, mask, training):
        # look at the paper's architecture while doing these
        # embedding layer (learned from scratch here rather than loaded from e.g. GloVe)
        outputs = self.embedding(inputs)
        # the reason we do this is what is written in section 3.4 of
        # the paper, which says the embeddings are multiplied
        # by sqrt of d_model
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        # this will give us the sum: outputs + pos_encoding
        outputs = self.pos_encoding(outputs)
        # now we do dropout before all the encoding layers
        # we give it training=bool -> so dont do dropout when training=false
        outputs = self.dropout(outputs, training)
        
        # now we apply the EncoderLayer several times, not just once.
        for i in range(self.nb_encoding_layers):
            # apply the (i)th encoder layer to the running outputs:
            outputs = self.enc_layers[i](outputs, mask, training)
        
        return outputs



## D - Decoder

In [None]:
class DecoderLayer(layers.Layer):

    def __init__(self, FFN_units, nb_proj, dropout):
        super(DecoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout = dropout

    def build(self, input_shape):
        self.d_model = input_shape[-1]

        # MHA 1
        self.multi_head_attention_1 = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
    
        # MHA 2
        self.multi_head_attention_2 = MultiHeadAttention(self.nb_proj)
        self.dropout_2 = layers.Dropout(rate=self.dropout)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)

        # FFN
        self.dense_1 = layers.Dense(units=self.FFN_units, activation='relu')
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_3 = layers.Dropout(rate=self.dropout)
        self.norm_3 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        # check the architecture in the paper to see why we do these

        # this is the 1st attention (masked self-attention over the decoder inputs)
        attention = self.multi_head_attention_1(inputs, inputs, inputs, mask_1)
        # we give it training=bool -> so dont do dropout when training=false
        attention = self.dropout_1(attention, training)
        attention = self.norm_1(attention + inputs)

        # this is the 2nd attention, which is a lot different from the first one:
        # the queries come from the decoder, the keys/values from the encoder
        attention_2 = self.multi_head_attention_2(attention,
                                                enc_outputs,
                                                enc_outputs,
                                                mask_2)
        # we give it training=bool -> so dont do dropout when training=false
        attention_2 = self.dropout_2(attention_2, training)
        # the residual comes from this sublayer's input (attention), not from
        # the layer's original inputs
        attention_2 = self.norm_2(attention_2 + attention)

        # the denses
        outputs = self.dense_1(attention_2)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_3(outputs, training)
        outputs = self.norm_3(outputs + attention_2)

        return outputs
        

In [None]:
class Decoder(layers.Layer):

    def __init__(self,
                 nb_decoding_layers,
                 FFN_units,
                 nb_proj,
                 dropout,
                 vocab_size,
                 d_model,
                 name="decoder"):
        super(Decoder, self).__init__(name=name)
        self.nb_decoding_layers = nb_decoding_layers # the number of decoder layers in a row
        self.d_model = d_model # the size of the output e.g. glove(200)

        # we give vocab size for it to know the maximum number used in vocab
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout)
        
        self.dec_layers = [DecoderLayer(FFN_units, nb_proj, dropout)
                        for _ in range(nb_decoding_layers)]


    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        # look at the paper's architecture while doing these
        # embedding layer (learned from scratch here rather than loaded from e.g. GloVe)
        outputs = self.embedding(inputs)
        # the reason we do this is what is written in section 3.4 of
        # the paper, which says the embeddings are multiplied
        # by sqrt of d_model
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        # this will give us the sum: outputs + pos_encoding
        outputs = self.pos_encoding(outputs)
        # now we do dropout before all the decoding layers
        # we give it training=bool -> so dont do dropout when training=false
        outputs = self.dropout(outputs, training)
        
        # now we apply the DecoderLayer several times, not just once.
        for i in range(self.nb_decoding_layers):
            # apply the (i)th decoder layer to the running outputs:
            outputs = self.dec_layers[i](outputs,
                                             enc_outputs,
                                             mask_1,
                                             mask_2,
                                             training)
        return outputs

## E - Transformer

In [None]:
class Transformer(tf.keras.Model):
    
    def __init__(self,
                 vocab_size_enc,
                 vocab_size_dec,
                 d_model,
                 nb_layers,
                 FFN_units,
                 nb_proj,
                 dropout,
                 name="transformer"):
        super(Transformer, self).__init__(name=name)

        # initing the Objects
        self.encoder = Encoder(nb_layers,
                               FFN_units,
                               nb_proj,
                               dropout,
                               vocab_size_enc,
                               d_model)
        self.decoder = Decoder(nb_layers,
                               FFN_units,
                               nb_proj,
                               dropout,
                               vocab_size_dec,
                               d_model)
        # this is at the very end: after the encoder and decoder are combined,
        # the output is projected to size vocab_size_dec
        self.last_linear = layers.Dense(units=vocab_size_dec)
        

    def create_padding_mask(self, seq):
        """
        :seq: [batch_size, seq_len(20)]
        :return: [batch_size, 1, 1, seq_len], 1.0 where the token is padding
        """
        # element-wise comparison with 0 (the padding id): 1.0 marks the
        # positions that hold no real word, for every sentence in the batch
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        # now we return that mask with 2 new axes so it can broadcast against
        # the attention scores [batch_size, nb_proj, seq_len, seq_len]
        # (a later cell shows an example to see it better)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def create_look_ahead_mask(self, seq):
        """
        :seq: [batch_size, seq_len(20)]
        :return: [seq_len, seq_len], 1.0 where position j comes after position i
        """
        seq_len = tf.shape(seq)[1]
        # a sample of what it produces is in a later cell
        # why do we do this? because when we predict the ith word we must not
        # see the words that come after it
        look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        return look_ahead_mask

    def call(self, enc_inputs, dec_inputs, training):
        # combining the encoder and decoder and masks here

        # creating the mask for encoder
        enc_mask = self.create_padding_mask(enc_inputs)

        # creating the mask for decoder

        # mask #1 is for the first decoder attention, which uses the decoder
        # inputs as q/k/v, so we take the max of the padding and look-ahead masks
        dec_mask_1 = tf.maximum(self.create_padding_mask(dec_inputs),
                                self.create_look_ahead_mask(dec_inputs))
        # mask #2 is for the second decoder attention, in which we use the
        # output of the encoder as k/v, so the padding mask is built from the
        # encoder inputs: the encoder still emits a vector at every padded
        # position, and the decoder should not attend to those
        # (experiment to try later: set this to None and compare the results)
        dec_mask_2 = self.create_padding_mask(enc_inputs)

        enc_outputs = self.encoder(enc_inputs, enc_mask, training)
        dec_outputs = self.decoder(dec_inputs,
                                   enc_outputs,
                                   dec_mask_1,
                                   dec_mask_2,
                                   training)
        
        outputs = self.last_linear(dec_outputs)
        
        return outputs

### testing masks to see their outputs

In [None]:
def create_padding_mask(seq):
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(seq):
    seq_len = tf.shape(seq)[1]
    look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    return look_ahead_mask

In [None]:
# sample: 1 batch of seq_len=8 -> [1,8]; this becomes [1,1,1,8] and
# broadcasting handles the remaining dims in later stages
# mask=1 means the position should be hidden
seq = tf.cast([[1, 2, 3, 0, 4, 0, 0, 0]], tf.int32)

print('The padding mask marks the padded positions with 1 (it is later broadcast over nb_proj and the query positions)')
print(create_padding_mask(seq), end='\n\n')

print("The look-ahead mask keeps only j<=i (the 0's), so position i cannot see the future indexes j>i (the 1's)")
print('Keep in mind that mask=1 means we need to get rid of that position, do not confuse it with mask=0')
print(1 - tf.linalg.band_part(tf.ones((5, 5)), -1, 0), end='\n\n')

print('Now applying both: (pay very close attention to this sample output, very important)')
print(tf.maximum(create_padding_mask(seq),
                 create_look_ahead_mask(seq)))


The padding mask marks the padded positions with 1 (it is later broadcast over nb_proj and the query positions)
tf.Tensor([[[[0. 0. 0. 1. 0. 1. 1. 1.]]]], shape=(1, 1, 1, 8), dtype=float32)

The look-ahead mask keeps only j<=i (the 0's), so position i cannot see the future indexes j>i (the 1's)
Keep in mind that mask=1 means we need to get rid of that position, do not confuse it with mask=0
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)

Now applying both: (pay very close attention to this samples output, very important)
tf.Tensor(
[[[[0. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 0. 1. 1. 1.]
   [0. 0. 0. 1. 0. 1. 1. 1.]
   [0. 0. 0. 1. 0. 1. 1. 1.]
   [0. 0. 0. 1. 0. 1. 1. 1.]]]], shape=(1, 1, 8, 8), dtype=float32)


# Step 4: Training

## Parameters

In [None]:
tf.keras.backend.clear_session()

# Hyper-parameters:
EPOCHS = 1
D_MODEL = 128 # 512 takes more time but has a lot better results
NB_LAYERS = 4 # 6
FFN_UNITS = 512 # 2048
NB_PROJ = 8 # 8
DROPOUT = 0.1 # 0.1

transformer = Transformer(vocab_size_enc=VOCAB_SIZE_EN,
                          vocab_size_dec=VOCAB_SIZE_FA,
                          d_model=D_MODEL,
                          nb_layers=NB_LAYERS,
                          FFN_units=FFN_UNITS,
                          nb_proj=NB_PROJ,
                          dropout=DROPOUT)


## Loss

In [None]:
from tensorflow.keras import backend as K  # K is needed by the helper below

def custom_sparse_categorical_accuracy(y_true, y_pred):
    return K.cast(K.equal(K.max(y_true, axis=-1),
                          K.cast(K.argmax(y_pred, axis=-1), K.floatx())),
                  K.floatx())
    
def sparse_cross_entropy(y_true, y_pred):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    return loss

In [None]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

def loss_function(target,prediction):
    mask = tf.math.logical_not(tf.math.equal(target,0))

    loss_ = loss_object(target,prediction)
    
    mask = tf.cast(mask,dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
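One detail of `loss_function` is worth seeing with numbers: the loss at padded positions is zeroed, but `tf.reduce_mean` still divides by the total number of positions, padding included. The plain-Python sketch below contrasts that with averaging over real tokens only; the per-token losses are made up, and the second function is shown only for comparison, not as what the notebook does.

```python
def masked_mean_all_positions(per_token_loss, targets, pad_id=0):
    # Zero the loss at padded positions, then average over ALL positions
    # (this mirrors the reduce_mean in the notebook's loss_function).
    masked = [loss if tok != pad_id else 0.0
              for loss, tok in zip(per_token_loss, targets)]
    return sum(masked) / len(masked)

def masked_mean_real_tokens(per_token_loss, targets, pad_id=0):
    # Comparison only: average over the non-padding tokens.
    kept = [loss for loss, tok in zip(per_token_loss, targets) if tok != pad_id]
    return sum(kept) / len(kept)

targets = [12, 7, 8001, 0, 0, 0]          # three real tokens, three pads
losses = [2.0, 1.0, 3.0, 9.0, 9.0, 9.0]   # made-up per-token losses

print(masked_mean_all_positions(losses, targets))
print(masked_mean_real_tokens(losses, targets))
```

The first value is smaller whenever a batch contains padding, so the reported loss depends on how much padding the batch has.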

## Optimizer

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    # take notice that we override __call__ instead of call, and that we
    # never pass the step ourselves: the optimizer supplies its current step
    # when it asks the schedule for a learning rate
    def __call__(self, step):
        # read section 5.3 of the paper and you'll understand arg1 & arg2,
        # which they used in their custom learning rate
        step = tf.cast(step, tf.float32)  # the optimizer may pass an integer step
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps**-1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(D_MODEL)
optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9,
                                     beta_2=0.98,
                                     epsilon=1e-9)
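The schedule is lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5): it rises linearly during warmup and then decays with the inverse square root of the step. A plain-Python sketch of the same formula, with a hypothetical function name, makes the shape easy to inspect:

```python
def transformer_learning_rate(step, d_model, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, inverse-sqrt decay afterwards.
    step = float(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_learning_rate(step, d_model=128))
```

The two branches meet at `step == warmup_steps`, which is where the learning rate peaks.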


## Checkpoints (delete your checkpoints if there are some unexpected errors when changing your data)

In [None]:
# making a checkpoint:
checkpoint_path = "./ckpt/"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# lets check if we already have a checkpoint
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest Checkpoint Restored...")

Latest Checkpoint Restored...


## Epochs

In [None]:
for epoch in range(EPOCHS):
    print("Start of epoch {}".format(epoch+1))
    start = time.time()

    train_loss.reset_states()
    train_accuracy.reset_states()

    # iterate on each batch:
    for (batch_index, (enc_inputs, targets)) in enumerate(dataset):
        # decoder input: the target without its last position, e.g. <s> hello friend
        # (for the target <s> hello friend <e>). why? because we predict the next word each time,
        # and the last thing we predict is <e>, so <e> is never needed as an input for our decoder
        dec_inputs = targets[:, :-1]
        # expected output: the target shifted one step to the left, e.g. hello friend <e>, so we get rid of <s>.
        # when we do the predictions we never need to predict the <s>, we start with <s>
        dec_outputs_real = targets[:, 1:]

        # the tape records the forward pass so gradients can be computed from it
        with tf.GradientTape() as tape:
            # the third argument (True) puts the model in training mode, so dropout is active
            predictions = transformer(enc_inputs, dec_inputs, True)
            loss = loss_function(dec_outputs_real, predictions)
        
        # compute the gradients of the loss w.r.t. the trainable variables
        gradients = tape.gradient(loss, transformer.trainable_variables)
        # let the Adam optimizer apply them
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

        # update the running loss and accuracy metrics
        train_loss(loss)
        train_accuracy(dec_outputs_real, predictions)

        # log the running loss and accuracy every 50 batches
        if batch_index % 50 == 0:
            print("Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}".format(
                epoch+1, batch_index, train_loss.result(), train_accuracy.result()))

    
    # at the end of each epoch we save a checkpoint
    ckpt_save_path = ckpt_manager.save()
    # model.save("model.h5")
    print("Saved checkpoint for epoch {}!".format(epoch+1))
        
# Epoch 3 Batch 0 Loss 2.2562 Accuracy 0.1612


Start of epoch 1
Epoch 1 Batch 0 Loss 2.2562 Accuracy 0.1612


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-41-8c2107a019cc>", line 25, in <module>
    gradients = tape.gradient(loss, transformer.trainable_variables)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/backprop.py", line 1087, in gradient
    unconnected_gradients=unconnected_gradients)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/imperative_grad.py", line 73, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/backprop.py", line 156, in _gradient_function
    return grad_fn(mock_op, *out_grads)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_grad.py", line 1741, in _MatMulGrad
    grad_b = gen_math_ops.mat_mul(a, grad, transpose_a=True)
  File "/usr/local/li

KeyboardInterrupt: ignored
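
The shift between `dec_inputs` and `dec_outputs_real` in the training loop is easier to see on a toy batch. A small NumPy sketch, where the ids 1 = `<s>`, 2 = `<e>` and 0 = padding are made up for illustration and are not the notebook's real token ids:

```python
import numpy as np

# Hypothetical token ids: 1 = <s>, 2 = <e>, 0 = padding (illustrative only).
targets = np.array([[1, 7, 8, 2],
                    [1, 9, 2, 0]])

dec_inputs = targets[:, :-1]        # drop the last position
dec_outputs_real = targets[:, 1:]   # drop the first position (<s>)

print(dec_inputs)
print(dec_outputs_real)
```

At every position, the expected output is the token that follows the corresponding decoder input, which is what lets the model learn next-token prediction for all positions in one pass.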

# Step 5: Evaluation

In [None]:
def evaluate(inp_sentence):
    # preprocessing
    inp_sentence = inp_sentence.lower()
    inp_sentence = re.sub(r"[^a-zA-Z]"," ", inp_sentence)
    inp_sentence = re.sub(' +', ' ', inp_sentence).strip()

    # encode the sentence into token ids, e.g. [hi, bye] -> [241, 6],
    # and wrap it in the <s> and <e> ids
    inp_sentence = [VOCAB_SIZE_EN-2] + tokenizer_en.encode(inp_sentence) + [VOCAB_SIZE_EN-1]
    # add a batch dimension on axis 0
    enc_input = tf.expand_dims(inp_sentence, axis=0)

    # the output starts with <s>, with a batch dimension of size 1 on axis 0
    output = tf.expand_dims([VOCAB_SIZE_FA-2], axis=0)

    # predict one token at a time and append it to the output
    for _ in range(MAX_LENGTH):
        # the third argument is False because we are not training, so no dropout
        # predictions = [batch_size=1, seq_len(output so far), vocab_size_fa]
        # (the softmax values of each word: the higher the number, the higher
        # the probability of that word)
        predictions = transformer(enc_input, output, False)
        # keep only the prediction for the last position
        prediction = predictions[:, -1:, :]
        # argmax gives the index of the most probable next word
        predicted_id = tf.cast(tf.argmax(prediction, axis=-1), tf.int32)

        # we reached the end of the sentence
        if predicted_id == VOCAB_SIZE_FA-1:
            return tf.squeeze(output, axis=0)
        
        # append the new prediction to the end of the output
        output = tf.concat([output, predicted_id], axis=-1)
    
    # MAX_LENGTH reached without predicting <e>: return what we have so far
    return tf.squeeze(output, axis=0)
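
The loop in `evaluate` is greedy decoding: at each step it keeps only the single most probable next token. A framework-free sketch of the same control flow, assuming a stand-in scoring function (`toy_next_token_scores`) in place of the transformer and made-up values for `START_ID`, `END_ID` and `MAX_LEN`:

```python
import numpy as np

START_ID, END_ID, MAX_LEN = 0, 1, 10  # hypothetical ids and length limit

def toy_next_token_scores(prefix, vocab_size=5):
    """Stand-in for the model: favours token len(prefix)+1, then END_ID."""
    scores = np.zeros(vocab_size)
    nxt = len(prefix) + 1
    scores[nxt if nxt < vocab_size else END_ID] = 1.0
    return scores

def greedy_decode(next_token_scores):
    """Append the argmax token until END_ID is predicted or MAX_LEN is hit."""
    output = [START_ID]
    for _ in range(MAX_LEN):
        predicted_id = int(np.argmax(next_token_scores(output)))
        if predicted_id == END_ID:
            break
        output.append(predicted_id)
    return output

print(greedy_decode(toy_next_token_scores))
```

Greedy decoding is cheap but can commit early to a poor token; beam search is the usual alternative when translation quality matters more than speed.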
    

In [None]:
def translate(sentence):
    output = evaluate(sentence).numpy()
    # get rid of <s> and <e> if they exist
    output = [i for i in output if i < VOCAB_SIZE_FA-2]
    # decode indexes to words e.g. [241, 6] -> [hi, bye] 
    predicted_sentence = tokenizer_fa.decode(output)
    return predicted_sentence

In [None]:
sentence = "I am happy"  # avoid shadowing the built-in input()
output = translate(sentence)
print("Input: {}".format(sentence))
print("Predicted translation: {}".format(output))

Input: I am happy
Predicted translation: من خوشحالم


In [None]:
sentence = "what a nice day"
output = translate(sentence)
print("Input: {}".format(sentence))
print("Predicted translation: {}".format(output))

Input: what a nice day
Predicted translation: چه روز خوبی است


In [None]:
sentence = "How is this possible"
output = translate(sentence)
print("Input: {}".format(sentence))
print("Predicted translation: {}".format(output))

Input: How is this possible
Predicted translation: چه کار می تواند چه می شود


In [None]:
sentence = "go on a trip"
output = translate(sentence)
print("Input: {}".format(sentence))
print("Predicted translation: {}".format(output))

Input: go on a trip
Predicted translation: به راه بیفتید


In [None]:
# !pip install tensorflowjs

In [None]:
# !tensorflowjs_converter --input_format keras '/model.h5' '/model'

## Bleu

In [None]:
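from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

check_n_first = 2000
# smoothing avoids degenerate scores when some n-gram order has no overlap
smoothing = SmoothingFunction().method1
n_sentences = len(text_en_test[:check_n_first])
total_bleu = 0.0  # do not shadow the built-in sum()
for idx in range(n_sentences):
    if idx % 50 == 0:
        print("done: {} / {}".format(idx, n_sentences))
    hypothesis = translate(text_en_test[idx])
    reference = text_fa_test[idx]
    # sentence_bleu expects a list of tokenized references and a tokenized
    # hypothesis; raw strings would be compared character by character
    total_bleu += sentence_bleu([reference.split()], hypothesis.split(),
                                smoothing_function=smoothing)
# NOTE: the score recorded below came from passing raw strings, i.e. a
# character-level comparison; it is not a word-level BLEU score.
print(total_bleu / n_sentences)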


done: 0 / 2000


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


done: 50 / 2000
done: 100 / 2000
done: 150 / 2000
done: 200 / 2000
done: 250 / 2000
done: 300 / 2000
done: 350 / 2000
done: 400 / 2000
done: 450 / 2000
done: 500 / 2000
done: 550 / 2000
done: 600 / 2000
done: 650 / 2000
done: 700 / 2000
done: 750 / 2000
done: 800 / 2000
done: 850 / 2000
done: 900 / 2000
done: 950 / 2000
done: 1000 / 2000
done: 1050 / 2000
done: 1100 / 2000
done: 1150 / 2000
done: 1200 / 2000
done: 1250 / 2000
done: 1300 / 2000
done: 1350 / 2000
done: 1400 / 2000
done: 1450 / 2000
done: 1500 / 2000
done: 1550 / 2000
done: 1600 / 2000
done: 1650 / 2000
done: 1700 / 2000
done: 1750 / 2000
done: 1800 / 2000
done: 1850 / 2000
done: 1900 / 2000
done: 1950 / 2000
0.7017855695792813
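
BLEU is built from clipped n-gram precision, so the unit being compared (words or characters) changes the score a lot. A pure-Python sketch of that building block for a single reference; `ngram_precision` is a hypothetical helper, not NLTK's implementation, and it leaves out the brevity penalty and the geometric mean over n-gram orders:

```python
from collections import Counter

def ngram_precision(reference, hypothesis, n=1):
    """Clipped n-gram precision of `hypothesis` against one `reference`.

    Both arguments are sequences of tokens (words or characters).
    """
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    total = max(len(hypothesis) - n + 1, 0)
    if total == 0:
        return 0.0
    matched = 0
    for gram, count in hyp.items():
        # a hypothesis n-gram is credited at most as often as it occurs in the reference
        matched += min(count, ref[gram])
    return matched / total

reference = "the cat sat on the mat"
hypothesis = "the cat is on a mat"

word_level = ngram_precision(reference.split(), hypothesis.split())
char_level = ngram_precision(list(reference), list(hypothesis))
print("word-level: {:.3f}  char-level: {:.3f}".format(word_level, char_level))
```

On this toy pair the word-level precision is 4/6 while the character-level precision is 18/19, which illustrates why a character-level score reads as much more optimistic than a word-level one.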


## match casing

In [None]:

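check_n_first = 100
total_match = 0.0  # do not shadow the built-in sum()
cnt = 0
n_sentences = len(text_en_test[:check_n_first])
for idx in range(n_sentences):
    if idx % 50 == 0:
        print("done: {} / {}".format(idx, n_sentences))
    hypothesis = translate(text_en_test[idx])
    reference = text_fa_test[idx]
    # compare whole words: iterating over the strings directly would compare
    # single characters and treat `in` as a substring test
    hyp_words = set(hypothesis.split())
    ref_words = set(reference.split())
    if hyp_words:
        # fraction of distinct predicted words that also occur in the reference
        total_match += len(hyp_words & ref_words) / len(hyp_words)
        cnt += 1
# NOTE: the value recorded below came from the character-level comparison.
print(total_match / cnt)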


done: 0 / 100
done: 50 / 100
0.8109631573706647
