<a href="https://colab.research.google.com/github/SITHUM-GIT/Dictionary/blob/main/HS_2019_0924_English_Sinhala_Tranlator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **English to Sinhala Translation System using Transformer Neural Network**

**COSC 44323 - Introduction To Deep Learning**

**Mini Project No. 03**

**Student Number: HS/2019/0924**

**Name: P.G.S.N Ilangathilaka**


**IMPORT LIBRARIES**

In [4]:
import random
import tensorflow as tf
import string
import re
from tensorflow import keras
from tensorflow.keras import layers

**MOUNT DRIVE**

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**READ THE DATASET**

In [6]:
text_file = "/content/drive/MyDrive/Deep Learning/Dictionary/Sinhala to english Dataset 40.txt"
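# Read as UTF-8 so the Sinhala text decodes correctly on any platform.
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
# Preview the first 20 sentence pairs.
for line in lines[:20]:
    print(line)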

Go.	යන්න.
Hi.	ආයුබෝවන්.
Run.	දුවන්න.
Who?	WHO?
Wow!	වාව්!
Fire!	ගිනි!
Help!	උදව්!
Jump!	පනින්න!
Jump.	පනින්න.
Stop!	නවත්වන්න!
Wait!	ඉන්න!
Go on.	යන්න.
Hello!	ආයුබෝවන්!
Hurry!	ඉක්මන් කරන්න!
I see.	මම දකියි.
I try.	මම උත්සාහ කරනවා.
I won!	මම දිනුවා!
Oh no!	අපොයි නෑ!
Relax.	සන්සුන් වන්න.
Shoot!	වෙඩි තියන්න!


In [7]:
# Preview the last 10 sentence pairs.
for line in lines[-10:]:
    print(line)

Instead of laying off these workers, why don't we just cut their hours?	මේ කම්කරුවන් දොට්ට දමනවා වෙනුවට අපි ඔවුන්ගේ පැය ගණන කපා නොගන්නේ මන්ද?
The thieves pulled open all the drawers of the desk in search of money.	හොරු සල්ලි හොයන්න මේසයේ ලාච්චු සේරම ඇරියා.
Father kept in touch with us by mail and telephone while he was overseas.	තාත්තා විදේශගතව සිටියදී තැපෑලෙන් සහ දුරකථනයෙන් අපිව සම්බන්ධ කරගත්තා.
George Washington was the first president of the Unites States of America.	ජෝර්ජ් වොෂින්ටන් ඇමරිකා එක්සත් ජනපදයේ පළමු ජනාධිපති විය.
Mother Teresa used the prize money for her work in India and around the world.	තෙරේසා මවුතුමිය එම ත්‍යාග මුදල ඉන්දියාවේ සහ ලොව පුරා සිය කටයුතු සඳහා යෙදවූවාය.
If you go to that supermarket, you can buy most things you use in your daily life.	ඔබ එම සුපිරි වෙළඳසැලට ගියහොත්, ඔබ එදිනෙදා ජීවිතයේ භාවිතා කරන බොහෝ දේ ඔබට මිලදී ගත හැකිය.
The passengers who were injured in the accident were taken to the nearest hospital.	අනතුරින් තුවාල ලැබූ මගීන් ළඟම ඇති රෝහල වෙත රැගෙන ගොස් 

**SPLIT EACH LINE INTO AN ENGLISH AND SINHALA TRANSLATION PAIR**

In [8]:
text_pairs = []
for line in lines:

    if line.count("\t") == 1:
        english, sinhala = line.split("\t", 1)
        sinhala = "[start] " + sinhala.strip() + " [end]"
        text_pairs.append((english.strip(), sinhala))
    else:
        print("Skipping line with incorrect format:", line)

for i in range(3):
    print(random.choice(text_pairs))

('It took all night to climb Mt Fuji.', '[start] ෆුජි කන්ද තරණය කිරීමට මුළු රාත්\u200dරිය ගත විය. [end]')
('He decided to go abroad.', '[start] ඔහු විදේශගත වීමට තීරණය කළේය. [end]')
('He missed the train by a minute.', '[start] ඔහුට විනාඩියකින් දුම්රිය මග හැරුණි. [end]')
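
The pairing step can be tried on a small in-memory sample before running it on the full file. The sketch below is a minimal, self-contained version of the same logic. `parse_pairs` and `sample_lines` are hypothetical names used only for illustration; the two valid lines are copied from the dataset preview above.

```python
def parse_pairs(lines):
    """Split tab-separated lines into (english, sinhala) pairs."""
    pairs, skipped = [], []
    for line in lines:
        # A valid line has exactly one tab between the two sentences.
        if line.count("\t") == 1:
            english, sinhala = line.split("\t")
            # Wrap the target sentence in start and end markers.
            pairs.append((english.strip(), "[start] " + sinhala.strip() + " [end]"))
        else:
            skipped.append(line)
    return pairs, skipped


sample_lines = ["Go.\tයන්න.", "Hi.\tආයුබෝවන්.", "no tab here"]
pairs, skipped = parse_pairs(sample_lines)
print(pairs)
print(skipped)
```

Returning the skipped lines, rather than only printing them, makes it easy to count how many malformed lines the dataset contains.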


**SHUFFLE THE DATA**

In [9]:
import random
random.shuffle(text_pairs)
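
The shuffle above is unseeded, so the train, validation and test splits differ between runs. If a repeatable split is wanted, one option is a seeded generator. The sketch below assumes a hypothetical helper `shuffled_copy` and an arbitrary seed of 42; neither is part of the original notebook.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items`; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, does not touch global state
    result = list(items)
    rng.shuffle(result)
    return result


first = shuffled_copy(range(10), seed=42)
second = shuffled_copy(range(10), seed=42)
print(first == second)  # True
```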

**SPLITTING THE DATASET INTO TRAINING, TESTING AND VALIDATION DATA**

In [10]:
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]
print("Total sentences:",len(text_pairs))
print("Training set size:",len(train_pairs))
print("Validation set size:",len(val_pairs))
print("Testing set size:",len(test_pairs))
print("Total size of the dataset:",len(train_pairs)+len(val_pairs)+len(test_pairs))


Total sentences: 49202
Training set size: 34442
Validation set size: 7380
Testing set size: 7380
Total size of the dataset: 49202
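
The sizes printed above follow directly from the 15% hold-out fraction. As a quick check of the arithmetic, the sketch below recomputes them from the total sentence count; `split_sizes` is a hypothetical helper written for this illustration.

```python
def split_sizes(total, holdout_fraction=0.15):
    """Return (train, validation, test) sizes for the slicing used above."""
    num_val = int(holdout_fraction * total)  # rounded down
    num_train = total - 2 * num_val          # everything not held out
    num_test = total - num_train - num_val   # the remaining slice
    return num_train, num_val, num_test


sizes = split_sizes(49202)
print(sizes)  # (34442, 7380, 7380)
```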


**REMOVING PUNCTUATION**

In [11]:
strip_chars = string.punctuation + "¿"
# Keep the square brackets so the "[start]" and "[end]" markers survive.
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")
f"[{re.escape(strip_chars)}]"

'[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\\\\\^_`\\{\\|\\}\\~¿]'
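
The effect of this character class can be checked without TensorFlow. The sketch below applies the same pattern with Python's `re` module; `standardize` is a hypothetical stand-in for the TensorFlow-based standardization, and `str.lower` is only an approximation of `tf.strings.lower`.

```python
import re
import string

strip_chars = string.punctuation + "¿"
# Keep the square brackets so the [start] and [end] markers survive.
strip_chars = strip_chars.replace("[", "").replace("]", "")
pattern = re.compile(f"[{re.escape(strip_chars)}]")


def standardize(text):
    """Lowercase the text and delete every character in strip_chars."""
    return pattern.sub("", text.lower())


cleaned = standardize("[start] Hello, World! [end]")
print(cleaned)  # [start] hello world [end]
```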

In [12]:
f"{3+5}"

'8'

**VECTORIZING THE ENGLISH AND SINHALA TEXT PAIRS**



In [13]:
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")
vocab_size = 15000
sequence_length = 20
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_sinhala_texts = [pair[1] for pair in train_pairs]


In [14]:
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_sinhala_texts)
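
Conceptually, `adapt` builds a vocabulary from the training texts, and calling the layer maps each word to an integer index and pads to a fixed length. The sketch below is a simplified pure-Python illustration of that idea, not the Keras implementation. `build_vocab` and `vectorize` are hypothetical names. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary words, as in the layer's `"int"` output mode.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent words to indices starting at 2."""
    counts = Counter(word for text in texts for word in text.split())
    # Two slots are reserved: 0 for padding, 1 for unknown words.
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(words)}


def vectorize(text, vocab, length):
    """Convert text to a fixed-length list of indices."""
    ids = [vocab.get(word, 1) for word in text.split()][:length]
    return ids + [0] * (length - len(ids))  # right-pad with zeros


vocab = build_vocab(["i see", "i try", "i won"], max_tokens=10)
vector = vectorize("i fly", vocab, length=4)
print(vector)  # [2, 1, 0, 0]
```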

**PREPARE THE DATASET FOR THE TRANSLATION**

In [15]:
batch_size = 64
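def format_dataset(eng, sin):
    eng = source_vectorization(eng)
    sin = target_vectorization(sin)
    # Teacher forcing: the decoder input drops the last token and the
    # target drops the first, so each position predicts the next token.
    return ({
        "english": eng,
        "sinhala": sin[:, :-1],
    }, sin[:, 1:])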
def make_dataset(pairs):
    eng_texts, sin_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    sin_texts = list(sin_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, sin_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
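    # Cache before shuffling; caching after the shuffle would freeze the
    # batch order from the first epoch for all later epochs.
    return dataset.cache().shuffle(2048).prefetch(16)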
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['sinhala'].shape: {inputs['sinhala'].shape}")
    print(f"targets.shape: {targets.shape}")
print(list(train_ds.as_numpy_iterator())[50])

inputs['english'].shape: (64, 20)
inputs['sinhala'].shape: (64, 20)
targets.shape: (64, 20)
({'english': array([[  17,    7,   28, ...,    0,    0,    0],
       [   6, 1609,  363, ...,    0,    0,    0],
       [1076,  900,  504, ...,    0,    0,    0],
       ...,
       [  16,    8,    2, ...,    0,    0,    0],
       [  15,  470,   37, ...,    0,    0,    0],
       [   7,   28,   40, ...,    0,    0,    0]]), 'sinhala': array([[   2,    5,   37, ...,    0,    0,    0],
       [   2, 2282, 2629, ...,    0,    0,    0],
       [   2, 5719, 2242, ...,    0,    0,    0],
       ...,
       [   2,  191,    6, ...,    0,    0,    0],
       [   2,    9, 1232, ...,    0,    0,    0],
       [   2,    5,   37, ...,    0,    0,    0]])}, array([[   5,   37,   41, ...,    0,    0,    0],
       [2282, 2629,  188, ...,    0,    0,    0],
       [5719, 2242,  273, ...,    0,    0,    0],
       ...,
       [ 191,    6,  111, ...,    0,    0,    0],
       [   9, 1232,   53, ...,    0,    0, 
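
In the arrays above, each `sinhala` row starts with index 2 while the matching target row starts one token later. That one-step offset can be sketched with a plain list. `teacher_forcing_split` and the token ids below are hypothetical; the ids are chosen only to resemble the first row.

```python
def teacher_forcing_split(token_ids):
    """Return (decoder_input, target), offset by one position."""
    decoder_input = token_ids[:-1]  # everything except the last token
    target = token_ids[1:]          # everything except the first token
    return decoder_input, target


# Hypothetical ids: 2 = [start], 3 = [end], 0 = padding.
sequence = [2, 5, 37, 41, 3, 0]
decoder_input, target = teacher_forcing_split(sequence)
print(decoder_input)  # [2, 5, 37, 41, 3]
print(target)         # [5, 37, 41, 3, 0]
```

At every position the target holds the token that follows the decoder input at that position, which is what the model is trained to predict.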

**TRANSFORMER ENCODER IMPLEMENTED AS A SUBCLASSED LAYER**

In [16]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
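
The encoder passes the padding mask to the attention layer so that padded positions receive no attention weight. The sketch below illustrates that effect for a single row of attention scores in plain Python. It is not how `MultiHeadAttention` is implemented internally, and `masked_softmax` is a hypothetical helper.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight where mask is 0."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one position")
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Four key positions; the last two are padding.
weights = masked_softmax([2.0, 1.0, 0.5, 0.0], [1, 1, 0, 0])
print(weights)
```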

**THE TRANSFORMER DECODER**

In [17]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)
    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)
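
The causal mask built by `get_causal_attention_mask` is a lower-triangular matrix: position `i` may attend to position `j` only when `i >= j`. The sketch below reproduces that pattern, and the element-wise minimum with a padding mask, using nested lists. `causal_mask` and `combine_with_padding` are hypothetical helper names.

```python
def causal_mask(sequence_length):
    """Row i has ones for columns 0..i and zeros afterwards."""
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]


def combine_with_padding(causal, padding):
    """Element-wise minimum of each causal row with the padding mask."""
    return [[min(c, p) for c, p in zip(row, padding)] for row in causal]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# The last position is padding, so its column is zeroed out.
combined = combine_with_padding(mask, [1, 1, 1, 0])
print(combined[3])  # [1, 1, 1, 0]
```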


**POSITION ENCODING**

In [18]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim
    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions
    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)
    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
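
The layer adds a learned position vector to each token vector and reports which positions hold real tokens. The toy sketch below shows both operations with plain lists. The lookup tables and the helper names `positional_embedding` and `padding_mask` are hypothetical, and the values are made up for illustration.

```python
# Hypothetical toy tables: 4 tokens, 3 positions, embedding size 2.
token_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
position_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]


def positional_embedding(token_ids):
    """Add the position vector to the token vector at each position."""
    return [[t + p for t, p in zip(token_table[token], position_table[position])]
            for position, token in enumerate(token_ids)]


def padding_mask(token_ids):
    """True for real tokens, False for padding (index 0)."""
    return [token != 0 for token in token_ids]


embedded = positional_embedding([2, 1, 0])
mask = padding_mask([2, 1, 0])
print(embedded)
print(mask)  # [True, True, False]
```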

**END-TO-END TRANSFORMER**

In [19]:
embed_dim = 256
dense_dim = 2048
num_heads = 8
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="sinhala")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [20]:
transformer.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 english (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 sinhala (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 positional_embedding (Posi  (None, None, 256)            3845120   ['english[0][0]']             
 tionalEmbedding)                                                                                 
                                                                                                  
 positional_embedding_1 (Po  (None, None, 256)            3845120   ['sinhala[0][0]']         
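
The parameter count of 3845120 reported for each positional embedding layer can be verified by hand: it is the size of the token embedding table plus the size of the position embedding table. The sketch below redoes that arithmetic with the hyperparameters set earlier; `positional_embedding_params` is a hypothetical helper.

```python
def positional_embedding_params(vocab_size, sequence_length, embed_dim):
    """Count the weights in the token and position embedding tables."""
    token_params = vocab_size * embed_dim          # 15000 * 256 = 3840000
    position_params = sequence_length * embed_dim  # 20 * 256 = 5120
    return token_params + position_params


total_params = positional_embedding_params(15000, 20, 256)
print(total_params)  # 3845120
```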

**TRAINING THE TRANSFORMER NEURAL NETWORK**

In [21]:
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=40, validation_data=val_ds)

Epoch 1/40
...
Epoch 40/40


<keras.src.callbacks.History at 0x78a7b07fbcd0>
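
The `sparse_categorical_crossentropy` loss used above penalizes the model according to the probability it assigns to the correct next token. For a single position, this is the negative log of that probability. The sketch below is a simplified illustration that leaves out the numerical safeguards and batch averaging done by Keras; the function and the example probabilities are hypothetical.

```python
import math


def sparse_categorical_crossentropy(target_index, probabilities):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[target_index])


# The correct token is index 2 in both cases.
confident = sparse_categorical_crossentropy(2, [0.05, 0.05, 0.9])
unsure = sparse_categorical_crossentropy(2, [0.4, 0.4, 0.2])
print(confident, unsure)
```

The loss is small when the correct token receives high probability and grows as that probability shrinks.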

**MANUAL TESTING OF THE TRANSLATION MODEL WITH 10 NEW SENTENCES**

In [23]:
import numpy as np
sin_vocab = target_vectorization.get_vocabulary()
sin_index_lookup = dict(zip(range(len(sin_vocab)), sin_vocab))
max_decoded_sentence_length = 10

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    # Greedy decoding: append the most probable next token at each step.
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Position i holds the prediction for the token that follows
        # the i-th token of the partial translation.
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = sin_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        # Stop once the model emits the end-of-sequence marker.
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(10):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

-
Trains are running on schedule.
[start] දුම්රිය නියමිත වේලාවට පැමිණේ [end]
-
My sister became a college student.
[start] මගේ සහෝදරිය හොඳ ශිෂ්‍යයෙකි [end]
-
Where did you learn that?
[start] ඔබ එය මිලදී ගත්තේ කොහෙන්ද [end]
-
An accident just happened.
[start] අනතුරක් වූ අතර පමණි [end]
-
I would like to be alone.
[start] මම තනියම ඉන්න කැමතියි [end]
-
I told Tom that I would do my best.
[start] මම ටොම්ට කිව්වා මම කලින් ඒක කරන්න යනවා කියලා [end]
-
She didn't know what to say to him.
[start] ඔහුව දැන සිටියේ කුමක්දැයි ඇයට කිසිවක් කීවේ නැත [end]
-
Do you think you can live on a dollar a day in America?
[start] ඔබට අවම වශයෙන් ලස්සන ජීවිතයක් ගත හැකිද [end]
-
If only I had known the answer yesterday!
[start] එය ප්‍රංශ භාෂාව පිළිබඳ එකම භාෂාවෙන් පිළිතුරු පමණි [end]
-
It's pouring.
[start] එය අවුල් සහගත ය [end]
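
The control flow of `decode_sequence` can be studied without the trained model. The sketch below replaces the Transformer with a hypothetical toy scorer, `toy_scores`, that prefers the upper-cased source word at each step and then `[end]`. Only the greedy loop mirrors the notebook: take the highest-scoring token, append it, and stop at `[end]` or at the length limit.

```python
def toy_scores(source_tokens, decoded_tokens):
    """Hypothetical stand-in for the model: score every candidate token."""
    position = len(decoded_tokens) - 1  # tokens produced so far, excluding [start]
    scores = {token.upper(): 0.1 for token in source_tokens}
    scores["[end]"] = 0.1
    if position < len(source_tokens):
        scores[source_tokens[position].upper()] = 0.9
    else:
        scores["[end]"] = 0.9
    return scores


def greedy_decode(sentence, max_length=10):
    """Pick the highest-scoring token at each step until [end]."""
    source_tokens = sentence.split()
    decoded = ["[start]"]
    for _ in range(max_length):
        scores = toy_scores(source_tokens, decoded)
        token = max(scores, key=scores.get)  # greedy choice (argmax)
        decoded.append(token)
        if token == "[end]":
            break
    return " ".join(decoded)


result = greedy_decode("go on")
print(result)  # [start] GO ON [end]
```

With a small `max_length` the loop ends before `[end]` is produced, which is also what happens in the notebook when a translation needs more than `max_decoded_sentence_length` tokens.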
