This notebook is to a train sequence to sequence model for punctuation prediction. This project followed the Neural Machine Translation model architecture which is available from TensorFlow.
https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt#data_cleaning_and_data_preparation \
\
***For this notebook, word tokenization is used and the input dataset to the decoder is remained as big case. ***

In [None]:
# !nvidia-smi

Sun Jan 24 05:49:52 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


**Import necessary libraries** 





In [1]:
from google.colab import files

import tensorflow as tf
import tensorflow_addons as tfa

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

from nltk.translate.bleu_score import corpus_bleu

import re
import numpy as np
import os
import io
import time

We have two corpora with the same sentences but different preprocessing, which are normalized_corpus.txt used for input to the encoder and 2008.txt used for decoder and it follows the format of our output thus called it target language. \
In order to use the data for training, the data must be preprocessed, tokenized and converted to sequences of integers using the following preprocessing methods.

**Preprocess the sentences in corpus** \
For the normalized_corpus.txt corpus, the sentences are preprocessed beforehand thus no need to be preprocessed here. \
For the 2008.txt corpus, the punctuation in each sentence is added a space before it by using regular expression for the tokenization later.
For both the corpora, we add <start> and <end> token for each sentence so that the model know when to start and stop predicting. 

**Tokenize** \
For a neural network to predict on text data, it first has to be turned into data it can understand. Since a neural network is a series of multiplication and addition operations, the input data needs to be number(s). \
We use Keras's Tokenizer function to tokenize each unique word into a number.


**Padding** \
When batching the sequence of number together, each sequence needs to be the same length. Since sentences are dynamic in length, we can add padding to the end of the sequences to make them the same length. \
Make sure all the input sequences have the same length and all the target sequences have the same length by adding padding to the end of each sequence using 'post' in Keras's pad_sequences function.

In [2]:
class NMTDataset:
    def __init__(self):
        self.inp_lang_tokenizer = None
        self.targ_lang_tokenizer = None

    ## Step 1 and Step 2 
    def preprocess_sentence(self, w, type):

        # creating a space between a word and the punctuation following it
        # eg: "he is a boy." => "he is a boy ."
        # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation

        if type == "t":
            w = re.sub(r"([\.])$", r" \1 ", w) # for . at the end
            w = re.sub("([^a-zA-Z0-9_\.-])", r" \1 ", w)  #except for - and . which is kata ganda
        w = re.sub('[\s]+', " " , w) #remove excessive spaces

        w = w.strip() #for " <end>"
        
        # adding a start and an end token to the sentence
        # so that the model know when to start and stop predicting.
        w = '<start> ' + w + ' <end>'
        return w

    def create_dataset(self, num_examples):
        # num_examples : Limit the total number of training example for faster training (set num_examples = len(lines) to use full data)
        with open('/content/drive/My Drive/files/normalized_corpus.txt' , 'r', encoding='windows-1256') as file1:
            input_lang = file1.readlines()
            input_lang = [self.preprocess_sentence(w,"i") for w in input_lang[:num_examples]]
            #input_lang = [self.preprocess_sentence(w,"i") for w in input_lang]
        with open('/content/drive/My Drive/files/2008.txt' , 'r', encoding='windows-1256') as file2:
            target_lang = file2.readlines()
            target_lang = [self.preprocess_sentence(w,"t") for w in target_lang[:num_examples]]
            #target_lang = [self.preprocess_sentence(w,"t") for w in target_lang]
        return target_lang, input_lang

    # Step 3 and Step 4
    def tokenize(self, lang, type): 
        # lang = list of sentences in a language

        if type == "t":
          lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', lower = False, oov_token='<OOV>')
        else:
          lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', lower = True, oov_token='<OOV>')
        
        #lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<OOV>')
        lang_tokenizer.fit_on_texts(lang)

        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts string (w1, w2, w3, ......, wn) 
        ## to a list of correspoding integer ids of words (id_w1, id_w2, id_w3, ...., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang) 
       
        ## tf.keras.preprocessing.sequence.pad_sequences takes argument a list of integer id sequences 
        ## and pads the sequences to match the longest sequences in the given input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

        # determine maximum length output sequence
        target_max_len = max(len(s) for s in tensor)
        print('Max Target Length: ', target_max_len)
        print('VOCAB Size: ', len(lang_tokenizer.word_index))
        print(lang[283])
        print(tensor[283])
        return tensor, lang_tokenizer

    def load_dataset(self, num_examples=None):
        # creating cleaned input, output pairs
        targ_lang, inp_lang = self.create_dataset(num_examples)

        input_tensor, inp_lang_tokenizer = self.tokenize(inp_lang, "i")
        target_tensor, targ_lang_tokenizer = self.tokenize(targ_lang, "t")

        return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

    def call(self, num_examples, BUFFER_SIZE, BATCH_SIZE):
        input_tensor, target_tensor, self.inp_lang_tokenizer, self.targ_lang_tokenizer = self.load_dataset(num_examples)

        input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.3)

        train_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train))
        train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

        val_dataset = tf.data.Dataset.from_tensor_slices((input_tensor_val, target_tensor_val))
        val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)

        return train_dataset, val_dataset, self.inp_lang_tokenizer, self.targ_lang_tokenizer

**Setting some parameter**:


1.   BUFFER_SIZE: to shuffle the dataset
2.   BATCH_SIZE:  the number of samples that will be propagated through the network. It is set to 32 because 64 will cause issue of out of memory
3.   NUM_EXAMPLES: number of lines in the corpus to train the networ. Only half of the corpus is used for faster training.

Maximum sentence length, vocabulary size for both datasets and examples of sequences of integers after tokenizing and converting are printed out for debugging.


In [3]:
BUFFER_SIZE = 32000
BATCH_SIZE = 32
#MAX_VOCAB_SIZE = 20000
# Let's limit the training examples for faster training
num_examples = 250000

dataset_creator = NMTDataset()
train_dataset, val_dataset, inp_lang, targ_lang = dataset_creator.call(num_examples, BUFFER_SIZE, BATCH_SIZE)

Max Target Length:  136
VOCAB Size:  84054
<start> dalam kejadian kira-kira pukul 11.30 malam itu rakan muhammad firdhaus harun setapa 19 hanya cedera ringan dan menerima rawatan sebagai pesakit luar di hospital berdekatan <end>
[    2     9   412   318   353  5703   275     8   834   499 29337  3577
 29338   960    85  1205  2152     5   181  1114    30  2198   154     6
   788  2702     3     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     

In [4]:
example_input_batch, example_target_batch = next(iter(train_dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([32, 136]), TensorShape([32, 176]))

**Setting parameters** \
Both the embedding dimension and network size is set to 256 to reduce the trainable parameters for faster training.

In [5]:
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1
max_length_input = example_input_batch.shape[1]
max_length_output = example_target_batch.shape[1]

embedding_dim = 256
units = 256
steps_per_epoch = num_examples//BATCH_SIZE

In [6]:
print("max_length_input, max_length_output, vocab_size_input, vocab_size_output")
max_length_input, max_length_output, vocab_inp_size, vocab_tar_size

max_length_input, max_length_output, vocab_size_input, vocab_size_output


(136, 176, 84055, 106461)

**Encoder and decoder structure** \
In this part of the code, we used the same technique of Neural Machine Translation provided by TensorFlow. \
Embedding is used to capture more precise syntactic and semantic word relationships. \
LSTM layer is used in the encoder. \
For decoder, luong attention mechanism is used to selectively concentrate on a few relevant things in encoder, while ignoring others in deep neural networks

In [7]:
##### 

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    ##________ LSTM layer in Encoder ------- ##
    self.lstm_layer = tf.keras.layers.LSTM(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')



  def call(self, x, hidden):
    x = self.embedding(x)
    output, h, c = self.lstm_layer(x, initial_state = hidden)
    return output, h, c

  def initialize_hidden_state(self):
    return [tf.zeros((self.batch_sz, self.enc_units)), tf.zeros((self.batch_sz, self.enc_units))]

In [8]:
## Test Encoder Stack

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)


# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_h, sample_c = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder h vector shape: (batch size, units) {}'.format(sample_h.shape))
print ('Encoder c vector shape: (batch size, units) {}'.format(sample_c.shape))

Encoder output shape: (batch size, sequence length, units) (32, 136, 256)
Encoder h vector shape: (batch size, units) (32, 256)
Encoder c vector shape: (batch size, units) (32, 256)


In [9]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz, attention_type='luong'):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.attention_type = attention_type

    # Embedding Layer
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    #Final Dense layer on which softmax will be applied
    self.fc = tf.keras.layers.Dense(vocab_size)

    # Define the fundamental cell for decoder recurrent structure
    self.decoder_rnn_cell = tf.keras.layers.LSTMCell(self.dec_units)



    # Sampler
    self.sampler = tfa.seq2seq.sampler.TrainingSampler()

    # Create attention mechanism with memory = None
    self.attention_mechanism = self.build_attention_mechanism(self.dec_units, 
                                                              None, self.batch_sz*[max_length_input], self.attention_type)

    # Wrap attention mechanism with the fundamental rnn cell of decoder
    self.rnn_cell = self.build_rnn_cell(batch_sz)

    # Define the decoder with respect to fundamental rnn cell
    self.decoder = tfa.seq2seq.BasicDecoder(self.rnn_cell, sampler=self.sampler, output_layer=self.fc)


  def build_rnn_cell(self, batch_sz):
    rnn_cell = tfa.seq2seq.AttentionWrapper(self.decoder_rnn_cell, 
                                  self.attention_mechanism, attention_layer_size=self.dec_units)
    return rnn_cell

  def build_attention_mechanism(self, dec_units, memory, memory_sequence_length, attention_type='luong'):
    # ------------- #
    # typ: Which sort of attention (Bahdanau, Luong)
    # dec_units: final dimension of attention outputs 
    # memory: encoder hidden states of shape (batch_size, max_length_input, enc_units)
    # memory_sequence_length: 1d array of shape (batch_size) with every element set to max_length_input (for masking purpose)

    if(attention_type=='bahdanau'):
      return tfa.seq2seq.BahdanauAttention(units=dec_units, memory=memory, memory_sequence_length=memory_sequence_length)
    else:
      return tfa.seq2seq.LuongAttention(units=dec_units, memory=memory, memory_sequence_length=memory_sequence_length)

  def build_initial_state(self, batch_sz, encoder_state, Dtype):
    decoder_initial_state = self.rnn_cell.get_initial_state(batch_size=batch_sz, dtype=Dtype)
    decoder_initial_state = decoder_initial_state.clone(cell_state=encoder_state)
    return decoder_initial_state


  def call(self, inputs, initial_state):
    x = self.embedding(inputs)
    outputs, _, _ = self.decoder(x, initial_state=initial_state, sequence_length=self.batch_sz*[max_length_output-1])
    return outputs

In [10]:
# Test decoder stack

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE, 'luong')
sample_x = tf.random.uniform((BATCH_SIZE, max_length_output))
decoder.attention_mechanism.setup_memory(sample_output)
initial_state = decoder.build_initial_state(BATCH_SIZE, [sample_h, sample_c], tf.float32)


sample_decoder_outputs = decoder(sample_x, initial_state)

print("Decoder Outputs Shape: ", sample_decoder_outputs.rnn_output.shape)

Decoder Outputs Shape:  (32, 175, 106461)


**Optimizers and loss function** \
Adam is used as the optimizer.


In [11]:
optimizer = tf.keras.optimizers.Adam()


def loss_function(real, pred):
  # real shape = (BATCH_SIZE, max_length_output)
  # pred shape = (BATCH_SIZE, max_length_output, tar_vocab_size )
  cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
  loss = cross_entropy(y_true=real, y_pred=pred)
  mask = tf.logical_not(tf.math.equal(real,0))   #output 0 for y=0 else output 1
  mask = tf.cast(mask, dtype=loss.dtype)  
  loss = mask* loss
  loss = tf.reduce_mean(loss)
  return loss

**Define checkpoint directory**

In [12]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

**One train step operations** \
This function carries out the training of a batch of the data:


1.   Call the encoder for the batch input sequence — the output is the encoded vector
2.   Set the decoder initial states to the encoded vector
3.   Call the decoder, taking the right-shifted target sequence as the input
4.   Calculate the loss and accuracy of the batch data
5.   Update the learnable parameters of the encoder and the decoder
6.   Update the optimizer


In [13]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_h, enc_c = encoder(inp, enc_hidden)


    dec_input = targ[ : , :-1 ] # Ignore <end> token
    real = targ[ : , 1: ]         # ignore <start> token

    # Set the AttentionMechanism object with encoder_outputs
    decoder.attention_mechanism.setup_memory(enc_output)

    # Create AttentionWrapperState as initial_state for decoder
    decoder_initial_state = decoder.build_initial_state(BATCH_SIZE, [enc_h, enc_c], tf.float32)
    pred = decoder(dec_input, decoder_initial_state)
    logits = pred.rnn_output
    loss = loss_function(real, logits)

  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))

  return loss

**Training** \
Epoch is set to 1 to avoid hitting the gpu usage limit in the Google Colab.

In [None]:
EPOCHS = 1

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0
  # print(enc_hidden[0].shape, enc_hidden[1].shape)

  for (batch, (inp, targ)) in enumerate(train_dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  # saving (checkpoint) the model every 2 epochs
  # if (epoch + 1) % 2 == 0:
  #   checkpoint.save(file_prefix = checkpoint_prefix)
  checkpoint.save(file_prefix = checkpoint_prefix)
  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))


Epoch 1 Batch 0 Loss 1.3746
Epoch 1 Batch 100 Loss 1.1255
Epoch 1 Batch 200 Loss 0.9125
Epoch 1 Batch 300 Loss 0.9512
Epoch 1 Batch 400 Loss 0.8093
Epoch 1 Batch 500 Loss 0.8011
Epoch 1 Batch 600 Loss 0.7373
Epoch 1 Batch 700 Loss 0.7626
Epoch 1 Batch 800 Loss 0.9259
Epoch 1 Batch 900 Loss 0.7843
Epoch 1 Batch 1000 Loss 0.6650
Epoch 1 Batch 1100 Loss 0.5873
Epoch 1 Batch 1200 Loss 0.5981
Epoch 1 Batch 1300 Loss 0.5048
Epoch 1 Batch 1400 Loss 0.9733
Epoch 1 Batch 1500 Loss 0.6988
Epoch 1 Batch 1600 Loss 0.5789
Epoch 1 Batch 1700 Loss 0.4582
Epoch 1 Batch 1800 Loss 0.4924
Epoch 1 Batch 1900 Loss 0.3471
Epoch 1 Batch 2000 Loss 0.3728
Epoch 1 Batch 2100 Loss 0.3695
Epoch 1 Batch 2200 Loss 1.0068
Epoch 1 Batch 2300 Loss 0.4960
Epoch 1 Batch 2400 Loss 0.4121
Epoch 1 Batch 2500 Loss 0.4115
Epoch 1 Batch 2600 Loss 0.3099
Epoch 1 Batch 2700 Loss 0.4428
Epoch 1 Batch 2800 Loss 0.9370
Epoch 1 Batch 2900 Loss 0.4528
Epoch 1 Batch 3000 Loss 0.4821
Epoch 1 Batch 3100 Loss 0.3445
Epoch 1 Batch 3200 L

In [None]:
!zip -r './training_checkpoints.zip' './training_checkpoints'

files.download("./training_checkpoints.zip")

**Evaluate sentence operations** \
The input sentence is preprocessed, tokenized and mapped to pretrained vocabulary's word index. It then passed to the model for predicting the output. tf-addons BasicDecoder is used for decoding.

In [14]:
def evaluate_sentence(sentence):
  sentence = dataset_creator.preprocess_sentence(sentence, "i")

  #inputs = [inp_lang.word_index[i] for i in sentence.split(' ') ]
  inputs = []
  #map oov token to 1
  for i in sentence.split(' '):
    try:
      inputs.append(inp_lang.word_index[i])
    except:
      inputs.append(1)

  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                          maxlen=max_length_input,
                                                          padding='post')
  inputs = tf.convert_to_tensor(inputs)
  inference_batch_size = inputs.shape[0]
  result = ''

  enc_start_state = [tf.zeros((inference_batch_size, units)), tf.zeros((inference_batch_size,units))]
  enc_out, enc_h, enc_c = encoder(inputs, enc_start_state)

  dec_h = enc_h
  dec_c = enc_c

  start_tokens = tf.fill([inference_batch_size], targ_lang.word_index['<start>'])
  end_token = targ_lang.word_index['<end>']

  greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()

  # Instantiate BasicDecoder object
  decoder_instance = tfa.seq2seq.BasicDecoder(cell=decoder.rnn_cell, sampler=greedy_sampler, output_layer=decoder.fc)
  # Setup Memory in decoder stack
  decoder.attention_mechanism.setup_memory(enc_out)

  # set decoder_initial_state
  decoder_initial_state = decoder.build_initial_state(inference_batch_size, [enc_h, enc_c], tf.float32)


  ### Since the BasicDecoder wraps around Decoder's rnn cell only, you have to ensure that the inputs to BasicDecoder 
  ### decoding step is output of embedding layer. tfa.seq2seq.GreedyEmbeddingSampler() takes care of this. 
  ### You only need to get the weights of embedding layer, which can be done by decoder.embedding.variables[0] and pass this callabble to BasicDecoder's call() function

  decoder_embedding_matrix = decoder.embedding.variables[0]

  outputs, _, _ = decoder_instance(decoder_embedding_matrix, start_tokens = start_tokens, end_token= end_token, initial_state=decoder_initial_state)
  return outputs.sample_id.numpy()

def translate1(sentence,actual):
  result = evaluate_sentence(sentence)
  #print(result)
  result = targ_lang.sequences_to_texts(result)
  print('Input: %s' % (sentence))
  print('Actual: %s' % (actual))
  print('Predicted translation: {}'.format(result))

def translate2(sentence):
  result = evaluate_sentence(sentence)
  result = targ_lang.sequences_to_texts(result)
  return result

**Restoring the checkpoint** 

In [None]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))


In [24]:
 !unzip -u "/content/drive/MyDrive/training_checkpoints(capital).zip" -d "/content"    

Archive:  /content/drive/MyDrive/training_checkpoints(capital).zip
   creating: /content/training_checkpoints/
  inflating: /content/training_checkpoints/checkpoint  
  inflating: /content/training_checkpoints/ckpt-1.index  
  inflating: /content/training_checkpoints/ckpt-1.data-00000-of-00001  


In [15]:
checkpoint.restore(tf.train.latest_checkpoint("/content/training_checkpoints"))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fd12bf148d0>

**Testing the model** \
Here are examples from testing dataset that input to the model and the predictions are as below.

In [26]:
translate1("dalam situasi begini juga umat islam perlu memiliki sifat hiba sedih dan marah apabila melihat saudara seislam ditindas dan hak mereka diketepikan", "Dalam situasi begini juga, umat Islam perlu memiliki sifat hiba, sedih dan marah apabila melihat saudara seIslam ditindas dan hak mereka diketepikan.")

Input: dalam situasi begini juga umat islam perlu memiliki sifat hiba sedih dan marah apabila melihat saudara seislam ditindas dan hak mereka diketepikan
Actual: Dalam situasi begini juga, umat Islam perlu memiliki sifat hiba, sedih dan marah apabila melihat saudara seIslam ditindas dan hak mereka diketepikan.
Predicted translation: ['Dalam hal ini juga mempunyai salah kosong . <end>']


In [27]:
translate1("sebagai orang islam yang beriman elakkan perasaan rugi dan sia-sia atas pengorbanan yang dilakukan kerana sesungguhnya allah s.w.t", "\"Sebagai orang Islam yang beriman, elakkan perasaan rugi dan sia-sia atas pengorbanan yang dilakukan kerana sesungguhnya Allah s.w.t. ")

Input: sebagai orang islam yang beriman elakkan perasaan rugi dan sia-sia atas pengorbanan yang dilakukan kerana sesungguhnya allah s.w.t
Actual: "Sebagai orang Islam yang beriman, elakkan perasaan rugi dan sia-sia atas pengorbanan yang dilakukan kerana sesungguhnya Allah s.w.t. 
Predicted translation: ['Dia kena Islam yang tidak tidak tidak tidak dilakukan oleh nada yang akan diberi yang dilakukan dengan nada yang akan diberi yang dilakukan oleh nada yang akan . <end>']


In [16]:
translate1("dalam konteks ini kita harus memperbetulkan persepsi dalam masyarakat", "\"Dalam konteks ini, kita harus memperbetulkan persepsi dalam masyarakat. ")

Input: dalam konteks ini kita harus memperbetulkan persepsi dalam masyarakat
Actual: "Dalam konteks ini, kita harus memperbetulkan persepsi dalam masyarakat. 
Predicted translation: ['Dalam konteks ini saya yang boleh melatih persepsi dalam negara . <end>']


The whole dataset is passed through the model for prediction and the output are written to the prediction file.

In [None]:

with open ('/content/drive/My Drive/files/norm.txt' , 'r', encoding='windows-1256') as test_file:
  for line in test_file.readlines():
    #there are some cases where the prediction enters an infinite loop and cause out of memory
    try:
      result = translate2(line)
      result = [i.replace(" <end>" ,"") for i in result]
    except:
      result = "nil"
    
    with open ('/content/drive/My Drive/prediction.txt' , 'a', encoding='windows-1256') as test3_file:
      test3_file.writelines(result)
      test3_file.writelines('\n')

   

**Evaluating the performance** \
Bleu score is a score for comparing a candidate translation of text to one or more reference translations. It is suitable to evaluate text generated for a suite of natural language processing tasks.

In [18]:
actual, predicted = list(), list()

with open('/content/drive/My Drive/files/Capital_Word_prediction.txt' , 'r', encoding='windows-1256') as predicted_file:
  for lines in predicted_file.readlines():
    predicted.append(lines.split())
  
with open('/content/drive/My Drive/files/unnorm.txt' , 'r', encoding='windows-1256') as actual_file:
  for lines in actual_file.readlines():
    lines = dataset_creator.preprocess_sentence(lines, "t")  #preprocess it so that the punctunation will one separate token
    lines = lines.replace(' <end>','')
    lines = lines.replace('<start> ','')
    actual.append(lines.split())

# calculate BLEU score
print('BLEU: %f' % corpus_bleu(actual, predicted))

BLEU: 0.433441


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
