<a href="https://colab.research.google.com/github/Torey-Clark/japn-eng-translator/blob/local/Jpn_Eng_Translator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Japanese to English Translations</h1>
<h4>External Sources</h4>

*   <a href="https://pypi.org/project/fugashi/">Fugashi - Japanese Tokenizer</a>
*   <a href="https://pypi.org/project/mecab-python3/">MeCab - Analysis Engine</a>
*   <a href="https://pypi.org/project/unidic/">UniDic - Japanese Dictionary</a>
*   <a href="http://www.manythings.org/bilingual/jpn/">ManyThings - English-Japanese Sentence Pairs</a>

<h4>Difficulties</h4>
<p>For translating between English and Japanese, we need to identify the different words in Japanese sentences. Since Japanese does not use punctuation or spaces to signify divisions of words, we are using the tokenizer Fugashi, which is a reskin of MeCab, to identify different words and conjugations using the dictionary UniDic.
</p>

<p>Install the libraries that are not native for Tensor Flow.

Import each library we will be using throughout the system.

Declare our tokenizing object from fugashi. For performance, it is far better to have a single instance of the tokenizer than to declare one for each sentence we want to tokenize.

Set the path to the English-Japanese sentence pairs we have copied in GitHub.</p>

In [27]:
!pip install fugashi
!pip install mecab-python3
!pip install fugashi[unidic]
#!python -m unidic download #Only need to download the dictionary once. Can be commented out afterwards.
import tensorflow as tf

import os
import fugashi
import MeCab
import time
import sys
import numpy as np
import unicodedata
import re
import io

from tensorflow import keras
from tensorflow.keras import layers

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split

# Initialization
jpn_tokenizer = fugashi.Tagger("-Owakati")

# Download File
path_to_file = tf.keras.utils.get_file("jpn.txt", 
                                       origin="https://raw.githubusercontent.com/Torey-Clark/japn-eng-translator/main/jpn.txt", 
                                       extract=False)



<p>
Declare our collection of untility functions. Most of these function come directly from the tensorflow tutorial for translating between Spanish and English. Changes have been made were appropriate for our chosen languages.
</p>

In [28]:
# Utility Functions - BEGIN
debug = False
# Convert unicode file to ascii
# Not used since Japanese characters are only properly displayed using Unicode. We will print each sentence in Unicode for human readability.
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

# Convert all sentences into our standard format.
def preprocess_sentence_eng(sentence):
    #sentence = unicode_to_ascii(sentence.lower().strip())

    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)

    sentence.rstrip().strip()

    sentence = '<start> ' + sentence + ' <end>'
    if debug:
        print(len(sentence))
        print(sentence)

    return sentence

# Use Fugashi to tokenize Japanese sentences and place into our standard format.
def preprocess_sentence_jpn(sentence):
    # For the sake of human readability, we do not convert Japanese into ASCII.
    #sentence = unicode_to_ascii(sentence.lower().strip())

    words = [word.surface for word in jpn_tokenizer(sentence)]

    sentence = '<start> '
    for word in words:
        sentence = sentence + word + ' '

    #sentence = sentence[:-1] # Remove the trailing space.
    sentence = sentence + '<end>'
    if debug:
        print("First word: " + words[1])
        print("Length of sentence: " + len(sentence))
        print("Sentence: " + sentence)

    return sentence

# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, JAPANESE]
def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

    word_pairs = []
    # For each line in the file, split it between English and Japanese
    for l in lines:
        temp_word_pairs = []
        words = l.split('\t')
        temp_word_pairs.append(preprocess_sentence_eng(words[0])) # Pre-Process English sentences
        temp_word_pairs.append(preprocess_sentence_jpn(words[1])) # Pre-Process Japanese sentences
        word_pairs.append(temp_word_pairs)

    print("Dataset Created")
    return zip(*word_pairs)

def tokenize(lang): 
    print("Tokenizing")
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    lang_tokenizer.fit_on_texts(lang)

    tensor = lang_tokenizer.texts_to_sequences(lang)

    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')

    print("Tokenized")
    return tensor, lang_tokenizer

def load_dataset(path, num_examples=None):
    targ_lang, inp_lang = create_dataset(path, num_examples)

    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

def convert(lang, tensor):
    for t in tensor:
        if t != 0:
            print("%d ----> %s" % (t, lang.index_word[t]))
# Utility Function - END

<p>We can see that fugashi is properly tokenizing the Japanese sentence so we can use the pairs for training the model and translating future sentences.</p>

In [29]:
# Example sentences
eng_sentence_example = u"If someone who doesn't know your background says that you sound like a native speaker, it means they probably noticed something about your speaking that made them realize you weren't a native speaker. In other words, you don't really sound like a native speaker."
jpn_sentence_example = u"生い立ちを知らない人にネイティブみたいに聞こえるよって言われたら、それはおそらく、あなたの喋り方のどこかが、ネイティブじゃないと感じさせたってことだよ。つまりね、ネイティブのようには聞こえないということなんだよ。"

print(preprocess_sentence_eng(eng_sentence_example))
print(preprocess_sentence_jpn(jpn_sentence_example))

<start> If someone who doesn't know your background says that you sound like a native speaker , it means they probably noticed something about your speaking that made them realize you weren't a native speaker . In other words , you don't really sound like a native speaker .  <end>
<start> 生い立ち を 知ら ない 人 に ネイティブ みたい に 聞こえる よ って 言わ れ たら 、 それ は おそらく 、 あなた の 喋り 方 の どこ か が 、 ネイティブ じゃ ない と 感じ させ た って こと だ よ 。 つまり ね 、 ネイティブ の よう に は 聞こえ ない と いう こと な ん だ よ 。 <end>


In [30]:
maximum_dataset_size = 250
eng, jpn = create_dataset(path_to_file, None)
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, maximum_dataset_size)
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

Dataset Created
Dataset Created
Tokenizing
Tokenized
Tokenizing
Tokenized


<p>We declare how large the training set will be. A larger training set will take longer but will produce more accurate results.</p>

In [31]:
num_epochs = 1
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_taget_batch = next(iter(dataset))
example_input_batch.shape, example_taget_batch.shape

(TensorShape([64, 61]), TensorShape([64, 50]))

In [32]:
# Encoder
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units, return_sequences=True,  return_state=True, recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

In [33]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)

In [34]:
# Attention
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [35]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

In [36]:
# Decoder
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In [37]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (64, 10070)


In [38]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Function definitions
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    
    return tf.reduce_mean(loss_)

In [39]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)

In [40]:
@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)

        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

        #Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

            loss += loss_function(targ[:, t], predictions)

            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))

    variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, variables)

    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss

<h3>Training</>
<p>This is where the model is trained using the dataset we have and using the size we declared earlier.</>

In [None]:
for epoch in range(num_epochs):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch, batch_loss.numpy()))
            #test

            # saving (checkpoint) the model every 2 epochs
            if (epoch + 1) % 2 == 0:
                checkpoint.save(file_prefix=checkpoint_prefix)

            print('Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch))
            print('Time taken for 1 epoch {:.2f} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 1.6246
Epoch 1 Loss 0.0000
Time taken for 1 epoch 101.21 sec

Epoch 1 Batch 100 Loss 0.8395
Epoch 1 Loss 0.0022
Time taken for 1 epoch 3452.83 sec

Epoch 1 Batch 200 Loss 0.8238
Epoch 1 Loss 0.0042
Time taken for 1 epoch 6796.55 sec

Epoch 1 Batch 300 Loss 0.7528
Epoch 1 Loss 0.0059
Time taken for 1 epoch 10138.24 sec



In [None]:
def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))

    # Tokenize the Japanese sentence to have spaces between words
    sentence = preprocess_sentence_jpn(sentence)
    inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]

    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)

        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

In [None]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
  fig = plt.figure(figsize=(10,10))
  ax = fig.add_subplot(1, 1, 1)
  ax.matshow(attention, cmap='viridis')

  fontdict = {'fontsize': 14}

  ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
  ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

  plt.show()

In [None]:
def translate(sentence):
    print("Input sentence : " + sentence)
    result, sentence, attention_plot = evaluate(sentence)
    print('Predicted translation: {}'.format(result))

    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    #plot_attention(attention_plot, sentence.split(' '), result.split(' '))




---



---



In [None]:
# Restore Checkpoint
print("Restoring latest checkpoint...")
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
print("Checkpoint restored.")

In [None]:
eval_sentence = "おはようございます"
translate(eval_sentence)
print("Actual Translation: Good morning.")