Because written Japanese does not separate words with spaces the way written English does, I used the Janome library to tokenize the Japanese sentences in the given dataset.

In [None]:
!pip install Janome

Collecting Janome
  Downloading https://files.pythonhosted.org/packages/a8/63/98858cbead27df7536c7e300c169da0999e9704d02220dc6700b804eeff0/Janome-0.4.1-py2.py3-none-any.whl (19.7MB)
Installing collected packages: Janome
Successfully installed Janome-0.4.1


In [None]:
#importing required libraries
import numpy as np
import pandas as pd
import os
import string
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from string import digits
from sklearn.utils import shuffle
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical
from tqdm import tqdm

#tokenizer for japanese sequences
from janome.tokenizer import Tokenizer as janome_tokenizer

In [None]:
#loading the dataset into the notebook
from google.colab import files
data = files.upload()

Saving trainset.csv to trainset.csv


#Loading the dataset as dataframe

In [None]:
#loading the dataset as df
df = pd.read_csv('trainset.csv')

In [None]:
df

Unnamed: 0.1,Unnamed: 0,eng,jp
0,0,my opponent is shark.,俺の相手は シャークだ。
1,1,this is one thing in exchange for another.,引き換えだ ある事とある物の
2,2,"yeah, i'm fine.",もういいよ ごちそうさま ううん
3,3,don't come to the office anymore. don't call m...,もう会社には来ないでくれ 電話もするな
4,4,looks beautiful.,きれいだ。
...,...,...,...
33819,33819,where are you?,どこに居る?
33820,33820,"i'm assuming you have a little more time, you ...",まだ時間があると思ってるんだ ちょっと黙ってろ
33821,33821,nickleby?,害虫退治です アリを焼いております
33822,33822,look at me. you don't look right to me.,私を見ろ - 俺を見るな


#EDA and Data Pre-processing

In [None]:
#getting the shape of data
df.shape

(33824, 3)

In [None]:
#dropping the index column from the train set
df = df.drop(['Unnamed: 0'], axis=1)

In [None]:
#checking for null values in the corpus
df.isnull().sum()

eng    0
jp     0
dtype: int64

In [None]:
#turning all words to lower-case in training dataset
#for english sequences
df['eng']=df['eng'].apply(lambda x: x.lower())
#for japanese sequences
df['jp']=df['jp'].apply(lambda x: x.lower())

In [None]:
#removing all punctuations and special characters from the datasets
remove_punc = set(string.punctuation)
#since the japanese punctuation is different, making a separate set
rem_jp_punc = set('、。【】「」『』…・〽（）〜？！｡：､；･')

#removing all the punctuations and special characters in training dataset
df['eng']=df['eng'].apply(lambda x: ''.join(ch for ch in x if ch not in remove_punc))
#note: this line filters with remove_punc, so rem_jp_punc is never applied and
#japanese punctuation such as 。 survives in the outputs below; filtering against
#(remove_punc | rem_jp_punc) would strip it as well
df['jp']=df['jp'].apply(lambda x: ''.join(ch for ch in x if ch not in remove_punc))

In [None]:
df

Unnamed: 0,eng,jp
0,my opponent is shark,俺の相手は シャークだ。
1,this is one thing in exchange for another,引き換えだ ある事とある物の
2,yeah im fine,もういいよ ごちそうさま ううん
3,dont come to the office anymore dont call me e...,もう会社には来ないでくれ 電話もするな
4,looks beautiful,きれいだ。
...,...,...
33819,where are you,どこに居る
33820,im assuming you have a little more time you in...,まだ時間があると思ってるんだ ちょっと黙ってろ
33821,nickleby,害虫退治です アリを焼いております
33822,look at me you dont look right to me,私を見ろ 俺を見るな


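As the output above shows, ASCII punctuation and special characters have been removed from both columns. Japanese punctuation such as 。 is still present, because the Japanese column was filtered with the same ASCII set.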
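A minimal sketch of stripping both ASCII and Japanese punctuation in one pass, assuming the same Japanese character set as in the cell above. The helper name `strip_punct` and the constant names are hypothetical and not part of the notebook.

```python
import string

# ASCII punctuation plus the Japanese punctuation listed in the cell above.
ASCII_PUNCT = string.punctuation
JP_PUNCT = '、。【】「」『』…・〽（）〜？！｡：､；･'

# One translation table that deletes every character from both sets.
STRIP_TABLE = str.maketrans('', '', ASCII_PUNCT + JP_PUNCT)

def strip_punct(text):
    """Return text with ASCII and Japanese punctuation removed (hypothetical helper)."""
    return text.translate(STRIP_TABLE)

print(strip_punct("yeah, i'm fine."))  # yeah im fine
print(strip_punct('きれいだ。'))         # きれいだ
```

Applied to a dataframe column, this could be used as `df['jp'].apply(strip_punct)`. Doing so would change the vocabulary and the outputs recorded in this notebook.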

In [None]:
#removing numbers, if present, in the datasets
remove_dig = str.maketrans('', '', digits)
df['eng']=df['eng'].apply(lambda x: x.translate(remove_dig))
df['jp']=df['jp'].apply(lambda x: x.translate(remove_dig))

In [None]:
#adding the length of each sequence in the training dataset
df['len_eng_seq']=df['eng'].apply(lambda x:len(x.split(" ")))
df['len_jp_seq']=df['jp'].apply(lambda x:len(x.split(" ")))

In [None]:
df

Unnamed: 0,eng,jp,len_eng_seq,len_jp_seq
0,my opponent is shark,俺の相手は シャークだ。,4,2
1,this is one thing in exchange for another,引き換えだ ある事とある物の,8,2
2,yeah im fine,もういいよ ごちそうさま ううん,3,3
3,dont come to the office anymore dont call me e...,もう会社には来ないでくれ 電話もするな,10,2
4,looks beautiful,きれいだ。,2,1
...,...,...,...,...
33819,where are you,どこに居る,3,1
33820,im assuming you have a little more time you in...,まだ時間があると思ってるんだ ちょっと黙ってろ,11,2
33821,nickleby,害虫退治です アリを焼いております,1,2
33822,look at me you dont look right to me,私を見ろ 俺を見るな,9,3


In [None]:
#initializing the tokenizer
token_jp = janome_tokenizer()

In [None]:
#applying to japanese sentences in the dataset
df['jp'] = [' '.join([word for word in token_jp.tokenize(x, wakati=True) \
                      if word != ' ']) for x in tqdm(df['jp'])]

100%|██████████| 33824/33824 [01:00<00:00, 558.11it/s]


In [None]:
#splitting english sentences into words
df['eng'] =df['eng'].apply(lambda row: row.split())

In [None]:
#splitting japanese sentences into words
df['jp']=df['jp'].apply(lambda row: row.split())

In [None]:
df

Unnamed: 0,eng,jp,len_eng_seq,len_jp_seq
0,"[my, opponent, is, shark]","[俺, の, 相手, は, シャーク, だ, 。]",4,2
1,"[this, is, one, thing, in, exchange, for, anot...","[引き換え, だ, ある, 事, と, ある, 物, の]",8,2
2,"[yeah, im, fine]","[もう, いい, よ, ごちそうさま, ううん]",3,3
3,"[dont, come, to, the, office, anymore, dont, c...","[もう, 会社, に, は, 来, ない, で, くれ, 電話, も, する, な]",10,2
4,"[looks, beautiful]","[きれい, だ, 。]",2,1
...,...,...,...,...
33819,"[where, are, you]","[どこ, に, 居る]",3,1
33820,"[im, assuming, you, have, a, little, more, tim...","[まだ, 時間, が, ある, と, 思っ, てる, ん, だ, ちょっと, 黙っ, てろ]",11,2
33821,[nickleby],"[害虫, 退治, です, アリ, を, 焼い, て, おり, ます]",1,2
33822,"[look, at, me, you, dont, look, right, to, me]","[私, を, 見ろ, 俺, を, 見る, な]",9,3


The above output shows how each word has been separated.

In [None]:
#removing all rows where the english sentence exceeds 6 words
df=df[df['len_eng_seq']<=6]

In [None]:
#removing all rows where the japanese sentence exceeds 6 words
df=df[df['len_jp_seq']<=6]

In [None]:
df.shape

(18066, 4)

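I performed the step above because, without this filter, the dataset was too heavy for my computer to handle. Only rows whose length columns are at most 6 for both English and Japanese are kept. Note that len_jp_seq was computed by splitting on spaces before the Janome tokenization, so for Japanese it counts space-separated chunks rather than tokens; Japanese sentences with more than 6 tokens can therefore still pass the filter.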
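As a rough illustration of a length filter that works on token counts, the sketch below checks the length of already tokenized sentences. The function `within_length`, the constant `MAX_TOKENS`, and the toy pairs are assumptions made for this example only.

```python
MAX_TOKENS = 6  # same limit as in the filter above

def within_length(src_tokens, tgt_tokens, limit=MAX_TOKENS):
    """True when both token lists have at most `limit` items."""
    return len(src_tokens) <= limit and len(tgt_tokens) <= limit

# Toy pairs of already tokenized sentences (Japanese tokens, English tokens).
pairs = [
    (['きれい', 'だ'], ['looks', 'beautiful']),
    (['もう', '会社', 'に', 'は', '来', 'ない', 'で', 'くれ'], ['dont', 'come']),
]

kept = [pair for pair in pairs if within_length(*pair)]
print(len(kept))  # 1
```

The second pair is dropped because its Japanese side has 8 tokens, even though it would count as a single chunk when split on spaces.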

In [None]:
#saving the sentences into X and y
X = df['jp'].values
y = df['eng'].values

In [None]:
#getting the vocabulary for english and japanese sequences by tokenizing using Keras Tokenizer
#tokenizing eng sentences
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(y)

#tokenizing japanese sentences
jp_tokenizer = Tokenizer()
jp_tokenizer.fit_on_texts(X)

In [None]:
#getting vocab size for english
eng_vocab_size = len(eng_tokenizer.word_index) + 1 

#getting vocab size for japanese
jp_vocab_size = len(jp_tokenizer.word_index) + 1

In [None]:
print('English vocab size:', eng_vocab_size)
print('Japanese vocab size:', jp_vocab_size)

English vocab size: 9621
Japanese vocab size: 12739
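The `+ 1` in the vocabulary size accounts for index 0, which is kept free for padding. The sketch below imitates, in plain Python, how a frequency-ranked word index starting at 1 can be built; `build_word_index` is a hypothetical stand-in for illustration and not the Keras implementation.

```python
from collections import Counter

def build_word_index(tokenized_sentences):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = [['where', 'are', 'you'], ['you', 'come', 'here']]
word_index = build_word_index(sentences)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index['you'])  # 1
print(vocab_size)         # 6
```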


In [None]:
#getting max length for the longest japanese sentence
jp_len = max(df['len_jp_seq'])
#getting max length for the longest english sentence
eng_len = max(df['len_eng_seq'])

In [None]:
#splitting the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

In [None]:
#printing the shapes of test and train data
print('Size of X_train', X_train.shape)
print('Size of y_train', y_train.shape)
print('Size of X_test', X_test.shape)
print('Size of y_test', y_test.shape)

Size of X_train (16259,)
Size of y_train (16259,)
Size of X_test (1807,)
Size of y_test (1807,)


Converting English and Japanese sentences into sequences

In [None]:
#japanese sentences to sequences
X_train = jp_tokenizer.texts_to_sequences(X_train)
X_test = jp_tokenizer.texts_to_sequences(X_test)
#english sentences to sequences
y_train = eng_tokenizer.texts_to_sequences(y_train)
y_test = eng_tokenizer.texts_to_sequences(y_test)

In [None]:
#padding the sequences
X_train = pad_sequences(X_train, padding='post', maxlen = jp_len)
X_test = pad_sequences(X_test, padding='post', maxlen = jp_len)
y_train = pad_sequences(y_train, padding='post', maxlen = eng_len)
y_test = pad_sequences(y_test, padding='post', maxlen = eng_len)
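To make the padding step concrete, here is a small pure-Python sketch of what `padding='post'` does. With the default `truncating='pre'` of `pad_sequences`, a sequence longer than `maxlen` loses items from its front. The helper `pad_post` is hypothetical and only imitates that behaviour for a single list.

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end; if too long, keep only the last `maxlen` items."""
    trimmed = seq[-maxlen:]  # 'pre' truncation drops items from the front
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([11, 1094, 440], 6))           # [11, 1094, 440, 0, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7, 8], 6))  # [3, 4, 5, 6, 7, 8]
```

The second call shows that any sentence longer than `maxlen` is silently shortened, which matters here if a Japanese sentence has more tokens than `jp_len`.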

In [None]:
#function to one-hot encode y_train and y_test
def encode_output(sequences, vocab_size):
    ylist = list()
    for seq in sequences:
        encoded = to_categorical(seq, num_classes=vocab_size)
        ylist.append(encoded)
    y = np.array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

In [None]:
#passing y_train and y_test to encode_output to be one-hot encoded
y_train = encode_output(y_train, eng_vocab_size)
y_test = encode_output(y_test, eng_vocab_size)
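As a small illustration of what the one-hot step produces, the sketch below encodes a short list of ids without any library. The function `one_hot` is a hypothetical simplification of `to_categorical` for a single sequence.

```python
def one_hot(seq, vocab_size):
    """Turn a list of integer ids into a list of one-hot rows."""
    rows = []
    for idx in seq:
        row = [0] * vocab_size
        row[idx] = 1
        rows.append(row)
    return rows

encoded = one_hot([2, 1, 0], vocab_size=4)
print(encoded)  # [[0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
```

Each id becomes a row as wide as the vocabulary, so with the shapes reported in this notebook the encoded training targets hold 16259 × 6 × 9621 entries (about 9.4 × 10^8), which helps explain why memory was a constraint.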

#Building the Model

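Here, I have built a simple seq2seq model with LSTMs: an encoder LSTM compresses the Japanese sequence into a single vector, which is repeated once per output position and decoded by a second LSTM. This is a baseline, and a more capable architecture would likely learn the translation better.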

In [None]:
#defining a simple seq2seq model
model = Sequential()
model.add(Embedding(jp_vocab_size, 256, input_length=jp_len, mask_zero=True))
model.add(LSTM(256))
model.add(RepeatVector(eng_len))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(eng_vocab_size, activation='softmax')))

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
#printing model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 6, 256)            3261184   
_________________________________________________________________
lstm (LSTM)                  (None, 256)               525312    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 6, 256)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 6, 256)            525312    
_________________________________________________________________
time_distributed (TimeDistri (None, 6, 9621)           2472597   
Total params: 6,784,405
Trainable params: 6,784,405
Non-trainable params: 0
_________________________________________________________________
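The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the vocabulary sizes printed earlier and the 256 units used in each layer; the variable names are chosen for this example only.

```python
jp_vocab_size, eng_vocab_size, units = 12739, 9621, 256

# Embedding: one 256-dimensional vector per Japanese token id.
embedding = jp_vocab_size * units
# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm = 4 * ((units + units) * units + units)
# Dense softmax layer: weights plus one bias per English token id.
dense = units * eng_vocab_size + eng_vocab_size

total = embedding + 2 * lstm + dense
print(embedding, lstm, dense, total)  # 3261184 525312 2472597 6784405
```

Both LSTM layers receive 256-dimensional inputs (the embedding output and the repeated encoder vector), which is why they have the same number of parameters.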


Saving the model

In [None]:
#saving the model (note: at this point the model has not been trained yet)
model.save("model.bin")



INFO:tensorflow:Assets written to: model.bin/assets




#Training the Model

In [None]:
#training the model
model.fit(X_train, y_train, epochs=50, batch_size=64, validation_data=(X_test, y_test), verbose=1)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fd8b3ed8dd0>

Each epoch took around 1 minute and 15 seconds to train, which is why I kept the number of epochs at 50. The accuracy is now at 79%; training for more epochs may raise it further.

#Model Prediction

In [None]:
#displaying the predictions
model.predict(X_test[0].reshape((1, X_train[0].shape[0])))[0]

array([[2.5143189e-07, 1.7986888e-02, 1.3361334e-02, ..., 8.2269719e-11,
        3.9959732e-13, 3.3969144e-08],
       [8.5694711e-07, 6.8564042e-03, 7.1941578e-04, ..., 1.3771585e-08,
        3.6784738e-11, 3.2449901e-08],
       [1.0525955e-06, 2.8157169e-03, 1.3400968e-02, ..., 7.6063640e-13,
        4.3327755e-10, 2.2884659e-10],
       [1.4311839e-06, 1.9597939e-04, 9.7735031e-03, ..., 4.1919295e-14,
        2.9205838e-10, 2.1282805e-10],
       [9.1502030e-04, 3.6846433e-02, 5.3506184e-05, ..., 2.2830158e-16,
        2.4489810e-13, 4.7529145e-09],
       [5.9523141e-01, 3.1472158e-02, 4.3592536e-08, ..., 1.2722652e-18,
        1.7205449e-17, 2.3052278e-08]], dtype=float32)

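The output above is not a sentence yet: each of the 6 rows is a probability distribution over the English vocabulary for one output position. To read it as a sentence, the most probable index is taken at each position and mapped back to its word. The following code attempts to achieve that.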
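As a toy illustration of the first half of that mapping, the sketch below picks the most probable id at each position from a small hand-written probability matrix. The helper `argmax` and the numbers are made up for this example; the notebook itself uses `np.argmax`.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Toy "prediction": 3 positions, vocabulary of 4 ids (id 0 stands for padding).
probs = [
    [0.05, 0.10, 0.80, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
]
ids = [argmax(row) for row in probs]
print(ids)  # [2, 1, 0]
```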

In [None]:
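#mapping an integer id back to its word
#(despite its name, word_to_id takes an id and returns the matching word,
#or None when the id is not in the vocabulary, e.g. the padding id 0)
def word_to_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None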
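The function above scans the whole vocabulary on every call. One possible alternative, sketched here under the assumption that the vocabulary is a plain word-to-id dictionary, is to invert it once and then look ids up directly. The helper names and the toy dictionary are hypothetical; the three ids are borrowed from the example output in this notebook.

```python
def invert_index(word_index):
    """Build an id -> word dictionary from a word -> id dictionary."""
    return {index: word for word, index in word_index.items()}

def ids_to_words(ids, index_word):
    """Translate ids to words, stopping at the first unknown id (such as padding 0)."""
    words = []
    for i in ids:
        word = index_word.get(i)
        if word is None:
            break
        words.append(word)
    return words

toy_word_index = {'and': 11, 'therefore': 1094, 'kids': 440}
index_word = invert_index(toy_word_index)
print(ids_to_words([11, 1094, 440, 0, 0, 0], index_word))  # ['and', 'therefore', 'kids']
```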

In [None]:
#displaying a source sentence that has been mapped from sequences to sentence
sentence = [word_to_id(x, jp_tokenizer) for x in X_train[0]]
sentence

['という', 'こと', 'は', '子供', 'が', 'いる']

In [None]:
#displaying the target sentence
tar = [np.argmax(vector) for vector in y_train[0]]
tar

[11, 1094, 440, 0, 0, 0]

In [None]:
#mapping the target language sequence to sentence
translation = []
for i in tar:
    word = word_to_id(i, eng_tokenizer)
    if word is None:
        break
    translation.append(word)

In [None]:
#displaying the sentence
translation

['and', 'therefore', 'kids']

In [None]:
#predicting the target(english) sequence when fed a source(japanese) sequence
def predict_sequence(model, tokenizer, source):
    source = source.reshape((1, source.shape[0]))
    prediction = model.predict(source, verbose=0)[0]
    integers = [np.argmax(vector) for vector in prediction]
    target = []
    for i in integers:
        word = word_to_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)

In [None]:
#function to map the japanese sequences to sentences
def get_japanese(row):
    words = [word_to_id(x, jp_tokenizer) for x in row]
    words = [word for word in words if word is not None]
    return ' '.join(words)

In [None]:
#function to map english sequences to sentences
def get_english(row):
    ints = [np.argmax(vector) for vector in row]
    target = []
    for i in ints:
        word = word_to_id(i, eng_tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)

In [None]:
#displaying mapped sentence for japanese
get_japanese(X_train[1])

'違う あれ は 。'

In [None]:
#displaying the same sentence for english
get_english(y_train[1])

'that is'

In [None]:
#displaying the prediction made by the model for the same source sentence
predict_sequence(model, eng_tokenizer, X_train[1])

'that is'

The output above shows that the model reproduced the reference translation exactly for this sentence, which comes from the training set. The code below loops through the first 40 training examples and prints, for each one, the source sentence, the target translation given in the dataset, and the model's prediction for that source sentence.

In [None]:
#looping through the dataset and getting the predictions made by the model
for i in range(40):
    print("The source sentence: ", get_japanese(X_train[i]))
    print("The translation in target language: ", get_english(y_train[i]))
    print("Prediction made by the model: ", predict_sequence(model, eng_tokenizer, X_train[i]))
    print('..........\nNext Prediction\n')

The source sentence:  という こと は 子供 が いる
The translation in target language:  and therefore kids
Prediction made by the model:  and therefore kids
..........
Next Prediction

The source sentence:  違う あれ は 。
The translation in target language:  that is
Prediction made by the model:  that is
..........
Next Prediction

The source sentence:  いる ん でしょ
The translation in target language:  i do
Prediction made by the model:  i
..........
Next Prediction

The source sentence:  ありがとう ござい まし た 。
The translation in target language:  okay thank you
Prediction made by the model:  thank you you much
..........
Next Prediction

The source sentence:  よ じゃあ お前 ここ 来い 。
The translation in target language:  you come here come here
Prediction made by the model:  come you come here
..........
Next Prediction

The source sentence:  说起来 你们到 陆地上 感 觉如何
The translation in target language:  come on
Prediction made by the model:  come on
..........
Next Prediction

The source sentence:  いい いい わ よ
The translation in

**Conclusion:** From these predictions we can deduce that the model did reasonably well on the examples shown, which come from the training set (the loop above reads from X_train). It seems to give its best predictions for shorter sentences, and tends to make mistakes on longer ones.

All in all, training the model for longer may produce more accurate predictions, although this would need to be checked on the held-out test set.
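One simple way to put a number on such a comparison is to count how many reference tokens the prediction reproduces at the same position. The sketch below is only an illustration: `token_accuracy` is a hypothetical helper, and the two sentence pairs are copied from the printed predictions above.

```python
def token_accuracy(references, predictions):
    """Share of reference tokens reproduced at the same position."""
    correct = total = 0
    for ref, hyp in zip(references, predictions):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        total += len(ref_tokens)
        correct += sum(r == h for r, h in zip(ref_tokens, hyp_tokens))
    return correct / total if total else 0.0

refs = ['that is', 'okay thank you']
hyps = ['that is', 'thank you you much']
print(token_accuracy(refs, hyps))  # 0.6
```

In this notebook, the references could come from `get_english(y_test[i])` and the predictions from `predict_sequence(model, eng_tokenizer, X_test[i])`, so that the score reflects sentences the model did not see during training.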