# English to Katakana seq2seq model (credit to wanasit)

The data is a .csv file with English words in the first column and their katakana spelling in the second column (built by wanasit, see https://wanasit.github.io/english-to-katakana-using-sequence-to-sequence-in-keras.html for more details). We're going to build a model that automatically converts English to katakana.

In [1]:
import os
import pandas as pd
import numpy as np

from keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense
from keras.models import Model, load_model

Using TensorFlow backend.


In [5]:
data = pd.read_csv('./trainingdata/joined_titles.csv', header=None)
print(data.head(20))

                          0                1
0               Unschooling         アンスクーリング
1                  Lovosice           ロヴォシツェ
2                     Milch              ミルヒ
3                      Juva              ユヴァ
4                 Brembilla           ブレンビッラ
5                     Sa Pa               サパ
6                   Brumano            ブルマーノ
7                Brusaporto           ブルザポルト
8                  Deventer          デーフェンテル
9                  Enschede           エンスヘーデ
10                   Tandil            タンディル
11               Buckypaper         バッキーペーパー
12                Bastiglia          バスティーリア
13          Personalization      パーソナライゼーション
14               Mandalgovi           マンダルゴビ
15                 Bomporto            ボンポルト
16            Campogalliano        カンポガッリアーノ
17              Haaksbergen         ハークスベルヘン
18               Camposanto           カンポサント
19      Castelfranco Emilia   カステルフランコ・エミーリア

Let's first turn this data into two lists for training: X (the lowercased English words) and Y (the katakana words).

In [9]:
X = [word.lower() for word in data[0]]
Y = [word for word in data[1]]

We're not done yet. Our model only takes numerical data, so we cannot feed it strings directly. We have to build an integer encoding for each character, in English and in katakana, and also keep the reverse mapping so we can decode the output of the model at the end. Also remember that the model only takes inputs of the same size, so we have to pad the shorter sequences. Let's use 0 for padding, 1 for start of sequence, and give every distinct character its own integer code starting at 2. Which code a character gets does not matter, as long as the same mapping is used for encoding and decoding.

In [11]:
START_CHAR_CODE = 1

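def encode_characters(words):
    """Build char -> int and int -> char mappings for the characters in `words`.

    Code 0 is reserved for padding and 1 for the start-of-sequence marker,
    so character codes start at 2. The third return value is the next unused
    code, i.e. the total number of codes including the two reserved ones.
    """
    count = 2
    encoding = {}
    decoding = {START_CHAR_CODE: 'START'}
    # Collect the distinct characters of the data set (set order is arbitrary).
    for c in {char for word in words for char in word}:
        encoding[c] = count
        decoding[count] = c
        count += 1
    return encoding, decoding, count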
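
To make the padding idea concrete, here is a minimal sketch of how these mappings might be used to turn words into fixed-length sequences of integers and back. The helpers `encode_and_pad` and `decode_row` and the constant `PADDING_CODE` are hypothetical names introduced for this illustration only, and `encode_characters` is repeated so the snippet runs on its own. It uses plain Python lists; for training, the result would typically be converted to a NumPy array.

```python
PADDING_CODE = 0      # assumption: name for the padding code described above
START_CHAR_CODE = 1   # start-of-sequence marker, as in the notebook


def encode_characters(words):
    """Same idea as the function above: one integer code per distinct character."""
    count = 2
    encoding = {}
    decoding = {START_CHAR_CODE: 'START'}
    for c in {char for word in words for char in word}:
        encoding[c] = count
        decoding[count] = c
        count += 1
    return encoding, decoding, count


def encode_and_pad(words, encoding, length):
    """Hypothetical helper: encode each word, then right-pad it with zeros.

    Words longer than `length` are truncated to `length` characters.
    """
    rows = []
    for word in words:
        codes = [encoding[c] for c in word][:length]
        rows.append(codes + [PADDING_CODE] * (length - len(codes)))
    return rows


def decode_row(row, decoding):
    """Hypothetical helper: map codes back to characters, skipping padding."""
    return ''.join(decoding[code] for code in row if code != PADDING_CODE)


# Two words taken from the sample data, lowercased like X.
words = ['milch', 'juva']
encoding, decoding, dict_size = encode_characters(words)
max_len = max(len(word) for word in words)

padded = encode_and_pad(words, encoding, max_len)
print(padded)                                        # every row has length max_len
print([decode_row(row, decoding) for row in padded])  # round-trips to the input words
```

The exact integers in `padded` can differ between runs, because the codes follow the iteration order of a set. This is harmless within one session, but it suggests saving the `encoding` and `decoding` dictionaries together with any trained model.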