# Natural Language Processing
Train an RNN (Recurrent Neural Network) to learn addition.

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, RepeatVector, LSTM, TimeDistributed
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

Using TensorFlow backend.


In [2]:
np.random.seed(0)

In [3]:
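def n(digits=3):
    """Return a random non-negative integer with at most `digits` digits."""
    number = ''
    for i in range(np.random.randint(1, digits + 1)):
        number += np.random.choice(list('0123456789'))
    return int(number)


def padding(chars, maxlen):
    """Right-pad `chars` with spaces up to `maxlen` characters."""
    return chars + ' ' * (maxlen - len(chars))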

In [4]:
'''
Data generation
'''
N = 20000
N_train = int(N * 0.9)
N_validation = N - N_train

digits = 3  # maximum number of digits per operand
input_digits = digits * 2 + 1  # e.g. 123+456
output_digits = digits + 1  # sums of 1000 or more (e.g. 500+500) need 4 digits

added = set()
questions = []
answers = []

while len(questions) < N:
    a, b = n(), n()  # generate two random numbers

    pair = tuple(sorted((a, b)))
    if pair in added:
        continue

    question = '{}+{}'.format(a, b)
    question = padding(question, input_digits)
    answer = str(a + b)
    answer = padding(answer, output_digits)

    added.add(pair)
    questions.append(question)
    answers.append(answer)

chars = '0123456789+ '
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# np.integer is an abstract type and is rejected as a dtype by newer NumPy; use a concrete one
X = np.zeros((len(questions), input_digits, len(chars)), dtype=np.int64)
Y = np.zeros((len(questions), output_digits, len(chars)), dtype=np.int64)

for i in range(N):
    for t, char in enumerate(questions[i]):
        X[i, t, char_indices[char]] = 1
    for t, char in enumerate(answers[i]):
        Y[i, t, char_indices[char]] = 1

X_train, X_validation, Y_train, Y_validation = \
    train_test_split(X, Y, train_size=N_train)
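
To make the encoding concrete, here is a minimal sketch of how a single padded question becomes a one-hot matrix and is decoded back. It reuses the character table from the cell above; the helpers `encode` and `decode` are hypothetical names introduced only for illustration and are not part of the notebook.

```python
import numpy as np

chars = '0123456789+ '  # same vocabulary as the notebook
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}


def encode(text, maxlen):
    """One-hot encode `text`, right-padded with spaces to `maxlen` (hypothetical helper)."""
    text = text + ' ' * (maxlen - len(text))
    x = np.zeros((maxlen, len(chars)), dtype=np.int64)
    for t, char in enumerate(text):
        x[t, char_indices[char]] = 1
    return x


def decode(x):
    """Turn a one-hot (or probability) matrix back into a string (hypothetical helper)."""
    return ''.join(indices_char[i] for i in x.argmax(axis=-1))


x = encode('12+345', 7)
print(x.shape)          # (7, 12)
print(repr(decode(x)))  # '12+345 '
```

Each row of the matrix corresponds to one character position, so the trailing space used for padding is encoded like any other symbol.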



In [5]:
'''
Model setup
'''
n_in = len(chars)
n_hidden = 128
n_out = len(chars)

model = Sequential()

# Encoder
model.add(LSTM(n_hidden, input_shape=(input_digits, n_in)))

# Decoder
model.add(RepeatVector(output_digits))
model.add(LSTM(n_hidden, return_sequences=True))

model.add(TimeDistributed(Dense(n_out)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001, beta_1=0.9, beta_2=0.999),
              metrics=['accuracy'])
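
The encoder LSTM, which by default returns only its last output, compresses the whole question into a single vector of size `n_hidden`. `RepeatVector(output_digits)` then copies that vector once per output position so the decoder LSTM receives a sequence. The NumPy sketch below mimics only this shape bookkeeping; random values stand in for real LSTM outputs, and the names `batch`, `encoded`, and `repeated` are illustrative assumptions.

```python
import numpy as np

batch, n_hidden, output_digits = 2, 128, 4  # batch size is illustrative

# Stand-in for the encoder output: one vector per input sequence.
encoded = np.random.rand(batch, n_hidden)

# Shape effect of RepeatVector(output_digits):
# (batch, n_hidden) -> (batch, output_digits, n_hidden)
repeated = np.repeat(encoded[:, np.newaxis, :], output_digits, axis=1)

print(repeated.shape)  # (2, 4, 128)
# Every time step carries the same copy of the encoded vector.
print(np.array_equal(repeated[:, 0, :], repeated[:, 3, :]))  # True
```

The decoder therefore sees identical input at every step and must rely on its own recurrent state to emit different characters at different positions.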

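## Training
The string "a+b" is given as the input and "t" (the value of a+b written as a string) as the target. Every 20 epochs, the model trained so far predicts the answers to ten random validation questions and the results are printed: Q is the question, A is the model's prediction, and T/F indicates whether the prediction matches the target. The model is never told how addition works, that "a" and "b" are numbers, or what the "+" symbol means. Nevertheless, as training proceeds it learns to predict sums from the set of input/target pairs alone.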

In [6]:
'''
Model training
'''
epochs = 200
batch_size = 200
sampling = 20

for epoch in range(epochs // sampling):
    model.fit(X_train, Y_train, batch_size=batch_size, epochs=sampling,
              validation_data=(X_validation, Y_validation))

    print("\nEpoch: {0}/{1}".format((epoch+1) * sampling, epochs))
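    # Pick random questions from the validation data and check the answers
    for i in range(10):
        index = np.random.randint(0, N_validation)
        question = X_validation[np.array([index])]
        answer = Y_validation[np.array([index])]
        # predict_classes() is unavailable in newer Keras; argmax is equivalent here
        prediction = model.predict(question, verbose=0).argmax(axis=-1)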

        question = question.argmax(axis=-1)
        answer = answer.argmax(axis=-1)

        q = ''.join(indices_char[i] for i in question[0])
        a = ''.join(indices_char[i] for i in answer[0])
        p = ''.join(indices_char[i] for i in prediction[0])

        print('-' * 10)
        print('Q:  ', q)
        print('A:  ', p)
        print('T/F:', end=' ')
        if a == p:
            print('T')
        else:
            print('F')
    print('-' * 10 + '\n')

Train on 18000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch: 20/200
----------
Q:   190+0  
A:   110 
T/F: F
----------
Q:   747+45 
A:   711 
T/F: F
----------
Q:   714+889
A:   1551
T/F: F
----------
Q:   37+188 
A:   190 
T/F: F
----------
Q:   80+348 
A:   413 
T/F: F
----------
Q:   956+50 
A:   1015
T/F: F
----------
Q:   618+686
A:   1355
T/F: F
----------
Q:   379+52 
A:   411 
T/F: F
----------
Q:   379+87 
A:   455 
T/F: F
----------
Q:   459+36 
A:   583 
T/F: F
----------

Train on 18000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Ep

Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch: 60/200
----------
Q:   778+58 
A:   836 
T/F: T
----------
Q:   12+319 
A:   331 
T/F: T
----------
Q:   95+124 
A:   219 
T/F: T
----------
Q:   0+404  
A:   404 
T/F: T
----------
Q:   77+67  
A:   144 
T/F: T
----------
Q:   5+182  
A:   187 
T/F: T
----------
Q:   376+47 
A:   433 
T/F: F
----------
Q:   516+48 
A:   564 
T/F: T
----------
Q:   456+86 
A:   542 
T/F: T
----------
Q:   832+33 
A:   865 
T/F: T
----------

Train on 18000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch: 80/200
----------
Q:   1+531  
A:   532 
T/F: T
----------
Q:   946+7  
A:   953 
T/F: T
----------
Q:   468+23 
A:   491 
T/F: T
----------
Q:   22+55  
A:   77  
T/F: T
--------

Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch: 160/200
----------
Q:   34+20  
A:   54  
T/F: T
----------
Q:   74+36  
A:   110 
T/F: T
----------
Q:   573+37 
A:   610 
T/F: T
----------
Q:   794+40 
A:   824 
T/F: F
----------
Q:   667+14 
A:   681 
T/F: T
----------
Q:   16+80  
A:   96  
T/F: T
----------
Q:   272+6  
A:   278 
T/F: T
----------
Q:   0+89   
A:   89  
T/F: T
----------
Q:   0+582  
A:   582 
T/F: T
----------
Q:   49+536 
A:   595 
T/F: F
----------

Train on 18000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch: 180/200
----------
Q:   263+810
A:   1073
T/F: T
----------
Q:   459+997
A:   1445
T/F: F
----------
Q:   6+976  
A:   982 
T/F: T
----------
Q:   197+2  
A:   199 
T/F: T
----------
Q:   1+

In [7]:
model.save("addproblem.h5")
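
The spot checks during training look at only ten random questions at a time, which gives a rough impression at best. A whole-answer score counts a prediction as correct only when every character matches the target. The following is a minimal sketch, assuming one-hot targets and predicted probabilities of shape `(samples, output_digits, len(chars))`; `exact_match_accuracy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def exact_match_accuracy(y_true, y_prob):
    """Fraction of sequences whose predicted characters all match the target."""
    true_ids = y_true.argmax(axis=-1)  # (samples, timesteps)
    pred_ids = y_prob.argmax(axis=-1)  # (samples, timesteps)
    return float(np.mean(np.all(true_ids == pred_ids, axis=1)))


# Toy data: 2 samples, 2 time steps, 3 classes.
y_true = np.array([[[1, 0, 0], [0, 1, 0]],
                   [[0, 0, 1], [1, 0, 0]]])
y_prob = np.array([[[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],   # both characters right
                   [[0.1, 0.1, 0.8], [0.3, 0.6, 0.1]]])  # second character wrong
print(exact_match_accuracy(y_true, y_prob))  # 0.5
```

With the notebook's variables this could be called as `exact_match_accuracy(Y_validation, model.predict(X_validation))`.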

In [8]:
ls

Overview.ipynb  README.md  Sample1.ipynb  Sample3.ipynb  addproblem.h5
