# Chatbot

### Goal: Train a deep learning model for a chatbot using NMT techniques (seq2seq model, encoder-decoder model)
(NMT: Neural Machine Translation)

- Preparing text input data: tokenizing, pad_sequences
- Understanding the encoder-decoder
- Understanding RNN and LSTM
- Using the functional API to build the deep learning model

In [1]:
import os
import yaml

dir_path = './raw_data'
files_list = os.listdir(dir_path + os.sep)

In [2]:
questions = list()
answers = list()
for filepath in files_list:
    # Open each YAML file in a context manager so the file handle is closed
    with open(dir_path + os.sep + filepath, 'rb') as stream:
        docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        if len(con)>2:
            questions.append(con[0])
            replies=con[1:]
            ans = ''
            for rep in replies:
                ans += ' ' + rep
            answers.append(ans)
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])
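
The pairing logic in the cell above can be sketched without any YAML files. This is a minimal illustration, not the notebook's exact code: `split_conversations` and the sample data are made up, and this variant joins replies with single spaces (the cell above also leaves a leading space on multi-reply answers). The `str()` call is a defensive assumption in case a YAML value is parsed as a non-string.

```python
def split_conversations(conversations):
    """Split each conversation into a question and a single answer string.

    The first utterance is the question; every following utterance is
    joined into one answer. Conversations with one utterance are skipped.
    """
    questions, answers = [], []
    for con in conversations:
        if len(con) < 2:
            continue
        questions.append(con[0])
        answers.append(' '.join(str(rep) for rep in con[1:]))
    return questions, answers

# Made-up sample in the same shape as docs['conversations']
sample = [
    ["What is AI?", "A branch of computer science."],
    ["Tell me a joke", "Why did the robot cross the road?", "It was programmed to."],
    ["Hello"],
]
questions, answers = split_conversations(sample)
```

Each question ends up aligned with exactly one answer string, which is what the encoder and decoder inputs below rely on.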

### Preparing input data for the encoder

The Encoder model is fed preprocessed English sentences. <br>The preprocessing steps are:
- Tokenize the sentences
- Determine the maximum sentence length (max_input_length)
- Pad the tokenized sentences to max_input_length
- Determine the vocabulary size (num_tokens) over the entire word set
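
The four steps above can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their implementation: `fit_vocab`, `to_padded`, and the sample sentences are hypothetical, and punctuation handling is left out.

```python
from collections import Counter

def fit_vocab(sentences):
    """Map each word to an integer index, most frequent word first (index 1).

    Index 0 is left free so it can serve as the padding value.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() orders by count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_padded(sentences, vocab):
    """Convert sentences to index lists and right-pad them with zeros."""
    seqs = [[vocab[w] for w in s.lower().split() if w in vocab] for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    padded = [seq + [0] * (max_len - len(seq)) for seq in seqs]
    return padded, max_len

sample = ["what is ai", "what is a robot", "tell me a joke"]  # made-up sentences
vocab = fit_vocab(sample)
padded, max_input_length = to_padded(sample, vocab)
num_tokens = len(vocab) + 1  # +1 for the reserved padding index 0
```

The `+ 1` on the vocabulary size mirrors `len(questions_word_dict) + 1` in the notebook: index 0 never appears in the word dictionary but still needs a slot in the embedding table.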

<img src="https://qph.fs.quoracdn.net/main-qimg-7dab66200fb636d8eb882475e6a4fe87">

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

In [3]:
from keras import preprocessing
import numpy as np

In [4]:
## Tokenize the sentences

tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions)
tokenized_questions = tokenizer.texts_to_sequences(questions)

In [5]:
len(tokenized_questions)

523

In [6]:
## Find the length of the longest sentence -> used as maxlen when building the input tensor

length_list = list()

for token_seq in tokenized_questions:
    length_list.append(len(token_seq))

max_input_length = np.array(length_list).max()
print('questions max length is {}'.format(max_input_length))

questions max length is 22


In [7]:
## Build the input tensor with pad_sequences

padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=max_input_length,
                                                       padding='post')
encoder_input = np.array(padded_questions)
print('Encoder input data shape --> {}'.format(encoder_input.shape))
encoder_input[0]

Encoder input data shape --> (523, 22)


array([  4,   3, 109,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0])
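
The trailing zeros in the row above are padding added by `pad_sequences`. As a rough sketch of the mask an `Embedding` layer with `mask_zero=True` derives from such a row (real tokens are those not equal to 0), assuming a hypothetical helper `padding_mask` and a shortened copy of the row:

```python
def padding_mask(padded_row, pad_value=0):
    """True for real tokens, False for padding positions."""
    return [token != pad_value for token in padded_row]

# First indices of the row shown above, shortened to 6 positions
row = [4, 3, 109, 0, 0, 0]
mask = padding_mask(row)
real_length = sum(mask)  # number of non-padding tokens
```

Downstream layers that support masking skip the `False` positions, so the padded tail does not influence the LSTM state.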

In [8]:
## Build the vocabulary dictionary for the whole dataset from word_index

questions_word_dict = tokenizer.word_index
num_questions_tokens = len(questions_word_dict) + 1
print("Number of questions tokens = {}".format(num_questions_tokens))

Number of questions tokens = 471


In [9]:
questions_word_dict

{'you': 1,
 'are': 2,
 'is': 3,
 'what': 4,
 'a': 5,
 'me': 6,
 'tell': 7,
 'do': 8,
 'the': 9,
 'joke': 10,
 'not': 11,
 'your': 12,
 'to': 13,
 'how': 14,
 'can': 15,
 'it': 16,
 'of': 17,
 'i': 18,
 'like': 19,
 'who': 20,
 'computer': 21,
 'die': 22,
 'eat': 23,
 'stock': 24,
 'get': 25,
 'about': 26,
 'market': 27,
 'hal': 28,
 'in': 29,
 'robots': 30,
 'will': 31,
 'have': 32,
 'know': 33,
 'gossip': 34,
 'gossips': 35,
 'an': 36,
 'when': 37,
 'hi': 38,
 'sense': 39,
 'robot': 40,
 'favorite': 41,
 'ever': 42,
 'up': 43,
 'much': 44,
 'immortal': 45,
 'make': 46,
 'should': 47,
 'was': 48,
 'feel': 49,
 'mad': 50,
 'going': 51,
 'bad': 52,
 'soccer': 53,
 'making': 54,
 'be': 55,
 'does': 56,
 'did': 57,
 "what's": 58,
 'money': 59,
 'guns': 60,
 'baseball': 61,
 'play': 62,
 'sapient': 63,
 'language': 64,
 'sound': 65,
 'any': 66,
 'move': 67,
 'lie': 68,
 'that': 69,
 'number': 70,
 'why': 71,
 'sad': 72,
 'nice': 73,
 'anybody': 74,
 'history': 75,
 'wavelength': 76,
 'far':

## Preparing input data for the Decoder

The Decoder model is fed the preprocessed 'answers'. The preprocessing steps are similar to the ones above, with one extra step carried out first:
- Prepend the \<START\> tag to each answer sentence.
- Append the \<END\> tag to each answer sentence.
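
One detail worth sketching: the default `filters` of the Keras `Tokenizer` include `<` and `>`, so the tags end up in the vocabulary as the plain tokens `start` and `end`. The snippet below is a simplified imitation of that filtering step, not the library code; `simple_word_sequence` is a hypothetical name.

```python
# Same characters as the default `filters` argument of the Keras Tokenizer
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS):
    """Replace filtered characters with spaces, lowercase, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

tokens = simple_word_sequence("<START>Hello, world.<END>")
```

This is why the tags can be attached without surrounding spaces, and why the inference code later looks up `'start'` and `'end'` rather than `'<START>'` and `'<END>'`.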

In [10]:
ans = list()
for answer in answers:
    # The Tokenizer's default filters strip '<' and '>', so the tags become the tokens 'start' and 'end'
    ans.append("<START>" + answer + "<END>")

ans[0]

'<START>Artificial Intelligence is the branch of engineering and science devoted to constructing machines that think.<END>'

In [11]:
tokenizer1 = preprocessing.text.Tokenizer()
tokenizer1.fit_on_texts(ans)
tokenized_ans = tokenizer1.texts_to_sequences(ans)


length_list = list()
for token_seq in tokenized_ans:
    length_list.append(len(token_seq))
max_output_length = np.array(length_list).max()
print('answers max length is {}'.format(max_output_length))


padded_ans = preprocessing.sequence.pad_sequences(tokenized_ans, maxlen=max_output_length, padding='post')
decoder_input = np.array(padded_ans)
print('Decoder input data shape --> {}'.format(decoder_input.shape))


ans_word_dict = tokenizer1.word_index
num_ans_tokens = len(ans_word_dict) + 1
print("Number of answers tokens = {}".format(num_ans_tokens))

answers max length is 74
Decoder input data shape --> (523, 74)
Number of answers tokens = 1560


## Preparing target data for the Decoder

- Take a copy of tokenized_ans and modify it as follows:
    1. Remove the \<START\> tag that we added earlier
    2. Pad the result (padded_ans1) and convert it to one-hot vectors
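
The two steps above amount to shifting each answer one position to the left, so that at every time step the target is the token that follows the decoder input. A small plain-Python sketch, with hypothetical helpers `shift_and_pad` and `one_hot` and a made-up five-token vocabulary:

```python
def shift_and_pad(seq, max_len, pad_value=0):
    """Drop the first token (<START>) and pad on the right to max_len."""
    shifted = seq[1:]
    return shifted + [pad_value] * (max_len - len(shifted))

def one_hot(indices, num_classes):
    """Turn a list of class indices into a list of one-hot rows."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)] for idx in indices]

# Hypothetical answer: 1 = start, 3 and 4 = words, 2 = end
decoder_in = [1, 3, 4, 2, 0]                      # padded decoder input, max_len = 5
target = shift_and_pad([1, 3, 4, 2], max_len=5)   # next-token targets
target_onehot = one_hot(target, num_classes=5)

# Size of the dense one-hot target used in this notebook, in float32 bytes
target_bytes = 523 * 74 * 1560 * 4
```

Padding positions become one-hot rows with a 1 in column 0. With this notebook's shapes the dense target takes roughly 241 MB; integer targets with `sparse_categorical_crossentropy` are a common way to avoid that, though the notebook does not use them.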

In [12]:
from keras import utils

In [13]:
decoder_target = list()
for token_seq in tokenized_ans:
    decoder_target.append(token_seq[1:])
    
padded_ans1 = preprocessing.sequence.pad_sequences(decoder_target, maxlen=max_output_length, padding='post')

onehot_ans = utils.to_categorical(padded_ans1, num_ans_tokens)
decoder_target = np.array(onehot_ans)
print("Decoder target data shape --> {}".format(decoder_target.shape))

Decoder target data shape --> (523, 74, 1560)


In [14]:
onehot_ans[0]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)

## Defining Model
### Sequential Model

In [15]:
from keras.models import Sequential, Model
from keras import layers

In [16]:
# Reference only: a stacked-LSTM Sequential model (timesteps and data_dim are placeholders)
# model = Sequential()
# model.add(layers.LSTM(32, return_sequences=True, input_shape=(timesteps, data_dim)))
# model.add(layers.LSTM(32, return_sequences=True))
# model.add(layers.LSTM(32))
# model.add(layers.Dense(10, activation='softmax'))

# model.compile(loss='categorical_crossentropy',
#               optimizer='rmsprop',
#               metrics=['accuracy'])

In [19]:
## Masking and Padding

encoder_inputs = layers.Input(shape=(None,))
encoder_embedding = layers.Embedding(num_questions_tokens, 256, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = layers.LSTM(128, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_inputs = layers.Input(shape=(None,))
decoder_embedding = layers.Embedding(num_ans_tokens, 256, mask_zero=True)(decoder_inputs)
decoder_lstm = layers.LSTM(128, return_state=True, return_sequences=True)
decoder_outputs, d_state_h, d_state_c = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = layers.Dense(num_ans_tokens, activation='softmax')
output = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 256)    120576      input_3[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 256)    399360      input_4[0][0]                    
_______________________________________________________________________________________
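
The two embedding counts in the summary can be checked by hand, and the rows that are cut off can be estimated the same way. The sketch below uses the standard parameter formulas for Embedding, LSTM, and Dense layers; the LSTM, Dense, and total figures are computed here and are not taken from the notebook output.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

enc_emb = embedding_params(471, 256)    # num_questions_tokens x 256
dec_emb = embedding_params(1560, 256)   # num_ans_tokens x 256
enc_lstm = lstm_params(256, 128)
dec_lstm = lstm_params(256, 128)
dense = dense_params(128, 1560)
total = enc_emb + dec_emb + enc_lstm + dec_lstm + dense
```

`enc_emb` and `dec_emb` reproduce the 120576 and 399360 shown in the summary; under these formulas the decoder embedding and the output Dense layer hold most of the weights.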

In [20]:
model.fit([encoder_input, decoder_input], decoder_target, batch_size=128, epochs=300)
model.save('chat_model.h5')

Epoch 1/300
Epoch 2/300
...
Epoch 77/300
Epoch 78

<br>

### Saving the Encoder and Decoder models separately

In [54]:
encoder_model = Model(encoder_inputs, encoder_states)

In [55]:
encoder_model.save('chatbot_encoder_model.h5')

In [56]:
decoder_state_input_h = layers.Input(shape=(128,))
decoder_state_input_c = layers.Input(shape=(128,))

decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, 
                                                 initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs]+decoder_states_inputs, 
                      [decoder_outputs]+decoder_states)

In [57]:
decoder_model.save('chatbot_decoder_model.h5')

In [58]:
from keras.models import load_model

In [59]:
encoder_model = load_model("chatbot_encoder_model.h5", compile=False)
decoder_model = load_model("chatbot_decoder_model.h5", compile=False)

OSError: 

In [60]:
def str_to_tokens(sentence:str):
    # Reuse the fitted question tokenizer so inference matches training:
    # it lowercases, strips punctuation, and drops out-of-vocabulary words
    # instead of raising KeyError
    tokens_list = tokenizer.texts_to_sequences([sentence])[0]
    
    return preprocessing.sequence.pad_sequences([tokens_list],
                                               maxlen=max_input_length, padding='post')

<br>

### Chatbot Test

In [61]:
encoder_input.shape[0]

523

In [62]:
states_values = encoder_model.predict(str_to_tokens(input('Enter eng sentence: ')))
# Seed the decoder with the 'start' token (the Tokenizer stored <START> as 'start')
empty_target_seq = np.zeros((1,1))
empty_target_seq[0,0] = ans_word_dict['start']
stop_condition = False
decoded_translation = ''
num_steps = 0

while not stop_condition:
    dec_outputs, h, c = decoder_model.predict([empty_target_seq]+states_values)
    sampled_word_index = int(np.argmax(dec_outputs[0,0,:]))
    # index_word is the inverse of word_index; index 0 (padding) has no word
    sampled_word = tokenizer1.index_word.get(sampled_word_index)
    if sampled_word is not None:
        decoded_translation += ' {}'.format(sampled_word)
    num_steps += 1
            
    # Stop at 'end', or after max_output_length steps so the loop always terminates
    if sampled_word == 'end' or num_steps >= max_output_length:
        stop_condition = True
        
    # Feed the predicted token and the updated states back into the decoder
    empty_target_seq = np.zeros((1,1))
    empty_target_seq[0,0] = sampled_word_index
    states_values = [h,c]
    
print(decoded_translation)

Enter eng sentence:  Joke


 you think i am the internet end
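
The loop in the cell above is greedy decoding: at each step the most probable token is picked and fed back in together with the updated state. The sketch below isolates that control flow with a toy step function; `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` merely stands in for `decoder_model.predict`.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, max_len):
    """Feed the previous token back in until end_id or max_len steps."""
    token, state, output = start_id, init_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output

# Toy stand-in for the decoder: always prefers the index token + 1
def toy_step(token, state):
    probs = [0.0] * 6
    probs[min(token + 1, 5)] = 1.0
    return probs, state + 1

decoded = greedy_decode(toy_step, init_state=0, start_id=1, end_id=5, max_len=10)
capped = greedy_decode(toy_step, init_state=0, start_id=1, end_id=99, max_len=3)
```

Unlike the notebook cell, this variant leaves the end token out of the result, which is why the printed reply above finishes with the word `end`. The `max_len` cap guarantees termination even if the end token is never predicted.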


In [48]:
set(questions)

{'1 dollar',
 'A spinning disk, in which the orientation of this axis is unaffected by tilting or rotation of the mounting, is called what?',
 'ARE YOU A FOOTBALL',
 'Are you amused',
 'Are you ashamed',
 'Are you experiencing an energy shortage?',
 'Are you glad',
 'Are you intoxicated',
 'Are you jealous',
 'Are you sad',
 'Are you sapient?',
 'Are you sentient?',
 'Are you stupid',
 'Bend over',
 'Can you breathe',
 'Can you control',
 'Can you die',
 'Can you go',
 'Can you malfunction',
 'Can you mate',
 'Can you move',
 'Can you walk',
 'DO YOU KNOW BASKETBAL',
 'DO YOU PLAY BASKETBALL',
 'DO YOU PLAY SOCCER',
 'Do know any jokes',
 'Do not lie',
 'Do not worry',
 'Do you ever get angry',
 'Do you ever get bored',
 'Do you ever get lonely',
 'Do you ever get mad',
 'Do you feel emotions',
 'Do you feel pain',
 'Do you feel scared',
 'Do you get embarrassed',
 'Do you get mad',
 'Do you hate anyone',
 'Do you have any brothers',
 'Do you wish you could eat food?',
 'Does it make y