# Seq2Seq Chatbot with Attention

#### Members' Names: Bilal Majeed and Sandhya Prakash

#### Members' Emails: bmajeed@ryerson.ca and ksandhya@ryerson.ca

# Introduction:

#### Problem Description:

Chatbots were traditionally built from hand-written rules, which made them static. Recent advances in deep learning and natural language processing (NLP) have changed this. For example, sequence-to-sequence (seq2seq) models use recurrent neural networks to solve complex language problems. The general architecture consists of an encoder and a decoder, each a stack of LSTM or GRU layers. However, seq2seq models struggle with longer sentences because all information from the input sentence must be encoded into a single fixed-length vector, which is then passed along step by step until the output is produced. As a result, knowledge of the input is gradually lost during prediction.

#### Context of the Problem:

Text generation is an important topic in NLP with many applications, e.g., machine translation, caption generation, speech-to-text, and chatbots. Solving an issue identified in one application of text generation is therefore beneficial, as the solution can often be applied to the others. For businesses, any model that is adopted needs to be effective and scalable. Rule-based models lack both effectiveness and scalability, while vanilla seq2seq models become less effective as input and output sizes increase. The seq2seq model works well, but the aim is to strengthen its performance on more complex tasks with longer input and output sequences.

#### Limitations of Other Approaches:

Rule-based models: A set of patterns and templates is written by hand, so that when a pattern is seen in the input, the chatbot replies with one of the matching templates. Pattern matching is brittle, and rule-based chatbots fail when no pattern is recognized. Writing the rules manually is also time-consuming and difficult.

Retrieval-based models: Similar to rule-based models, a set of responses is made available to the chatbot. Based on the input, the chatbot selects the most appropriate response from this set. This avoids grammatical errors in the replies, but it fails when unseen inputs are provided.

Vanilla seq2seq models: A response is generated word by word based on the input, so the generated responses can include grammatical errors. Once trained, these models handle previously unseen inputs well. However, there is a bottleneck: it is difficult to encode the entire input sequence into a single fixed-length vector.

#### Solution:

To improve the seq2seq model and make it more robust, during training, an attention mechanism is included to gather context for each input word, allowing the decoder to have more information about each part of the input sentence. To generate the next word in the output, the context from the user input and the generated output so far is used to calculate the output with "attention". 

# Background
| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Sutskever et al. [2] | Utilized a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. | WMT'14 English to French | Degradation in performance after an input length of 35 |
| Abonia Sojasingarayar [3] | Trained an attention-based sequence-to-sequence model using LSTM to predict an output sentence given a user input | Cornell Movie Subtitle Corpus | Worked well, but limited processing power caused issues in training and hyperparameter optimization |
| Luong et al. [4] | Discussed a global approach that always attends to all source words and a local one that only looks at a subset of source words at a time | WMT'14 English to French | Compared to the alignment visualizations in (Bahdanau et al., 2015), alignment patterns are not as sharp |

The paper in [3] suggested Luong attention as a next step, so instead of the attention mechanism described there, we used the global attention mechanism described in [4].

# Methodology

This paper focuses on an encoder-decoder model with an attention mechanism that allows the decoder to selectively look at the input sequence while decoding. This helps counter the bottleneck mentioned in the limitations of vanilla seq2seq models above. The vanilla seq2seq chatbot accepts an input from a user, cleans the input, encodes it using the trained model, and decodes the hidden states to produce a predicted reply.

#### Data Preparation

The data consists of 2 files: movie lines and groups of conversations using those lines. Rows are separated by a newline character and columns by '+++$+++'. 
- Split the movie_lines file and store it as a list of rows with all available columns
- Data size is limited to the first 40000 rows due to hardware constraints, stored in a dictionary of (line_id, line) pairs
- Conversation lines are stored in a dictionary of (conversation_id, list of line IDs) pairs
- Generate **model input**  data using conversations, example:
    - Conversation: Lines 147, 148, 149
    - Input: Lines 147, 148
    - Output: Lines 148, 149
    - Add **'BOS'** and **'EOS'** tags to the output lines 
    - The inputs and outputs are cleaned to get the best possible vocabulary
    - Inputs and outputs are converted to sequences: ["Hello"] --> [149] if 'Hello' is at index 149 in the vocabulary
    - Inputs and outputs are padded to the calculated max length of inputs and outputs: [149, 0, 0, 0] if max length is 4
- Generate **model output** data:
    - Remove **'BOS'** tag, convert to sequences, and pad to the max length of outputs
    - Use to_categorical to turn the sequences into a one-hot matrix: [149, ...] --> [[0, ..., 1, 0, ..., 0], ...] (1 at index 149 of the list)
    - The tag is removed so that the model learns that given **'BOS'** in the decoder, the output word is the next word in the sentence
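
The preparation steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the three conversation lines are made up, and the helpers `make_pairs`, `build_vocab`, `to_padded`, and `one_hot` are hypothetical stand-ins for the Keras `Tokenizer`, `pad_sequences`, and `to_categorical` calls used later in this notebook.

```python
def make_pairs(conversation):
    """Pair each line with the line that follows it."""
    return list(zip(conversation[:-1], conversation[1:]))


def build_vocab(sentences):
    """Assign word indices starting at 1; index 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_padded(sentence, vocab, max_len):
    """Convert words to indices and right-pad with zeros."""
    ids = [vocab[word] for word in sentence.split()]
    return ids + [0] * (max_len - len(ids))


def one_hot(ids, vocab_size):
    """Turn a list of indices into rows with a single 1 at each index."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in ids]


# hypothetical conversation of three already-cleaned lines
conversation = ["hello", "hi there", "how are you"]
pairs = make_pairs(conversation)
inputs = [question for question, _ in pairs]
tagged = ["<bos> " + answer + " <eos>" for _, answer in pairs]

vocab = build_vocab(inputs + tagged)
max_out = max(len(sentence.split()) for sentence in tagged)

# decoder input keeps <bos>; the target drops it, shifting everything one step left
decoder_input = [to_padded(s, vocab, max_out) for s in tagged]
decoder_target = [to_padded(" ".join(s.split()[1:]), vocab, max_out) for s in tagged]
target_matrix = one_hot(decoder_target[0], len(vocab) + 1)
```

At every position the target holds the word that follows the corresponding decoder input word, which is what lets the decoder learn next-word prediction.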

#### Model Building (Encoder-Decoder)

The model used is an encoder-decoder model with LSTM layers and an attention mechanism.

![Seq2Seq](Artifacts/LSTM_encoder_decoder.png "Seq2Seq Model")
- Encoder:
    - Embedding and LSTM layers with 256 units outputting overall sequence output and encoder states (thought vector in figure above)
- Decoder:
    - Embedding and LSTM layers with 256 units outputting overall sequence output and decoder states
    - Encoder states from the encoder are used as initial states of the decoder 

#### Model Building (Attention)
![Global Attention](Artifacts/seqseq_globalattention.png "Global Attention")
- The dot product of the decoder outputs and the encoder's full sequence output, passed through a softmax, generates the attention vector (global align weights in figure above)
    - This vector helps the current decoder state keep track of the overall input
- The dot product of the attention vector (global align weights) and the encoder outputs generates the context vector (context vector in figure above)
- The context vector and the decoder output are then concatenated and used as input to a dense layer that produces the final output vector of vocabulary size
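
For a single decoder step, the three bullets above amount to a few lines of arithmetic. The following is a minimal sketch in plain Python, not the Keras implementation used in this notebook; the two-dimensional vectors are made-up values and `global_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def global_attention(decoder_output, encoder_outputs):
    """Return (align_weights, context, concatenated) for one decoder step."""
    # one dot-product score per encoder time step
    scores = [dot(decoder_output, h) for h in encoder_outputs]
    weights = softmax(scores)
    # context vector: weighted sum of the encoder outputs
    dim = len(decoder_output)
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    # concatenation [context, decoder_output] is what feeds the dense layer
    return weights, context, context + decoder_output


# made-up values: three encoder time steps, hidden size 2
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_output = [2.0, 0.0]
weights, context, combined = global_attention(decoder_output, encoder_outputs)
```

Here the first and third encoder steps score highest against the decoder output, so they receive equal and larger weights than the second step, and the context vector leans toward them.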

#### Inference Model and Predictions
- Requires building separate inference models from the layers of the trained model:
    - Encoder uses the encoder inputs and outputs the encoder states defined in the model building
    - Decoder uses the current states (starts with encoder states) with the current word (starts with 'BOS') and outputs the decoder states
- Predictions use the outputs from the inference encoder and decoder model:
    - Runs in a loop until the user enters 'quit'
    - Preprocess user given input and run encoder to get encoder states
    - Set current word to 'BOS' and preprocess current word 
    - Run generated prediction loop until 'EOS' is seen or max length of output is reached
        - Input current word and current states to decoder model to get decoder states
        - Apply the same attention mechanism defined in the model
        - Use dense layer from trained model for final prediction
        - Find predicted word in vocabulary and append to predicted output
        - Set current word to predicted word, and set current state to decoder states
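
The prediction loop described above is a greedy decoder: at each step it feeds the previous word and state back in and keeps the single most likely next word. A minimal sketch of that control flow follows. The names `greedy_decode`, `toy_step`, and `SCRIPT` are hypothetical, and the lookup table merely stands in for the trained decoder, attention, and dense layer.

```python
def greedy_decode(step_fn, initial_state, bos="<bos>", eos="<eos>", max_len=10):
    """Generate words one at a time until eos is produced or max_len is reached."""
    word, state, reply = bos, initial_state, []
    for _ in range(max_len):
        # step_fn plays the role of decoder + attention + dense layer
        word, state = step_fn(word, state)
        if word == eos:
            break
        reply.append(word)
    return " ".join(reply)


# stand-in for the trained model: a fixed next-word table (assumption)
SCRIPT = {"<bos>": "hi", "hi": "there", "there": "<eos>"}


def toy_step(word, state):
    """Return the scripted next word and a dummy updated state."""
    return SCRIPT[word], state + 1


reply = greedy_decode(toy_step, initial_state=0)
print(reply)  # hi there
```

The `max_len` cap matters in practice: without it, a model that never emits the end tag would loop forever, which is why the loop is also bounded by the maximum output length.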

# Implementation

#### Import Libraries

In [1]:
# preprocessing libraries
import codecs 
import re
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# network libraries
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM, TimeDistributed
from keras.layers import dot
from keras.layers import concatenate
from keras.layers import Activation
from keras.layers import Embedding
from keras.layers import Dropout

#### Load Lines Using Data File

In [2]:
def load_file ():
    data = []
    
    with codecs.open("movie_lines.txt", "rb", encoding = "utf-8", errors = "ignore") as f:
        # split rows by new line
        rows = f.read().split("\n")
        for row in rows:
            # split columns by '+++$+++'
            data.append(row.split(" +++$+++ "))
            
    return data

data = load_file()
total_lines = len(data)
print(f"Total lines in dataset: {total_lines}")

def load_lines (data):
    sentences = {}

    for row in data[:40000]:
        # check if all columns are available
        if len(row) > 4:
            # use line number as key and sentence as value    
            sentences[int(row[0][1:])] = row[4]
            
    # sort dictionary by line number
    return dict(sorted(sentences.items()))

lines = load_lines(data)
total_lines = len(lines)
print(f"Total selected lines from dataset: {total_lines}")

Total lines in dataset: 304714
Total selected lines from dataset: 40000
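
To make the column handling concrete, here is a small sketch that parses one row the same way `load_lines` does, taking the line number from the first column and the sentence from the fifth. The sample row is made up for illustration, and `parse_movie_line` is a hypothetical helper, not part of the notebook.

```python
SEPARATOR = " +++$+++ "


def parse_movie_line(row):
    """Return (line_number, sentence), or None if the row lacks columns."""
    fields = row.split(SEPARATOR)
    if len(fields) < 5:
        return None
    # the first column looks like 'L147'; drop the leading 'L'
    return int(fields[0][1:]), fields[4]


# made-up row in the five-column layout assumed by the code above
sample = "L147 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there."
parsed = parse_movie_line(sample)
print(parsed)  # (147, 'Hello there.')
```

Rows with missing columns, such as the empty string left after the final newline, return `None` and can be skipped.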


#### Load Conversations Using Data File

In [3]:
def load_conversations ():
    conversations = {}
    conversation_number = 1
    
    with codecs.open("movie_conversations.txt", "rb", encoding="utf-8", errors="ignore") as f:
        # split rows by new line
        rows = f.read().split("\n")
        for row in rows:
            # split columns by '+++$+++', only get lines per conversation, remove "[" and "]"
            conversation_ids = row.split(" +++$+++ ")[-1][1:-1]
            line_ids = []
            # for each line in conversation
            for line_id in conversation_ids.split(","):
                # remove extra quotes
                line_id = line_id.replace("'", "").strip()
                line_ids += [line_id[1:]]
                
            # store list of lines
            conversations[conversation_number] = line_ids
            conversation_number += 1
            
    return conversations

conversation_dictionary = load_conversations()
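
The last column of each conversation row is written like a Python list of quoted line IDs, so it can also be parsed with the standard library's `ast.literal_eval` instead of manual bracket and quote stripping. This is an alternative sketch, not the method used above; the sample row is made up, `parse_conversation` is a hypothetical helper, and unlike `load_conversations` it returns integers rather than strings.

```python
import ast

SEPARATOR = " +++$+++ "


def parse_conversation(row):
    """Return the line numbers of one conversation as integers."""
    raw_ids = row.split(SEPARATOR)[-1]
    # literal_eval safely evaluates the list literal, e.g. ['L194', 'L195']
    return [int(line_id[1:]) for line_id in ast.literal_eval(raw_ids)]


# made-up row in the layout assumed by the code above
sample = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196']"
line_numbers = parse_conversation(sample)
print(line_numbers)  # [194, 195, 196]
```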

#### Combine Lines and Conversations to Generate Input and Output

In [4]:
def generate_inputs_outputs (conversations_dictionary, lines, maxlen):
    inputs = []
    outputs = []
    
    for current_conv in conversations_dictionary.values():
        # for conversations with more than two lines, drop the final line
        if len(current_conv) > 2:
            current_conv = current_conv[:-1]
        # pair each line with the line that follows it
        for i in range(0, len(current_conv)):
            if len(current_conv) - i > 1:
                # add to inputs and outputs if conversation is in selected lines
                # to reduce output and input length, a cutoff is provided
                try:
                    if len(lines[int(current_conv[i])].split()) <= maxlen and \
                        len(lines[int(current_conv[i + 1])].split()) <= maxlen:
                        inputs += [lines[int(current_conv[i])]]
                        outputs += [lines[int(current_conv[i + 1])]]
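                except (KeyError, ValueError):
                    # skip pairs whose lines are not in the selected subset or whose id is invalid
                    continue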
    
    return inputs, outputs

inputs, outputs = generate_inputs_outputs(conversation_dictionary, lines, 20)

#### Clean Lines and Add Tags

In [5]:
def replace_text (sentence):
    sentence = sentence.lower()
    
    sentence = re.sub(r"i'm", "i am", sentence)
    sentence = re.sub(r"he's", "he is", sentence)
    sentence = re.sub(r"she's", "she is", sentence)
    sentence = re.sub(r"it's", "it is", sentence)
    sentence = re.sub(r"that's", "that is", sentence)
    sentence = re.sub(r"what's", "what is", sentence)
    sentence = re.sub(r"where's", "where is", sentence)
    sentence = re.sub(r"how's", "how is", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"won't", "will not", sentence)
    sentence = re.sub(r"can't", "cannot", sentence)
    sentence = re.sub(r"c'mon", "come on", sentence)
    sentence = re.sub(r"n't", " not", sentence)
    sentence = re.sub(r"n'", "ng", sentence)
    sentence = re.sub(r"'bout", "about", sentence)
    sentence = re.sub(r"'til", "until", sentence)
    sentence = re.sub(r"  ", " ", sentence)
    sentence = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", sentence)
    
    return sentence

def preprocess_inputs_outputs (inputs, outputs):
    
    for i in range(len(inputs)):
        inputs[i] = replace_text(inputs[i])
    # add <BOS> and <EOS> tags for output to keep track of start and end
    for i in range(len(outputs)):
        outputs[i] = '<BOS> ' + replace_text(outputs[i]) + ' <EOS>'
    
    return inputs, outputs
    
inputs, outputs = preprocess_inputs_outputs(inputs, outputs)
print(f"Inputs in dataset: {len(inputs)}")
print(f"Outputs in dataset: {len(outputs)}")

Inputs in dataset: 17635
Outputs in dataset: 17635
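
The order of the substitutions in `replace_text` matters: specific contractions such as "won't" have to be expanded before the generic "n't" rule, otherwise "won't" becomes "wo not". The sketch below shows this with a reduced rule list. `RULES` and `normalize` are hypothetical names, and only a subset of the rules above is included.

```python
import re

# ordered rules: specific contractions first, generic suffix next, punctuation last
RULES = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"it's", "it is"),
    (r"what's", "what is"),
    (r'[-()"#/@;:<>{}`+=~|.!?,]', ""),
]


def normalize(sentence):
    """Lowercase the sentence and apply each rule in order."""
    sentence = sentence.lower()
    for pattern, replacement in RULES:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence


print(normalize("I won't go, it's late!"))   # i will not go it is late
print(re.sub(r"n't", " not", "won't"))       # wo not (generic rule applied too early)
```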


#### Transform Sentences to Padded Sequences

In [6]:
# give symbols and numbers that are not needed to the tokenizer
filtering_pattern = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n\'0123456789'
tokenizer = Tokenizer(filters = filtering_pattern)
# this will generate vocab on all inputs and outputs
tokenizer.fit_on_texts(inputs + outputs)
vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary size: {vocab_size}")

# tokenize and pad inputs
tokenized_inputs = tokenizer.texts_to_sequences(inputs)
maxlen_inputs = max([len(x) for x in tokenized_inputs])
print(f"Maximum Input Length: {maxlen_inputs}")
encoder_input_data = pad_sequences(tokenized_inputs, maxlen = maxlen_inputs, padding = 'post')

# tokenize and pad outputs
tokenized_outputs = tokenizer.texts_to_sequences(outputs)
maxlen_outputs = max([len(x) for x in tokenized_outputs])
print(f"Maximum Output Length: {maxlen_outputs}")
decoder_input_data = pad_sequences(tokenized_outputs, maxlen = maxlen_outputs, padding = 'post')

# remove '<BOS>' from every output
# this shifted sequence is the prediction target: given '<BOS>', the model should predict the next word
for i in range(len(tokenized_outputs)):
    tokenized_outputs[i] = tokenized_outputs[i][1:]
# pad and create a matrix based on vocabulary
padded_outputs = pad_sequences(tokenized_outputs, maxlen = maxlen_outputs, padding = 'post')
decoder_output_data = to_categorical(padded_outputs, vocab_size)

Vocabulary size: 11903
Maximum Input Length: 25
Maximum Output Length: 27
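
The one-hot target matrix is by far the largest array in this pipeline, which helps explain the hardware constraints mentioned earlier. The back-of-the-envelope estimate below uses the sizes printed above and assumes 4-byte values for both representations. Keeping the targets as integer indices and training with Keras's `sparse_categorical_crossentropy` loss is one possible way to avoid the dense matrix, although that was not tried in this project.

```python
def one_hot_bytes(samples, steps, vocab_size, bytes_per_value=4):
    """Memory needed for a dense one-hot target tensor."""
    return samples * steps * vocab_size * bytes_per_value


# sizes taken from the printed outputs above
samples, steps, vocab_size = 17635, 27, 11903

dense_bytes = one_hot_bytes(samples, steps, vocab_size)
# integer targets need one 4-byte label per time step instead of a full row
sparse_bytes = samples * steps * 4

print(f"one-hot targets: {dense_bytes / 1e9:.1f} GB")   # one-hot targets: 22.7 GB
print(f"integer targets: {sparse_bytes / 1e6:.1f} MB")  # integer targets: 1.9 MB
```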


#### Seq2Seq Model with Attention for Training

In [7]:
# encoder model with LSTM layer with 256 units, returning outputs and states
enc_inputs = Input(shape=(None,))
enc_embedding = Embedding(vocab_size, 256, mask_zero = True)(enc_inputs)
enc_lstm = LSTM(256, return_sequences = True, return_state = True)
enc_outputs, enc_state_h, enc_state_c = enc_lstm(enc_embedding)
enc_states = [enc_state_h, enc_state_c]

# decoder model with LSTM layer with 256 units, returning outputs and states
dec_inputs = Input(shape = (None,))
dec_embedding = Embedding(vocab_size, 256, mask_zero = True)(dec_inputs)
dec_lstm = LSTM(256, return_state = True, return_sequences = True)
dec_outputs, dec_state_h, dec_state_c = dec_lstm(dec_embedding, initial_state = enc_states)

# connect decoder and encoder outputs to generate an attention vector with softmax applied
attention = dot([dec_outputs, enc_outputs], axes = [2, 2])
attention = Activation('softmax')(attention)
# connect the attention vector and encoder outputs to generate context vector
context = dot([attention, enc_outputs], axes = [2, 1])
# connect context vector and decoder outputs to use as input for dense layer
dec_attention_outputs = concatenate([context, dec_outputs])

# use decoder outputs concatenated with the context vector to output final prediction
dec_dense = TimeDistributed(Dense(vocab_size, activation = 'softmax'))
output = dec_dense(dec_attention_outputs)

# generate and compile model
model = Model([enc_inputs, dec_inputs], output)
model.compile(optimizer = "adam", loss = 'categorical_crossentropy', metrics = ["accuracy"])

print(model.summary())

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 256)    3047168     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 256)    3047168     input_2[0][0]                    
______________________________________________________________________________________________

In [8]:
history = model.fit([encoder_input_data, decoder_input_data], decoder_output_data, 
                    batch_size = 128, epochs = 30)

Epoch 1/30
...
Epoch 30/30


#### Seq2Seq Model for Inference using Trained Model

In [9]:
# encoder model using inputs and outputs defined above
enc_model = Model(enc_inputs, [enc_outputs, enc_states])

# set inputs to receive the encoder states output by enc_model
decoder_initial_state_h = Input(shape = (256, ))
decoder_initial_state_c = Input(shape = (256, ))
decoder_initial_states = [decoder_initial_state_h, decoder_initial_state_c]
# the trained lstm layer is reused with the decoder_initial_states inputs because the states change 
# as predictions move forward
decoder_outputs, decoder_state_h, decoder_state_c = dec_lstm(dec_embedding , initial_state = decoder_initial_states)
decoder_states = [decoder_state_h, decoder_state_c]
# decoder model using the lstm layer from the trained model with the variables defined above
dec_model = Model([dec_inputs, decoder_initial_states], [decoder_outputs] + decoder_states)

#### Sample Chatbot using Inference Model with Attention

In [14]:
while True:
    # take user input
    userinput = input('You:')
    # if input is 'quit', return 'bye'
    if userinput == 'quit':
        print('Bot: ' + 'Bye')
        break
    
    # preprocess user input
    userinput = replace_text(userinput)
    userinput = tokenizer.texts_to_sequences([userinput])
    userinput = pad_sequences(userinput, maxlen = maxlen_inputs, padding = 'post')
    
    # return encoder outputs and states for given user input
    encoder_outputs, encoder_states = enc_model.predict(userinput)
    
    # initialize current word as '<bos>' and encoder states as initial input states
    current_word = '<bos>'
    current_input_states = encoder_states
    predicted_output = ''

    for i in range(maxlen_outputs):
        # transform current word to vocab index
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = tokenizer.word_index[current_word]
        
        # return decoder outputs and states for current word and current states
        decoder_outputs, decoder_state_h, decoder_state_c = dec_model.predict([empty_target_seq] + current_input_states)
        
        # derive context and predict next word using decoder output with attention
        attention_output = dot([decoder_outputs, encoder_outputs], axes=[2,2])
        attention_output = Activation('softmax')(attention_output)
        context_output = dot([attention_output, encoder_outputs], axes=[2,1])
        dec_attention = concatenate([context_output, decoder_outputs])
        final_output1 = dec_dense(dec_attention)

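        # find the predicted word using the tokenizer's reverse index
        # index 0 is padding and has no word, so it is treated as end of sentence
        predicted_idx = int(np.argmax(final_output1))
        next_word = tokenizer.index_word.get(predicted_idx, '<eos>')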

        # change the current word and current states
        current_word = next_word
        current_input_states = [decoder_state_h, decoder_state_c]

        # break if '<eos>' is predicted
        if next_word == '<eos>':
            break
        else:
            predicted_output += next_word + ' '

    print("Bot: " + predicted_output)

You: Hi


Bot: hi 


You: How are you?


Bot: scarred for life for a little 


You: Why scarred?


Bot: because you are a policeman 


You: Where are you from?


Bot: helsinki which is the capital of finland 


You: Where are you going tonight?


Bot: i am going to see the other side of the hotel 


You: Why are you going to see the other side of the hotel?


Bot: well we will see the other room or the room or the room are you 


You: why are you acting strange?


Bot: i am not in the middle of your life 


You: okay you are free to go


Bot: look up there later you are going to be a big evening 


You: What big evening?


Bot: what do you mean 


You: What do you mean?


Bot: i do not know that is wrong with you 


You: Something is wrong with you


Bot: what do you mean 


You: Nothing


Bot: i am going to be a little girl in the spears 


You: Thanks


Bot: you are welcome to be a crook do you want to be a crook 


You: No I don't want to be a crook


Bot: i do not think i do not think i do not think i do not think i do not want to do 


You: now you are just not making sense


Bot: i am not a little way for you 


You: bye


Bot: yeah huh 


You: quit


Bot: Bye


# Conclusion and Future Direction

During this project, we learned that a seq2seq model with attention requires a lot of data and processing power to train. Despite the attention mechanism, the limited training time and the small subset of data selected did not allow the model to perform well. Furthermore, no hyperparameter tuning was done because of hardware limitations. Our goal with this setup was to get the model to produce "proper" sentences, which it was able to do to some extent. Also, because the model was trained on movie lines, it is nearly impossible to obtain a natural, human-like conversation from the chatbot.

#### Future Direction
- Use better hardware to train on all available data for more epochs, or learn to train the model progressively
- Obtain realistic training data, not movie lines
- Use pretrained BERT embedding weights instead of training our own embeddings
- Learn and attempt to implement a hierarchical neural attention encoder
- Test a Transformer-based chatbot and compare results

# References:

[1]: Dataset and information about the Cornell Movie-Dialogs Corpus; Available: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

[2]: Sequence to Sequence Learning with Neural Networks; Ilya Sutskever, Oriol Vinyals, Quoc V. Le; [Paper] Available: https://arxiv.org/pdf/1409.3215v3.pdf

[3]: Seq2Seq AI Chatbot with Attention Mechanism; Abonia Sojasingarayar; [Paper] Available: https://arxiv.org/ftp/arxiv/papers/2006/2006.02767.pdf

[4]: Effective Approaches to Attention-based Neural Machine Translation; Minh-Thang Luong, Hieu Pham, Christopher D. Manning; [Paper] Available: https://arxiv.org/pdf/1508.04025.pdf