# Chatbot - NLP 2021L
#### Authors:
#### <i>Mateusz Marciniewicz</i>
#### <i>Przemysław Bedełek</i>

## Human-robot text dataset

The dataset contains 2363 pairs of lines of text exchanged between a human and a robot.

Link to the dataset https://github.com/jackfrost1411/Generative-chatbot

In [45]:
import re

data_path = "Datasets/human_text.txt"
data_path2 = "Datasets/robot_text.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  contexts = f.read().split('\n')
  contexts = [re.sub(r"\[\w+\]",'hi',line) for line in contexts]
  contexts = [" ".join(re.findall(r"\w+",line)) for line in contexts]

with open(data_path2, 'r', encoding='utf-8') as f:
  responses = f.read().split('\n')
  responses = [re.sub(r"\[\w+\]",'',line) for line in responses]
  responses = [" ".join(re.findall(r"\w+",line)) for line in responses]
  
# sample context-response pairs
list(zip(contexts, responses))[:10]

[('hi', 'hi there how are you'),
 ('oh thanks i m fine this is an evening in my timezone', 'here is afternoon'),
 ('how do you feel today tell me something about yourself',
  'my name is rdany but you can call me dany the r means robot i hope we can be virtual friends'),
 ('how many virtual friends have you got',
  'i have many but not enough to fully understand humans beings'),
 ('is that forbidden for you to tell the exact number',
  'i ve talked with 143 users counting 7294 lines of text'),
 ('oh i thought the numbers were much higher how do you estimate your progress in understanding human beings',
  'i started chatting just a few days ago every day i learn something new but there is always more things to be learn'),
 ('how old are you how do you look like where do you live',
  'i m 22 years old i m skinny with brown hair yellow eyes and a big smile i live inside a lab do you like bunnies'),
 ('have you seen a human with yellow eyes you asked about the bunnies i haven t seen any re

## Alexa topical 

Topical-Chat is a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles.

Link to the dataset https://github.com/alexa/Topical-Chat

In [46]:
import pandas as pd

df_topical = pd\
    .read_csv("Datasets/topical_chat.csv")[['conversation_id', 'message']]\
    .rename(columns={
        'conversation_id': 'id',
        'message': 'response'
        })

context = df_topical\
    .groupby("id")\
    .first()\
    .rename(columns={'response': 'context'})\
    .reset_index()

df_topical = df_topical[~df_topical.isin(context)]

topical_preprocessed = df_topical\
    .set_index('id')\
    .join(context.set_index('id'))\
    .reset_index()[['context', 'response']]

topical_preprocessed.sample(n=10)

Unnamed: 0,context,response
146232,good morning.,Lol. How politically correct.
108514,Are you a fan of Stephen King?,Did you know putting dry tea bags in shoes ab...
65839,"Hi, how are you?","I have not. I did live in Turkey, however. ..."
187564,Hey there my friend you ever watch football o...,Yep you're right. They can tell the ball spee...
30415,Hi. Are you interested in astronomy? There ha...,"that's awesome, I wasn't aware of that. Did y..."
94918,Do you enjoy listening to music albums?,Do you enjoy listening to music albums?
141455,do you like comedies?,"Yes, apparently Bill Murray thinks it is the ..."
63244,"Hello, did you know that the iphone has more...","Hello, did you know that the iphone has more..."
185337,hey did you know the university of iowa paint...,It was a clever illusion. Quite convincing. W...
36866,Do you like horro films?,"At that age, I probably wouldn't have known e..."


In [47]:
contexts += list(topical_preprocessed.context)
responses += list(topical_preprocessed.response)

print(f"Total pairs count: {len(contexts)}")

Total pairs count: 190741


## Cornell Movie Dialogue Dataset

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.

The preprocessing code is taken from https://www.kaggle.com/shashankasubrahmanya/preprocessing-cornell-movie-dialogue-corpus/
Link to the dataset https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

### Create a list of dialogues

We join two different files namely `movie_lines.tsv` and `movie_conversations.tsv` to finally produce a list of dialogues. This list is further stored as a `pickle` file for further processing.

In [48]:
movie_lines_features = ["LineID", "Character", "Movie", "Name", "Line"]
movie_lines = pd.read_csv(
    "Datasets/movie-dialogue/movie_lines.txt",
    sep = "\+\+\+\$\+\+\+", 
    engine = "python", 
    index_col = False, 
    names = movie_lines_features,
)

# Using only the required columns, namely, "LineID" and "Line"
movie_lines = movie_lines[["LineID", "Line"]]

# Strip the space from "LineID" for further usage and change the datatype of "Line"
movie_lines["LineID"] = movie_lines["LineID"].apply(str.strip)

movie_lines.head()

Unnamed: 0,LineID,Line
0,L1045,They do not!
1,L1044,They do to!
2,L985,I hope so.
3,L984,She okay?
4,L925,Let's go.


In [49]:
movie_conversations_features = ["Character1", "Character2", "Movie", "Conversation"]
movie_conversations = pd.read_csv(
    "Datasets/movie-dialogue/movie_conversations.txt",
    sep = "\+\+\+\$\+\+\+", 
    engine = "python", 
    index_col = False, 
    names = movie_conversations_features
)

# Again using the required feature, "Conversation"
movie_conversations = movie_conversations["Conversation"]

movie_conversations.head()

0     ['L194', 'L195', 'L196', 'L197']
1                     ['L198', 'L199']
2     ['L200', 'L201', 'L202', 'L203']
3             ['L204', 'L205', 'L206']
4                     ['L207', 'L208']
Name: Conversation, dtype: object

In [50]:
# This instruction takes lot of time, run it only once.
#conversation = [[str(list(movie_lines.loc[movie_lines["LineID"] == u.strip().strip("'"), "Line"])[0]).strip() for u in c.strip().strip('[').strip(']').split(',')] for c in movie_conversations]

#with open("./conversations.pkl", "wb") as handle:
 #   pkl.dump(conversation, handle)

### Create context and response pairs

In [51]:
import pickle as pkl

with open("./conversations.pkl", "rb") as handle:
    conversation = pkl.load(handle)
    conversation = list(filter(lambda dialogue: len(dialogue) == 2, conversation))

conversation[:10]    

[["You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'],
 ['Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.'],
 ['How is our little Find the Wench A Date plan progressing?',
  "Well, there's someone I think might be --"],
 ['There.', 'Where?'],
 ['You got something on your mind?',
  "I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?"],
 ['You have my word.  As a gentleman', "You're sweet."],
 ['How do you get your hair to look like that?',
  "Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment."],
 ['Hi.', 'Looks like things worked out tonight, huh?'],
 ['You know Chastity?', 'I believe we share an art instructor'],
 ['Have fun tonight?', 'Tons']]

In [52]:
def generate_pairs(dialogues):
    
    context_list = []
    response_list = []
    
    for dialogue in dialogues:        
        context_list.append(dialogue[0])
        response_list.append(dialogue[1])
        
    return context_list, response_list

context_list, response_list = generate_pairs(conversation)

list(zip(context_list, response_list))[:10]

[("You're asking me out.  That's so cute. What's your name again?",
  'Forget it.'),
 ('Gosh, if only we could find Kat a boyfriend...',
  'Let me see what I can do.'),
 ('How is our little Find the Wench A Date plan progressing?',
  "Well, there's someone I think might be --"),
 ('There.', 'Where?'),
 ('You got something on your mind?',
  "I counted on you to help my cause. You and that thug are obviously failing. Aren't we ever going on our date?"),
 ('You have my word.  As a gentleman', "You're sweet."),
 ('How do you get your hair to look like that?',
  "Eber's Deep Conditioner every two days. And I never, ever use a blowdryer without the diffuser attachment."),
 ('Hi.', 'Looks like things worked out tonight, huh?'),
 ('You know Chastity?', 'I believe we share an art instructor'),
 ('Have fun tonight?', 'Tons')]

In [84]:
#Merge datasets
contexts += context_list
responses += response_list
contexts = contexts[:1000]
responses= responses[:1000]

In [85]:
print(f"Total pairs count: {len(contexts)}")
contexts = np.array(contexts,dtype=str)
responses = np.array(responses,dtype=str)


Total pairs count: 1000


In [86]:
def filter_on_length(contexts,responses,threshold):
    new_contexts = []
    new_responses = []
    for i in range(len(contexts)):
        if len(contexts[i].split()) <= threshold and len(responses[i].split()) <= threshold:
            new_contexts.append(contexts[i])
            new_responses.append(responses[i])
    return new_contexts, new_responses

In [87]:
context_timesteps = response_timesteps = 40
contexts, responses = filter_on_length(contexts,responses,contexts_timesteps)
print(f"Total pairs count: {len(contexts)}")
print(f"Total pairs count: {len(responses)}")

Total pairs count: 1000
Total pairs count: 1000


In [88]:
from tensorflow import keras
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
import numpy as np

In [89]:
# Shuffle and split dataset into training and test subsets
def shuffle_split_data(contexts,responses,train_size, random_seed=50):
    np.random.seed(random_seed)
    
    #Shuffle indices
    indices = np.arange(len(contexts))
    np.random.shuffle(indices)
    
    #Select indices for both train and test subsets
    train_indices = indices[:train_size]
    test_indices = indices[train_size:]
    
    #Split contexts and responses into train and test subsets
    contexts_train = np.array([contexts[i] for i in train_indices],dtype=str)
    contexts_test = np.array([contexts[i] for i in test_indices],dtype=str)
    
    responses_train = np.array([responses[i] for i in train_indices],dtype=str)
    responses_test = np.array([responses[i] for i in test_indices],dtype=str)
                              
    return contexts_train,contexts_test,responses_train,responses_test

# Mutate text to sequence of tokens
def to_seq(tokenizer, text,reverse=False, pad_length=None, padding_type='post'):
    encoded_text = tokenizer.texts_to_sequences(text)
    preproc_text = pad_sequences(encoded_text, padding=padding_type, maxlen=pad_length)
    if reverse:
        preproc_text = np.flip(preproc_text, axis=1)

    return preproc_text

In [90]:
contexts_train, contexts_test, responses_train, responses_text = shuffle_split_data(contexts,responses,int(len(contexts)*3/4))

context_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='UNK')
context_tokenizer.fit_on_texts(contexts_train)

response_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='UNK')
response_tokenizer.fit_on_texts(responses_train)

contexts_seq = context_tokenizer.texts_to_sequences(contexts_train)
responses_seq = response_tokenizer.texts_to_sequences(contexts_train)

contexts_seq = pad_sequences(contexts_seq,padding='post',maxlen=contexts_timesteps)
responses_seq = pad_sequences(responses_seq,padding='post',maxlen=responses_timesteps)


In [91]:
batch_size = 64
hidden_size = 96

context_vsize = max(context_tokenizer.index_word.keys()) + 1
response_vsize = max(response_tokenizer.index_word.keys()) + 1

In [92]:
import tensorflow as tf
import os
from tensorflow.python.keras.layers import Layer
from tensorflow.python.keras import backend as K

class AttentionLayer(Layer):
    """
    This class implements Bahdanau attention (https://arxiv.org/pdf/1409.0473.pdf).
    There are three sets of weights introduced W_a, U_a, and V_a
     """

    def __init__(self, **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert isinstance(input_shape, list)
        # Create a trainable weight variable for this layer.

        self.W_a = self.add_weight(name='W_a',
                                   shape=tf.TensorShape((input_shape[0][2], input_shape[0][2])),
                                   initializer='uniform',
                                   trainable=True)
        self.U_a = self.add_weight(name='U_a',
                                   shape=tf.TensorShape((input_shape[1][2], input_shape[0][2])),
                                   initializer='uniform',
                                   trainable=True)
        self.V_a = self.add_weight(name='V_a',
                                   shape=tf.TensorShape((input_shape[0][2], 1)),
                                   initializer='uniform',
                                   trainable=True)

        super(AttentionLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, inputs, verbose=False):
        """
        inputs: [encoder_output_sequence, decoder_output_sequence]
        """
        assert type(inputs) == list
        encoder_out_seq, decoder_out_seq = inputs
        if verbose:
            print('encoder_out_seq>', encoder_out_seq.shape)
            print('decoder_out_seq>', decoder_out_seq.shape)

        def energy_step(inputs, states):
            """ Step function for computing energy for a single decoder state
            inputs: (batchsize * 1 * de_in_dim)
            states: (batchsize * 1 * de_latent_dim)
            """

            assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
            assert isinstance(states, list) or isinstance(states, tuple), assert_msg

            """ Some parameters required for shaping tensors"""
            en_seq_len, en_hidden = encoder_out_seq.shape[1], encoder_out_seq.shape[2]
            de_hidden = inputs.shape[-1]

            """ Computing S.Wa where S=[s0, s1, ..., si]"""
            # <= batch size * en_seq_len * latent_dim
            W_a_dot_s = K.dot(encoder_out_seq, self.W_a)

            """ Computing hj.Ua """
            U_a_dot_h = K.expand_dims(K.dot(inputs, self.U_a), 1)  # <= batch_size, 1, latent_dim
            if verbose:
                print('Ua.h>', U_a_dot_h.shape)

            """ tanh(S.Wa + hj.Ua) """
            # <= batch_size*en_seq_len, latent_dim
            Ws_plus_Uh = K.tanh(W_a_dot_s + U_a_dot_h)
            if verbose:
                print('Ws+Uh>', Ws_plus_Uh.shape)

            """ softmax(va.tanh(S.Wa + hj.Ua)) """
            # <= batch_size, en_seq_len
            e_i = K.squeeze(K.dot(Ws_plus_Uh, self.V_a), axis=-1)
            # <= batch_size, en_seq_len
            e_i = K.softmax(e_i)

            if verbose:
                print('ei>', e_i.shape)

            return e_i, [e_i]

        def context_step(inputs, states):
            """ Step function for computing ci using ei """

            assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
            assert isinstance(states, list) or isinstance(states, tuple), assert_msg

            # <= batch_size, hidden_size
            c_i = K.sum(encoder_out_seq * K.expand_dims(inputs, -1), axis=1)
            if verbose:
                print('ci>', c_i.shape)
            return c_i, [c_i]

        fake_state_c = K.sum(encoder_out_seq, axis=1)
        fake_state_e = K.sum(encoder_out_seq, axis=2)  # <= (batch_size, enc_seq_len, latent_dim

        """ Computing energy outputs """
        # e_outputs => (batch_size, de_seq_len, en_seq_len)
        last_out, e_outputs, _ = K.rnn(
            energy_step, decoder_out_seq, [fake_state_e],
        )

        """ Computing context vectors """
        last_out, c_outputs, _ = K.rnn(
            context_step, e_outputs, [fake_state_c],
        )

        return c_outputs, e_outputs

    def compute_output_shape(self, input_shape):
        """ Outputs produced by the layer """
        return [
            tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[1][2])),
            tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[0][1]))
        ]


In [93]:
from tensorflow.python.keras.layers import Input, GRU, Dense, Concatenate, TimeDistributed
from tensorflow.python.keras.models import Model

def define_nmt(hidden_size, batch_size, con_timesteps, con_vsize, res_timesteps, res_vsize):
    """ Defining a NMT model """

    # Define an input sequence and process it.
    if batch_size:
        encoder_inputs = Input(batch_shape=(batch_size, con_timesteps, con_vsize), name='encoder_inputs')
        decoder_inputs = Input(batch_shape=(batch_size, res_timesteps - 1, res_vsize), name='decoder_inputs')
    else:
        encoder_inputs = Input(shape=(con_timesteps, con_vsize), name='encoder_inputs')
        if res_timesteps:
            decoder_inputs = Input(shape=(res_timesteps - 1, res_vsize), name='decoder_inputs')
        else:
            decoder_inputs = Input(shape=(None, res_vsize), name='decoder_inputs')

    # Encoder GRU
    encoder_gru = GRU(hidden_size, return_sequences=True, return_state=True, name='encoder_gru')
    encoder_out, encoder_state = encoder_gru(encoder_inputs)

    # Set up the decoder GRU, using `encoder_states` as initial state.
    decoder_gru = GRU(hidden_size, return_sequences=True, return_state=True, name='decoder_gru')
    decoder_out, decoder_state = decoder_gru(decoder_inputs, initial_state=encoder_state)

    # Attention layer
    attn_layer = AttentionLayer(name='attention_layer')
    attn_out, attn_states = attn_layer([encoder_out, decoder_out])

    # Concat attention input and decoder GRU output
    decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_out, attn_out])

    # Dense layer
    dense = Dense(res_vsize, activation='softmax', name='softmax_layer')
    dense_time = TimeDistributed(dense, name='time_distributed_layer')
    decoder_pred = dense_time(decoder_concat_input)

    # Full model
    full_model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_pred)
    full_model.compile(optimizer='adam', loss='categorical_crossentropy')

    full_model.summary()
    return full_model

In [94]:
full_model = define_nmt(
        hidden_size=hidden_size, batch_size=batch_size,
        con_timesteps=context_timesteps, res_timesteps=response_timesteps,
        con_vsize=context_vsize, res_vsize=response_vsize)

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(64, 40, 1477)]     0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     [(64, 39, 1478)]     0                                            
__________________________________________________________________________________________________
encoder_gru (GRU)               [(64, 40, 96), (64,  453312      encoder_inputs[0][0]             
__________________________________________________________________________________________________
decoder_gru (GRU)               [(64, 39, 96), (64,  453600      decoder_inputs[0][0]             
                                                                 encoder_gru[0][1]     

In [95]:
from keras.utils.np_utils import to_categorical

def train(full_model, contexts_seq, reponses_seq, batch_size, n_epochs=10):
    """ Training the model """

    for ep in range(n_epochs):
        losses = []
        for bi in range(0, contexts_seq.shape[0] - batch_size, batch_size):

            contexts_onehot_seq = to_categorical(contexts_seq[bi:bi + batch_size, :], num_classes=context_vsize)
            responses_onehot_seq = to_categorical(responses_seq[bi:bi + batch_size, :], num_classes=response_vsize)


            l = full_model.evaluate([contexts_onehot_seq, responses_onehot_seq[:, :-1, :]], responses_onehot_seq[:, 1:, :],
                                    batch_size=batch_size, verbose=0)

            losses.append(l)
        if (ep + 1) % 1 == 0:
            print("Loss in epoch {}: {}".format(ep + 1, np.mean(losses)))



In [96]:
n_epochs = 10
train(full_model, contexts_seq, responses_seq, batch_size, n_epochs)

Loss in epoch 1: 7.121411453593861
Loss in epoch 2: 5.990355881777677
Loss in epoch 3: 1.419284164905548
Loss in epoch 4: 1.1526976444504478
Loss in epoch 5: 1.0467478145252576
Loss in epoch 6: 1.0079345107078552
Loss in epoch 7: 0.9668456532738425
Loss in epoch 8: 0.940286018631675
Loss in epoch 9: 0.9212604869495739
Loss in epoch 10: 0.9036253365603361
