# Honor Project: Custom Conversational Model
- cornell dataset is used to train the model
- GloVe embeddings 300D is used
- LSTM + attention model

### Data Description
Data fields are separated by " +++$+++ "
- `movie_lines.txt`:<br>
    - lineId
    - characterId
    - movieId
    - character name
    - text of utterance
            
- `movie_conversations.txt`:<br>
    - characterID of the first character involved in the conversation
    - characterID of the second character involved in the conversation
    - movieId
    - list of the utterances

In [1]:
import numpy as np
import pandas as pd
import re
import random
import time
import nltk
import warnings
warnings.filterwarnings('ignore')

from nltk.corpus import stopwords

In [56]:
# load in datasets
conversations_df = open('data/cornell/movie_conversations.txt').read().split('\n')
conversations_df = [line.split(' +++$+++ ') for line in conversations_df]

lines_df = open('data/cornell/movie_lines.txt', encoding='utf8', errors='ignore').read().split('\n')
lines_df = [line.split(' +++$+++ ') for line in lines_df]

In [57]:
# dictionary to convert id to line
id2line = {}
for line in lines_df:
    if len(line)==5:
        id2line[line[0]] = line[4]
        
# list of utterances id
conversations = []
for line in conversations_df:
    line = line[-1][1:-1]
    line = line.replace("'", "").split(', ')
    conversations.append(line)

In [58]:
# question, answer
questions = []
answers = []

for conv in conversations:
    for i in range(len(conv)-1):
        questions.append(id2line[conv[i]])
        answers.append(id2line[conv[i+1]])

In [59]:
display_num = 5
for i in range(display_num):
    print(questions[i])
    print(answers[i])
    print()

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
Well, I thought we'd start with pronunciation, if that's okay with you.

Well, I thought we'd start with pronunciation, if that's okay with you.
Not the hacking and gagging and spitting part.  Please.

Not the hacking and gagging and spitting part.  Please.
Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?

You're asking me out.  That's so cute. What's your name again?
Forget it.

No, no, it's my fault -- we didn't have a proper introduction ---
Cameron.



In [60]:
# output questions, answers
questions_path = 'data/cornell/questions.txt'
answers_path = 'data/cornell/answers.txt'

out = open(questions_path, 'w')
for line in questions:
    print(line, sep='\t', file=out)
out.close()

out = open(answers_path, 'w')
for line in answers:
    print(line, sep='\t', file=out)
out.close()

### GloVe Embeddings
glove.6B.300d is used in this model. We should get the vocabulary as close to the embeddings as possible. Let's create vocabulary from raw train data and check the intersection between our vocabulary and the embeddins.

In [61]:
questions = []
with open('data/cornell/questions.txt', 'r') as f:
    for line in f:
        questions.append(line.strip())
    
answers = []
with open('data/cornell/answers.txt', 'r') as f:
    for line in f:
        answers.append(line.strip())

In [62]:
# split the datasets to train, test set
from sklearn.model_selection import train_test_split

questions_train, questions_test, answers_train, answers_test = train_test_split(questions, answers, test_size=0.1)

In [63]:
# tokenize the data using TweetTokenizer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokenized_questions = [tokenizer.tokenize(sentence) for sentence in questions_train]
tokenized_answers = [tokenizer.tokenize(sentence) for sentence in answers_train]

In [64]:
display_num = 5

for i in range(display_num):
    print(tokenized_questions[i])
    print(tokenized_answers[i])
    print()

["He's", 'in', 'Jamaica', 'with', 'a', 'twenty-three', '-', 'year-old', '.', 'A', 'friend', 'of', 'my', "daughter's", '.', 'He', 'had', 'the', 'fucking', 'nerve', 'to', 'call', 'me', 'and', 'ask', 'me', 'to', 'borrow', 'some', 'money', 'and', 'I', 'told', 'him', 'to', 'fuck', 'off', ',', 'so', 'he', 'asked', 'me', 'to', 'sell', 'his', 'singles', 'collection', 'and', 'send', 'him', 'a', 'check', 'for', 'whatever', 'I', 'go', ',', 'minus', 'a', 'ten', 'percent', 'commission', '.', 'Which', 'reminds', 'me', '.', 'Can', 'you', 'make', 'sure', 'you', 'give', 'me', 'a', 'five', '?', 'I', 'want', 'to', 'frame', 'it', 'and', 'put', 'it', 'on', 'the', 'wall', '.']
['It', 'must', 'have', 'taken', 'him', 'a', 'long', 'time', 'to', 'get', 'them', 'together', '.']

['The', 'one', 'where', 'you', 'go', 'to', 'the', 'slave', 'market', '.', 'You', 'can', 'cut', 'right', 'to', 'the', 'scene', 'where', 'John', 'the', 'Baptist', '-']
['Cut', 'away', 'from', 'me', '?']

['What', '?', 'Why', '?']
['The', '

In [65]:
# build vocabulary with number of occurances
vocab_occ = {}
for dataset in [tokenized_questions, tokenized_answers]:
    for sentence in dataset:
        for word in sentence:
            vocab_occ[word] = vocab_occ.get(word, 0) + 1

In [66]:
# load GloVe
from gensim.models import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

# convert GloVe vectors into the word2vec
glove_file = 'glove.6B.300d.txt'
tmp_file = 'glove_word2vec.txt'
glove2word2vec(glove_file, tmp_file)

embeddings = KeyedVectors.load_word2vec_format(tmp_file)

In [67]:
import operator

def check_coverage(vocab, embeddings):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in vocab:
        try:
            a[word] = embeddings[word]
            k += vocab[word]
        except:
            oov[word] = vocab[word]
            i += vocab[word]
    print('Found embeddings for {:.2%} of vocab'.format(len(a)/len(vocab)))
    print('Found embeddings for {:.2%} of all texts'.format(k/(k+i)))
    
    sorted_oov = sorted(oov.items(), key=operator.itemgetter(1))[::-1]
    
    return sorted_oov          

In [68]:
oov = check_coverage(vocab_occ, embeddings)

Found embeddings for 44.14% of vocab
Found embeddings for 78.92% of all texts


Only 44.23% of vocabulary has embeddings in GloVe. Let's take a look at oov.

In [69]:
oov[:15]

[('I', 136746),
 ('You', 36536),
 ("I'm", 29491),
 ("don't", 26419),
 ('What', 22480),
 ('No', 15840),
 ("It's", 13896),
 ('The', 13891),
 ('And', 13490),
 ('But', 10886),
 ('Well', 9876),
 ('Oh', 9683),
 ("you're", 9571),
 ("it's", 9508),
 ('He', 9373)]

Most of oov is capital letter. Let's convert all letter into lowercase.

In [70]:
# vocab of lowercase
vocab_occ = {}
for dataset in [tokenized_questions, tokenized_answers]:
    for sentence in dataset:
        for word in sentence:
            word = word.lower()
            vocab_occ[word] = vocab_occ.get(word, 0) + 1

In [71]:
oov = check_coverage(vocab_occ, embeddings)

Found embeddings for 71.05% of vocab
Found embeddings for 94.22% of all texts


In [72]:
oov[:15]

[("don't", 32074),
 ("i'm", 29564),
 ("it's", 23451),
 ("you're", 17785),
 ("that's", 14573),
 ("can't", 8681),
 ("i'll", 8621),
 ("he's", 8593),
 ("didn't", 7907),
 ("i've", 6990),
 ("what's", 6155),
 ("we're", 5721),
 ("there's", 5028),
 ('</u>', 4494),
 ('<u>', 4491)]

71.10% of vocabulary has embeddings now.<br>
Most of oov has n't, 'd, 've. Let's try nltk.word_tokenizer.

In [73]:
import nltk

tokenized_questions = [nltk.word_tokenize(sentence) for sentence in questions_train]
tokenized_answers = [nltk.word_tokenize(sentence) for sentence in answers_train]

In [74]:
# vocab of lowercase
vocab_occ = {}
for dataset in [tokenized_questions, tokenized_answers]:
    for sentence in dataset:
        for word in sentence:
            word = word.lower()
            vocab_occ[word] = vocab_occ.get(word, 0) + 1

In [75]:
oov = check_coverage(vocab_occ, embeddings)

Found embeddings for 74.34% of vocab
Found embeddings for 99.26% of all texts


74.33% now! Let's start building model.

In [76]:
# output questions_train, answers_train, questions_test, answers_test
questions_train_path = 'data/cornell/questions_train.txt'
answers_train_path = 'data/cornell/answers_train.txt'
questions_test_path = 'data/cornell/questions_test.txt'
answers_test_path = 'data/cornell/answers_test.txt'

dataset_list = [questions_train, answers_train, questions_test, answers_test]
path_list = [questions_train_path, answers_train_path, questions_test_path, answers_test_path]

for dataset, path in zip(dataset_list, path_list):
    out = open(path, 'w')
    for line in dataset:
        print(line, sep='\t', file=out)
    out.close()

### Data Preprocessing
Datasets should be tokenized by nltk.word_tokenize, then lowercased.<br>
After that, OOV should be replaced by <UNK> token.

In [2]:
questions_train_path = 'data/cornell/questions_train.txt'
answers_train_path = 'data/cornell/answers_train.txt'
questions_test_path = 'data/cornell/questions_test.txt'
answers_test_path = 'data/cornell/answers_test.txt'

questions_train, answers_train, questions_test, answers_test = [], [], [], []

dataset_list = [questions_train, answers_train, questions_test, answers_test]
path_list = [questions_train_path, answers_train_path, questions_test_path, answers_test_path]

for dataset, path in zip(dataset_list, path_list):
    with open(path, 'r') as f:
        for line in f:
            dataset.append(line.strip())

In [3]:
start_symbol = '<S>'
end_symbol = '</S>'
padding_symbol = '<PAD>'
unknown_symbol = '<UNK>'

special_symbols = [start_symbol, end_symbol, padding_symbol, unknown_symbol]

In [4]:
def clean_text(text):               
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"[<>]", "", text)
                               
    return text

In [5]:
def preprocess_dataset(dataset):
    preprocessed_dataset = []
    for sentence in dataset:
        cleaned_sentence = clean_text(sentence)
        tokenized_sentence = nltk.word_tokenize(cleaned_sentence)
        final_sentence = [word.lower() for word in tokenized_sentence]
        preprocessed_dataset.append(final_sentence)
        
    return preprocessed_dataset

In [8]:
# train set
tokenized_questions_train = preprocess_dataset(questions_train)
tokenized_answers_train = preprocess_dataset(answers_train)

# test set
tokenized_questions_test = preprocess_dataset(questions_test)
tokenized_answers_test = preprocess_dataset(answers_test)

In [9]:
# check occurances of words
vocab_occ = {}
for dataset in [tokenized_questions_train, tokenized_answers_train]:
    for sentence in dataset:
        for word in sentence:
            vocab_occ[word] = vocab_occ.get(word, 0) + 1
vocab_occ = sorted(vocab_occ.items(), key=lambda kv: kv[1])[::-1]

In [10]:
vocab_occ

[('.', 435695),
 (',', 217326),
 ('you', 193685),
 ('i', 185806),
 ('?', 147218),
 ('the', 126954),
 ('to', 104712),
 ('a', 92028),
 ('it', 85906),
 ("'s", 85811),
 ("n't", 72931),
 ('...', 65942),
 ('do', 62397),
 ('that', 61177),
 ('and', 59657),
 ('of', 50743),
 ('what', 50277),
 ('!', 45558),
 ('in', 44176),
 ('me', 42152),
 ('is', 40728),
 ('we', 36633),
 ('he', 35769),
 ('--', 34923),
 ('this', 30758),
 ('for', 30332),
 ('have', 29847),
 ("'m", 29771),
 ('know', 28871),
 ('was', 28139),
 ("'re", 28127),
 ('your', 27037),
 ('my', 26879),
 ('not', 26590),
 ('no', 26048),
 ('be', 25114),
 ('on', 24964),
 ('but', 23009),
 ('with', 22591),
 ('are', 22534),
 ('they', 22111),
 ('just', 20637),
 ('like', 19609),
 ('all', 19596),
 ('did', 19357),
 ('about', 18665),
 ('there', 18608),
 ("'ll", 18382),
 ('get', 18054),
 ('so', 17363),
 ('if', 17167),
 ('got', 17051),
 ('out', 17040),
 ('here', 16278),
 ('she', 16203),
 ('him', 15629),
 ('how', 14976),
 ('up', 14970),
 ('can', 14934),
 ('wan

Stopwords, "i", "you" are too frequent. Should be removed?

In [120]:
def build_dict(tokenized_questions, tokenized_answers, special_symbols):
    word2id = {}
    id2word = []
    
    for special_symbol in special_symbols:
        id2word.append(special_symbol)
        word2id[special_symbol] = id2word.index(special_symbol)
        
    vocab_set = set(word for dataset in [tokenized_questions, tokenized_answers]
                    for sentence in dataset
                    for word in sentence
                    if word not in special_symbols)
     
    for word in vocab_set:
        id2word.append(word)
        word2id[word] = id2word.index(word)
        
    return word2id, id2word

In [121]:
word2id, id2word = build_dict(tokenized_questions_train, tokenized_answers_train, special_symbols)

In [163]:
def replace_with_unk(dataset, word2id):
    replaced_dataset = []
    for sentence in dataset:
        for i, word in enumerate(sentence):
            if word not in word2id.keys():
                sentence[i] = '<UNK>'
        replaced_dataset.append(sentence)
        
    return replaced_dataset

In [164]:
tokenized_questions_test = replace_with_unk(tokenized_questions_test, word2id)
tokenized_answers_test = replace_with_unk(tokenized_answers_test, word2id)

In [165]:
# check oov
oov = []
for dataset in [tokenized_questions_test, tokenized_answers_test]:
    for sentence in dataset:
        for word in sentence:
            if word not in word2id.keys():
                oov.append(word)
set(oov)

set()

In [168]:
# number of <UNK> in test set
count = 0
for dataset in [tokenized_questions_test, tokenized_answers_test]:
    for sentence in dataset:
        for word in sentence:
            if word == '<UNK>':
                count += 1
                
count 

1946

In [124]:
def build_embeddings(word2id, embeddings, dim=300):
    vocab_size = len(word2id)
    embedding_matrix = np.random.normal(0, 1, (vocab_size, dim))
    
    for word, i in word2id.items():
        try:
            embedding_vector = embeddings.get_vector(word)
            embedding_matrix[i] = embedding_vector
        except:
            continue
            
    return embedding_matrix

In [125]:
customized_embeddings = build_embeddings(word2id, embeddings, 300)

In [127]:
# save customized_embeddings
path = 'word_embeddings.txt'

np.savetxt(path, customized_embeddings, delimiter=' ')

In [128]:
# save word2id dictionary
path = 'word2id.txt'

out = open(path, 'w')
for word, i in word2id.items():
    print(word, i, sep=' ', file=out)
out.close()

In [129]:
def sentence_to_ids(tokenized_sentence, word2id, padded_len):
    num_pad = max(0, padded_len - 1 - len(tokenized_sentence))
    sent = tokenized_sentence[:padded_len-1] + ['</S>']
    sent = sent + ['<PAD>']*num_pad
    sent_ids = [word2id[word] for word in sent]
    
    sent_len = min(len(tokenized_sentence)+1, padded_len)
    
    return sent_ids, sent_len

In [130]:
def ids_to_sentence(ids, id2word):
    return [id2word[i] for i in ids]

In [131]:
def batch_to_ids(sentences, word2id, max_len):
    max_len_in_batch = min(max(len(s) for s in sentences) + 1, max_len)
    batch_ids, batch_ids_len = [], []
    for sentence in sentences:
        ids, ids_len = sentence_to_ids(sentence, word2id, max_len_in_batch)
        batch_ids.append(ids)
        batch_ids_len.append(ids_len)
        
    return batch_ids, batch_ids_len

In [132]:
def generate_batches(samples, batch_size=32):
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        yield X, Y

In [133]:
sentences = list(zip(tokenized_questions_train, tokenized_answers_train))[0]
ids, sent_lens = batch_to_ids(sentences, word2id, max_len=10)

print('Input:', sentences)
print('Ids: {}\nSentences lengths: {}'.format(ids, sent_lens))

Input: (['he', "'s", 'in', 'jamaica', 'with', 'a', 'twenty-three-', 'year-old', '.', 'a', 'friend', 'of', 'my', 'daughter', "'s", '.', 'he', 'had', 'the', 'fucking', 'nerve', 'to', 'call', 'me', 'and', 'ask', 'me', 'to', 'borrow', 'some', 'money', 'and', 'i', 'told', 'him', 'to', 'fuck', 'off', ',', 'so', 'he', 'asked', 'me', 'to', 'sell', 'his', 'singles', 'collection', 'and', 'send', 'him', 'a', 'check', 'for', 'whatever', 'i', 'go', ',', 'minus', 'a', 'ten', 'percent', 'commission', '.', 'which', 'reminds', 'me', '.', 'can', 'you', 'make', 'sure', 'you', 'give', 'me', 'a', 'five', '?', 'i', 'want', 'to', 'frame', 'it', 'and', 'put', 'it', 'on', 'the', 'wall', '.'], ['it', 'must', 'have', 'taken', 'him', 'a', 'long', 'time', 'to', 'get', 'them', 'together', '.'])
Ids: [[19918, 52248, 48184, 28312, 21456, 32223, 17323, 26616, 45234, 1], [23978, 19668, 19627, 15925, 26524, 32223, 48496, 28852, 29618, 1]]
Sentences lengths: [10, 10]


## Encoder-Decoder architecture

In [134]:
import tensorflow as tf

In [135]:
class Seq2SeqModel(object):
    pass

First, we need to create [placeholders](https://www.tensorflow.org/api_guides/python/io_ops#Placeholders) to specify what data we are going to feed into the network during the execution time. For this task we will need:
 - *input_batch* — sequences of sentences (the shape will equal to [batch_size, max_sequence_len_in_batch]);
 - *input_batch_lengths* — lengths of not padded sequences (the shape equals to [batch_size]);
 - *ground_truth* — sequences of groundtruth (the shape will equal to [batch_size, max_sequence_len_in_batch]);
 - *ground_truth_lengths* — lengths of not padded groundtruth sequences (the shape equals to [batch_size]);
 - *dropout_ph* — dropout keep probability; this placeholder has a predifined value 1;
 - *learning_rate_ph* — learning rate.

In [136]:
def declare_placeholders(self):
    # placeholders for input and its actual lengths
    self.input_batch = tf.placeholder(shape=(None, None), dtype=tf.int32, name='input_batch')
    self.input_batch_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='input_batch_lengths')
    
    # placeholders for groundtruth and its actual lenghts
    self.ground_truth = tf.placeholder(shape=(None, None), dtype=tf.int32, name='ground_truth')
    self.ground_truth_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='ground_truth_lengths')
    
    # placeholders for dropout_rate and learning_rate
    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    self.learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[])

In [137]:
Seq2SeqModel.__declare_placeholders = classmethod(declare_placeholders)

In [138]:
def create_embeddings(self, embeddings_matrix):
    self.embeddings = tf.get_variable(name='embeddings', 
                                     shape=embeddings_matrix.shape,
                                     initializer=tf.constant_initializer(embeddings_matrix),
                                     trainable=False)
    self.input_batch_embedded = tf.nn.embedding_lookup(self.embeddings, self.input_batch)

In [139]:
Seq2SeqModel.__create_embeddings = classmethod(create_embeddings)

### Encoder

In [140]:
def build_encoder(self, hidden_size):
    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)
    
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)
    
    output, final_state = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=forward_cell,
        cell_bw=backward_cell,
        inputs=self.input_batch_embedded,
        sequence_length=self.input_batch_lengths,
        dtype=tf.float32)
    
    self.encoder_output = tf.concat([output[0], output[1]], axis=2)
    
    encoder_final_state_c = tf.concat([final_state[0].c, final_state[1].c], axis=1)
    encoder_final_state_h = tf.concat([final_state[0].h, final_state[1].h], axis=1)
    self.encoder_final_state = tf.contrib.rnn.LSTMStateTuple(c=encoder_final_state_c, h=encoder_final_state_h)

In [141]:
Seq2SeqModel.__build_encoder = classmethod(build_encoder)

### Decoder

In [142]:
def build_decoder(self, hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id):
    batch_size = tf.shape(self.input_batch)[0]
    start_tokens = tf.fill([batch_size], start_symbol_id)
    ground_truth_as_input = tf.concat([tf.expand_dims(start_tokens, 1), self.ground_truth], 1)
    
    # Use the embedding layer defined before to lookup embedings for ground_truth_as_input
    self.ground_truth_embedded = tf.nn.embedding_lookup(self.embeddings, ground_truth_as_input)
    
    # Create TrainingHelper for the train stage
    train_helper = tf.contrib.seq2seq.TrainingHelper(self.ground_truth_embedded,
                                                     self.ground_truth_lengths)
        
    # Create GreedyEmbeddingHelper for the inference stage
    infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(self.embeddings, start_tokens, end_symbol_id)
    
    def decode(helper, scope, reuse=None):
        """Creates decoder and return the results of the decoding with a given helper."""
        
        with tf.variable_scope(scope, reuse=reuse):
            attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                num_units=hidden_size, 
                memory=self.encoder_output,
                memory_sequence_length=self.input_batch_lengths)
            
            cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size*2, reuse=reuse)
#             cell = tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size, reuse=reuse),
#                                                 input_keep_prob=self.dropout_ph,
#                                                 output_keep_prob=self.dropout_ph,
#                                                 state_keep_prob=self.dropout_ph)
    
            attention_cell = tf.contrib.seq2seq.AttentionWrapper(
                cell, attention_mechanism, attention_layer_size=hidden_size)
#             decoder_cell = tf.contrib.rnn.DropoutWrapper(tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size, reuse=reuse))
            
            decoder_cell = tf.contrib.rnn.OutputProjectionWrapper(
                attention_cell, vocab_size, reuse=reuse)
            
            decoder_initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=batch_size)
            decoder = tf.contrib.seq2seq.BasicDecoder(
                cell=decoder_cell,
                helper=helper,
                initial_state=decoder_initial_state)
            
            outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
                decoder=decoder,
                maximum_iterations=max_iter,
                output_time_major=False,
                impute_finished=True)
            
            return outputs
    
    self.train_outputs = decode(train_helper, 'decode')
    self.infer_outputs = decode(infer_helper, 'decode', reuse=True)

In [143]:
Seq2SeqModel.__build_decoder = classmethod(build_decoder)

In [144]:
def compute_loss(self):
    """Computes sequence loss (masked cross-entopy loss with logits)."""
    
    weights = tf.cast(tf.sequence_mask(self.ground_truth_lengths), dtype=tf.float32)
    
    self.loss = tf.contrib.seq2seq.sequence_loss(self.train_outputs.rnn_output,
                                                 self.ground_truth,
                                                 weights)

In [145]:
Seq2SeqModel.__compute_loss = classmethod(compute_loss)

In [146]:
def perform_optimization(self):
    self.train_op = tf.contrib.layers.optimize_loss(loss=self.loss,
                                                    optimizer='Adam',
                                                    learning_rate=self.learning_rate_ph,
                                                    clip_gradients=1.0,
                                                    global_step=tf.train.get_global_step())

In [147]:
Seq2SeqModel.__perform_optimization = classmethod(perform_optimization)

In [148]:
def init_model(self, embeddings_matrix, hidden_size, vocab_size, max_iter, 
               start_symbol_id, end_symbol_id, padding_symbol_id):
    self.__declare_placeholders()
    self.__create_embeddings(embeddings_matrix)
    self.__build_encoder(hidden_size)
    self.__build_decoder(hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id)
    
    self.__compute_loss()
    self.__perform_optimization()
    
    self.train_predictions = self.train_outputs.sample_id
    self.infer_predictions = self.infer_outputs.sample_id

In [149]:
Seq2SeqModel.__init__ = classmethod(init_model)

## Train the network and predict output

In [150]:
def train_on_batch(self, session, X, X_seq_len, Y, Y_seq_len, learning_rate, dropout_keep_probability):
    feed_dict = {
            self.input_batch: X,
            self.input_batch_lengths: X_seq_len,
            self.ground_truth: Y,
            self.ground_truth_lengths: Y_seq_len,
            self.learning_rate_ph: learning_rate,
            self.dropout_ph: dropout_keep_probability
        }
    pred, loss, _ = session.run([
            self.train_predictions,
            self.loss,
            self.train_op], feed_dict=feed_dict)
    return pred, loss

In [151]:
Seq2SeqModel.train_on_batch = classmethod(train_on_batch)

In [152]:
def predict_for_batch(self, session, X, X_seq_len):
    feed_dict = {self.input_batch: X, self.input_batch_lengths: X_seq_len}
    pred = session.run([
            self.infer_predictions
        ], feed_dict=feed_dict)[0]
    return pred

def predict_for_batch_with_loss(self, session, X, X_seq_len, Y, Y_seq_len):
    feed_dict = {self.input_batch: X, 
                 self.input_batch_lengths: X_seq_len,
                 self.ground_truth: Y,
                 self.ground_truth_lengths: Y_seq_len}
    pred, loss = session.run([
            self.infer_predictions,
            self.loss,
        ], feed_dict=feed_dict)
    return pred, loss

In [153]:
Seq2SeqModel.predict_for_batch = classmethod(predict_for_batch)
Seq2SeqModel.predict_for_batch_with_loss = classmethod(predict_for_batch_with_loss)

In [249]:
tf.reset_default_graph()

model = Seq2SeqModel(
    embeddings_matrix=customized_embeddings,
    hidden_size=128,
    vocab_size=customized_embeddings.shape[0],
    max_iter=15, 
    start_symbol_id=word2id['<S>'],
    end_symbol_id=word2id['</S>'],
    padding_symbol_id=word2id['<PAD>'])

batch_size = 32
n_epochs = 10
learning_rate = 0.001
dropout_keep_probability = 0.5
max_len = 15
learning_rate_decay = 0.75
min_learning_rate = 0.0001

n_step = int(len(questions_train)/batch_size)

In [250]:
session = tf.Session()
session.run(tf.global_variables_initializer())

all_model_predictions = []
all_ground_truth = []

checkpoint = "model/2_trial/best_model.ckpt"
stop_early = 0
stop = 5
# validation_check = ((len(tokenized_questions_train))//batch_size//2)-1
summary_test_loss = []

train_set = list(zip(tokenized_questions_train, tokenized_answers_train))
test_set = list(zip(tokenized_questions_test, tokenized_answers_test))

for epoch in range(n_epochs):
    random.shuffle(train_set)
    random.shuffle(test_set)
       
    print('-'*30)
    print('Train: epoch', epoch + 1)
    for n_iter, (X_batch, Y_batch) in enumerate(generate_batches(train_set, batch_size)):
        start_time = time.time()      
        X_ids, X_sent_lens = batch_to_ids(X_batch, word2id, max_len)
        Y_ids, Y_sent_lens = batch_to_ids(Y_batch, word2id, max_len)
        
        predictions, loss = model.train_on_batch(
            session,
            X_ids,
            X_sent_lens,
            Y_ids,
            Y_sent_lens,
            learning_rate,
            dropout_keep_probability)
        
        end_time = time.time()
        batch_time = end_time - start_time
        if n_iter % 200 == 0:
            print("Epoch: {:>3}/{}, Step: {:>4}/{}, Loss: {:>6.3f}, Seconds: {:>4.2f}"
                  .format(epoch+1, n_epochs, n_iter+1, n_step, loss, batch_time*200))
            print('')
#             print("Epoch: [%d/%d], step: [%d/%d], loss: %f" % (epoch+1, n_epochs, n_iter+1, n_step, loss))
    
    start_time = time.time()
    epoch_test_loss = []
    for n_iter, (X_batch, Y_batch) in enumerate(generate_batches(test_set, batch_size=batch_size)):        
        X, X_sent_lens = batch_to_ids(X_sent, word2id, max_len)
        Y, Y_sent_lens = batch_to_ids(Y_sent, word2id, max_len)

        predictions, loss = model.predict_for_batch_with_loss(
            session,
            X,
            X_sent_lens,
            Y,
            Y_sent_lens)
        
        epoch_test_loss.append(loss)
        
    end_time = time.time()
    batch_time = end_time - start_time
    print('Test: epoch', epoch+1, 'loss', np.mean(epoch_test_loss), 'Second:', batch_time)   
    for x, y, p in list(zip(X, Y, predictions))[:3]:
        print('X:', ' '.join(ids_to_sentence(x, id2word)))
        print('Y:', ' '.join(ids_to_sentence(y, id2word)))
        print('O:', ' '.join(ids_to_sentence(p, id2word)))
        print('')
    
    # reduce learning rate
    learning_rate *= learning_rate_decay
    learning_rate = max(learning_rate, min_learning_rate)
    
    summary_test_loss.append(np.mean(epoch_test_loss))
    if loss <= min(summary_test_loss):
        print('New Record!')
        print('')
        stop_early = 0
        saver = tf.train.Saver()
        saver.save(session, checkpoint)
    else:
        print('No Improvement')
        stop_early += 1
        if stop_early == stop:
            break
            
print('\n...training finished.')

------------------------------
Train: epoch 1
Epoch:   1/10, Step:    1/6232, Loss: 10.960, Seconds: 462.76

Epoch:   1/10, Step:  201/6232, Loss:  5.357, Seconds: 62.03

Epoch:   1/10, Step:  401/6232, Loss:  4.995, Seconds: 61.63

Epoch:   1/10, Step:  601/6232, Loss:  4.421, Seconds: 62.23

Epoch:   1/10, Step:  801/6232, Loss:  4.598, Seconds: 62.03

Epoch:   1/10, Step: 1001/6232, Loss:  4.427, Seconds: 61.64

Epoch:   1/10, Step: 1201/6232, Loss:  4.612, Seconds: 61.83

Epoch:   1/10, Step: 1401/6232, Loss:  4.435, Seconds: 61.83

Epoch:   1/10, Step: 1601/6232, Loss:  4.366, Seconds: 61.83

Epoch:   1/10, Step: 1801/6232, Loss:  4.608, Seconds: 61.84

Epoch:   1/10, Step: 2001/6232, Loss:  4.251, Seconds: 61.44

Epoch:   1/10, Step: 2201/6232, Loss:  4.604, Seconds: 61.63

Epoch:   1/10, Step: 2401/6232, Loss:  4.492, Seconds: 61.84

Epoch:   1/10, Step: 2601/6232, Loss:  4.551, Seconds: 61.83

Epoch:   1/10, Step: 2801/6232, Loss:  4.552, Seconds: 61.83

Epoch:   1/10, Step: 30

Epoch:   4/10, Step:  201/6232, Loss:  4.054, Seconds: 63.43

Epoch:   4/10, Step:  401/6232, Loss:  3.967, Seconds: 61.81

Epoch:   4/10, Step:  601/6232, Loss:  3.605, Seconds: 62.24

Epoch:   4/10, Step:  801/6232, Loss:  3.959, Seconds: 62.74

Epoch:   4/10, Step: 1001/6232, Loss:  3.713, Seconds: 62.71

Epoch:   4/10, Step: 1201/6232, Loss:  3.906, Seconds: 62.19

Epoch:   4/10, Step: 1401/6232, Loss:  3.877, Seconds: 62.37

Epoch:   4/10, Step: 1601/6232, Loss:  3.759, Seconds: 61.84

Epoch:   4/10, Step: 1801/6232, Loss:  3.970, Seconds: 62.43

Epoch:   4/10, Step: 2001/6232, Loss:  4.039, Seconds: 61.96

Epoch:   4/10, Step: 2201/6232, Loss:  3.613, Seconds: 62.20

Epoch:   4/10, Step: 2401/6232, Loss:  3.777, Seconds: 62.03

Epoch:   4/10, Step: 2601/6232, Loss:  3.690, Seconds: 63.03

Epoch:   4/10, Step: 2801/6232, Loss:  3.738, Seconds: 62.14

Epoch:   4/10, Step: 3001/6232, Loss:  3.760, Seconds: 62.21

Epoch:   4/10, Step: 3201/6232, Loss:  4.226, Seconds: 61.83

Epoch:  

KeyboardInterrupt: 

In [186]:
j=20
num_display = 5
for i in range(num_display):
    print('p:', model_predictions[i+j])
    print('y:', ground_truth[i+j])
    print()

p: i do n't know . 
y: speed -- two hundred , seventy-five thousand kilometers per second . 

p: i do n't know . 
y: maybe later in the week . first i 've got to find myself a job . 

p: i do n't know . 
y: honest ? 

p: i 'm not sure you 're going to be able to get a chance . 
y: no , no you 're right , i 'm sorry . he uses women ; he lets them kill 

p: i 'm sorry . 
y: come in , come in . 



### Attention-based Encoder-Decoder Model Conclusion
It toke almost 6 hrs to train, but only responses "i'm sorry", "i don't know"...

# Selective Model

In [2]:
questions_train_path = 'data/cornell/questions_train.txt'
answers_train_path = 'data/cornell/answers_train.txt'
questions_test_path = 'data/cornell/questions_test.txt'
answers_test_path = 'data/cornell/answers_test.txt'

questions_train, answers_train, questions_test, answers_test = [], [], [], []

dataset_list = [questions_train, answers_train, questions_test, answers_test]
path_list = [questions_train_path, answers_train_path, questions_test_path, answers_test_path]

for dataset, path in zip(dataset_list, path_list):
    with open(path, 'r') as f:
        for line in f:
            dataset.append(line.strip())

In [3]:
def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    
    replace_by_space_re = re.compile('[/(){}\[\]\|@,;]')
    bad_symbols_re = re.compile('[^0-9a-z #+_]')
    stopwords_set = set(stopwords.words('english'))
    tags_re = re.compile('<[^>]*>')
    arrow_re = re.compile('[<>]')

    text = text.lower()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    text = tags_re.sub('', text)
    text = arrow_re.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in stopwords_set])

    return text.strip()

In [5]:
question_answer_train = []
question_ids = []
for i in range(len(questions_train)):
    question = text_prepare(questions_train[i])
    answer = text_prepare(answers_train[i])
    if ((len(question)>0) & (len(answer)>0)):        
        question_answer_pair = question + '\t' + answer
        question_answer_train.append(question_answer_pair)
        question_ids.append(i)

In [49]:
path = 'data/cornell/question_answer_train.tsv'
out = open(path, 'w')
for line in question_answer_train:
    line = line.strip().split('\t')
#     new_line = [text_prepare(q) for q in line]
    print(*line, sep='\t', file=out)
out.close()

In [1]:
!starspace train -trainFile "data/cornell/question_answer_train.tsv" -model starspace_embedding \
-trainMode 3 \
-adagrad true \
-ngrams 1 \
-epoch 5 \
-dim 100 \
-similarity "cosine" \
-minCount 2 \
-verbose true \
-fileFormat labelDoc \
-negSearchLimit 10 \
-lr 0.05 \
-thread 4

Arguments: 
lr: 0.05
dim: 100
epoch: 5
maxTrainTime: 8640000
saveEveryEpoch: 0
loss: hinge
margin: 0.05
similarity: cosine
maxNegSamples: 10
negSearchLimit: 10
thread: 4
minCount: 2
minCountLabel: 1
label: __label__
ngrams: 1
bucket: 2000000
adagrad: 1
trainMode: 3
fileFormat: labelDoc
normalizeText: 0
dropoutLHS: 0
dropoutRHS: 0
Start to initialize starspace model.
Build dict from input file : data/cornell/question_answer_train.tsv
Read 2M words
Number of words in dictionary:  49914
Number of labels in dictionary: 0
Loading data from file : data/cornell/question_answer_train.tsv
Total number of examples loaded : 179611
Initialized model weights. Model size :
matrix : 49914 100
Training epoch 0: 0.05 0.01
Epoch: 100.0%  lr: 0.040000  loss: 0.077906  eta: 0h3m  tot: 0h0m45s  (20.0%)  lr: 0.049721  loss: 0.124236  eta: 0h4m  tot: 0h0m1s  (0.6%)48.9%  lr: 0.044525  loss: 0.087234  eta: 0h3m  tot: 0h0m22s  (9.8%)52.4%  lr: 0.044414  loss: 0.086244  eta: 0h3m  tot: 0h0m24s  (10.5%)54.4%  lr

In [11]:
def load_embeddings(embeddings_path):
    embeddings = {}
    for line in open(embeddings_path):
        word, *arr = line.split('\t')
        embeddings[word] = np.asarray(arr, dtype='float32')
        
    dim = len(arr)
    
    return embeddings, dim

In [12]:
starspace_embeddings, embeddings_dim = load_embeddings('starspace_embedding.tsv')

In [13]:
def question_to_vec(question, embeddings, dim):
    question2vec = [embeddings[word] for word in question.split() if word in embeddings]
    
    if not question2vec:
        return np.zeros(dim)
    
    question2vec = np.array(question2vec)
    
    return question2vec.mean(axis=0)

In [18]:
import pickle
question_matrix = np.zeros((len(question_answer_train), embeddings_dim), dtype=np.float32)

for i, question in enumerate(question_answer_train):
    question = question.split('\t')[0]
    question_matrix[i, :] = question_to_vec(question, starspace_embeddings, embeddings_dim)
file_name = 'question_matrix.pkl'
pickle.dump(question_matrix, open(file_name, 'wb'))

In [22]:
answer_pairs = []
for q_id in question_ids:
    answer_pairs.append(answers_train[q_id])

In [28]:
answer_pairs_path = 'answer_pair.txt'
out = open(answer_pairs_path, 'w')
for line in answer_pairs:
    print(line, sep='\t', file=out)
out.close()

In [30]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

def get_best_answer(question, word_embeddings, embeddings_dim, question_matrix, answer_pairs):
        """ Returns id of the most similar thread for the question.
            The search is performed across the threads with a given tag.
        """
        # HINT: you have already implemented a similar routine in the 3rd assignment.
        
        question = text_prepare(question)
        question_vec = question_to_vec(question, word_embeddings, embeddings_dim)
        best_answer_id = pairwise_distances_argmin(question_vec.reshape(1, -1), question_matrix)[0]
        
        return answer_pairs[best_answer_id]

### Evaluate Selective Model

In [50]:
questions = ["hi", "nice to meet you", "good to see you", "let's go party",
             "bye", "who are you?", "get out", "It's imperative",
             "why so serius?", "Where did he go?", "What does it cost?"]

for q in questions:
    answer = get_best_answer(q, starspace_embeddings, embeddings_dim, question_matrix, answer_pairs)
    print('Q:', q)
    print('A:', answer)
    print('')

Q: hi
A: We had a date.

Q: nice to meet you
A: Peace out, Craig.

Q: good to see you
A: Where's Plissken?

Q: let's go party
A: Of course, birthday and welcome home... who'll I ask?

Q: bye
A: Bye.

Q: who are you?
A: What'd you see, who was she with, where were they going?

Q: get out
A: I'm sure it won't be long now.

Q: It's imperative
A: Just cover me. It was built to move.

Q: why so serius?
A: What'd you see, who was she with, where were they going?

Q: Where did he go?
A: Okay.

Q: What does it cost?
A: You want sophistication, it don't come cheap.



### Conclustion
The model sometimes responses weirdly, but still much better than the Attention-based Seq2Seq model above. And it didn't take so long to train the Starspace word embeddings.