# DEEP LEARNING FOR CHATBOTS 

* Deep learning seminar: code review [1, 2]
* 김무성

This review draws almost entirely (about 99.9%) on the following materials and code:
* [1] DEEP LEARNING FOR CHATBOTS, PART 1 – INTRODUCTION - http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/
* [2] DEEP LEARNING FOR CHATBOTS, PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW - http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/
* [8] code - https://github.com/dennybritz/chatbot-retrieval/
* [9] data - https://drive.google.com/open?id=0B_bZck-ksdkpVEtVc1R6Y01HMWM

# Contents
* PART 1 – INTRODUCTION
    - Chatbots
    - A TAXONOMY OF MODELS
        - RETRIEVAL-BASED VS. GENERATIVE MODELS
        - LONG VS. SHORT CONVERSATIONS
        - OPEN DOMAIN VS. CLOSED DOMAIN
    - COMMON CHALLENGES
        - INCORPORATING CONTEXT
        - COHERENT PERSONALITY
        - EVALUATION OF MODELS
        - INTENTION AND DIVERSITY
    - HOW WELL DOES IT ACTUALLY WORK?
    - UPCOMING & READING LIST
* PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW
    - RETRIEVAL-BASED BOTS
    - THE UBUNTU DIALOG CORPUS
    - BASELINES
    - DUAL ENCODER LSTM
    - DATA PREPROCESSING
    - CREATING AN INPUT FUNCTION
    - DEFINING EVALUATION METRICS
    - BOILERPLATE TRAINING CODE
    - CREATING THE MODEL
    - EVALUATING THE MODEL
    - MAKING PREDICTIONS
    - CONCLUSION
* Code 

# PART 1 – INTRODUCTION
* Chatbots
* A TAXONOMY OF MODELS
* COMMON CHALLENGES
* HOW WELL DOES IT ACTUALLY WORK?
* UPCOMING & READING LIST

## Chatbots

* big bets - http://www.bloomberg.com/features/2016-microsoft-future-ai-chatbots/
* Operator - https://operator.com/
* x.ai - https://x.ai/
* Chatfuel - http://chatfuel.com/
* Howdy’s Botkit - http://howdy.ai/botkit/
* Microsoft Bot Framework - https://dev.botframework.com/

## A TAXONOMY OF MODELS
* RETRIEVAL-BASED VS. GENERATIVE MODELS
* LONG VS. SHORT CONVERSATIONS
* OPEN DOMAIN VS. CLOSED DOMAIN

### RETRIEVAL-BASED VS. GENERATIVE MODELS

* Retrieval-based models (easier)
    - Retrieval-based models (easier) use a repository of predefined responses and some kind of heuristic to pick an appropriate response based on the input and context.
    - These systems don’t generate any new text, they just pick a response from a fixed set.
* Generative models (harder)
    - Generative models (harder) don’t rely on pre-defined responses. They generate new responses from scratch.
    <img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/nct-seq2seq.png" width=600 />
* Deep Learning
    - Deep Learning techniques can be used for both retrieval-based and generative models, but research seems to be moving in the generative direction.
    - Sequence to Sequence model - http://arxiv.org/abs/1409.3215
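
A minimal sketch of the retrieval-based idea, assuming a tiny hand-written repository and a word-overlap heuristic. Both the `RESPONSES` table and the heuristic are illustrative assumptions, not taken from the original post:

```python
# Toy retrieval-based responder: it never generates text,
# it only picks one of the predefined responses.
# RESPONSES and the overlap heuristic are made up for illustration.
RESPONSES = {
    "how do i install a package": "Use apt-get install followed by the package name.",
    "my wifi is not working": "Check that the wireless driver is loaded.",
    "how do i update the system": "Run apt-get update and then apt-get upgrade.",
}

def tokenize(text):
    """Lowercase and split on whitespace, returning a set of words."""
    return set(text.lower().split())

def retrieve_response(user_input):
    """Pick the response whose key shares the most words with the input."""
    query = tokenize(user_input)
    best_key = max(RESPONSES, key=lambda key: len(query & tokenize(key)))
    return RESPONSES[best_key]

print(retrieve_response("wifi stopped working today"))
```

A generative model would instead produce the reply word by word, which is why it can say new things but can also make grammatical mistakes.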

### LONG VS. SHORT CONVERSATIONS

* Short-Text Conversations (easier)
* long conversations (harder)

### OPEN DOMAIN VS. CLOSED DOMAIN

* open domain (harder)
    - There isn’t necessarily a well-defined goal or intention.
    - Conversations on social media sites like 
        - Twitter and 
        - Reddit are typically open domain 
            - they can go into all kinds of directions.
* closed domain (easier) 
    - Technical Customer Support or 
    - Shopping Assistants
        - These systems don’t need to be able to talk about politics, 
        - they just need to fulfill their specific task as efficiently as possible.

## COMMON CHALLENGES
* INCORPORATING CONTEXT
* COHERENT PERSONALITY
* EVALUATION OF MODELS
* INTENTION AND DIVERSITY

### INCORPORATING CONTEXT

* linguistic context and physical context
* embed - https://en.wikipedia.org/wiki/Word_embedding
* Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models - http://arxiv.org/abs/1507.04808
* Attention with Intention for a Neural Network Conversation Model - http://arxiv.org/abs/1510.08565

### COHERENT PERSONALITY

* “How old are you?” and “What is your age?”
* A Persona-Based Neural Conversation Model - http://arxiv.org/abs/1603.06155
    <img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/Screen-Shot-2016-04-04-at-6.36.59-PM.png" />

### EVALUATION OF MODELS

* BLEU - https://en.wikipedia.org/wiki/BLEU
    - Common metrics such as BLEU that are used for Machine Translation and are based on text matching aren’t well suited because sensible responses can contain completely different words or phrases.
    - References
        - [3] Chapter 8. Evaluation (Statistical Machine Translation) - http://www.statmt.org/book/slides/08-evaluation.pdf
        - [4] BLEU (wikipedia) - https://en.wikipedia.org/wiki/BLEU
        - [5] BLEU ("Show and Tell: A Neural Image Caption Generator (CVPR 2015) slide") - https://docs.com/hana-lee/7849/show-and-tell-a-neural-image-caption-generator
        - [6] BLEU (5. blue slide) - http://www.slideshare.net/hiroshimatsumoto750/5-bleu
        - [7] Source code for nltk.align.bleu - http://www.nltk.org/_modules/nltk/align/bleu.html    
* How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation - http://arxiv.org/abs/1603.08023
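
To see why text-matching metrics struggle with dialogue, here is a rough sketch using clipped unigram precision, which is only one ingredient of BLEU (no higher-order n-grams, no brevity penalty). The example sentences are made up:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / sum(cand_counts.values())

reference = "i am twenty years old"
paraphrase = "my age is 20"           # sensible answer, different words
near_copy = "i am twenty feet old"    # nonsense, but many matching words

print(unigram_precision(paraphrase, reference))
print(unigram_precision(near_copy, reference))
```

The sensible paraphrase shares no words with the reference, while the nonsensical near-copy matches four of its five words.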

### INTENTION AND DIVERSITY

* A common problem with generative systems is that they tend to produce generic responses like 
    - “That’s great!” or 
    - “I don’t know” that work for a lot of input cases.
* Early versions of Google’s Smart Reply tended to respond with “I love you” to almost anything. 
    - Computer, respond to this email - https://research.googleblog.com/2015/11/computer-respond-to-this-email.html
*  Some researchers have tried to artificially promote diversity through various objective functions
    - A Diversity-Promoting Objective Function for Neural Conversation Models - http://arxiv.org/abs/1510.03055

## HOW WELL DOES IT ACTUALLY WORK?

* A retrieval-based open domain system 
    - is obviously impossible because you can never handcraft enough responses to cover all cases. 
* A generative open-domain system 
    - is almost Artificial General Intelligence (AGI) because it needs to handle all possible scenarios. We’re very far away from that as well (but a lot of research is going on in that area).
* This leaves us with problems in restricted domains where both generative and retrieval based methods are appropriate. 
* The longer the conversations and the more important the context, the more difficult the problem becomes.
* In a recent interview, Andrew Ng
    - Baidu research chief Andrew Ng fixed on self-taught computers, self-driving cars - http://www.seattletimes.com/business/baidu-research-chief-andrew-ng-fixed-on-self-taught-computers-self-driving-cars/
    - "<font color="red">Most of the value of deep learning today is in narrow domains where you can get a lot of data. Here’s one example of something it cannot do: have a meaningful conversation. There are demos, and if you cherry-pick the conversation, it looks like it’s having a meaningful conversation, but if you actually try it yourself, it quickly goes off the rails.</font>"
* Many companies start by outsourcing conversations to human workers and promise to automate them once they have collected enough data
* Grammatical mistakes
    - retrieval-based models
        - That’s why most systems are probably best off using retrieval-based methods that are free of grammatical errors and offensive responses. 
    - generative models
        - If companies can somehow get their hands on huge amounts of data then generative models become feasible
        -  but they must be assisted by other techniques to prevent them from going off the rails like Microsoft’s Tay did.
            - Microsoft is deleting its AI chatbot's incredibly racist tweets - http://www.businessinsider.com/microsoft-deletes-racist-genocidal-tweets-from-ai-chatbot-tay-2016-3

## UPCOMING & READING LIST

* Neural Responding Machine for Short-Text Conversation (2015-03) 
    - http://arxiv.org/abs/1503.02364
* A Neural Conversational Model (2015-06)
    - http://arxiv.org/abs/1506.05869
* A Neural Network Approach to Context-Sensitive Generation of Conversational Responses (2015-06)
    - http://arxiv.org/abs/1506.06714
* The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems (2015-06)
    - http://arxiv.org/abs/1506.08909
* Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models (2015-07)
    - http://arxiv.org/abs/1507.04808
* A Diversity-Promoting Objective Function for Neural Conversation Models (2015-10)
    - http://arxiv.org/abs/1510.03055
* Attention with Intention for a Neural Network Conversation Model (2015-10)
    - http://arxiv.org/abs/1510.08565
* Improved Deep Learning Baselines for Ubuntu Corpus Dialogs (2015-10)
    - http://arxiv.org/abs/1510.03753
* A Survey of Available Corpora for Building Data-Driven Dialogue Systems (2015-12)
    - http://arxiv.org/abs/1512.05742
* Incorporating Copying Mechanism in Sequence-to-Sequence Learning (2016-03)
    - http://arxiv.org/abs/1603.06393
* A Persona-Based Neural Conversation Model (2016-03)
    - http://arxiv.org/abs/1603.06155
* How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation (2016-03)
    - http://arxiv.org/abs/1603.08023

# PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW
* RETRIEVAL-BASED BOTS
* THE UBUNTU DIALOG CORPUS
* BASELINES
* DUAL ENCODER LSTM
* DATA PREPROCESSING
* CREATING AN INPUT FUNCTION
* DEFINING EVALUATION METRICS
* BOILERPLATE TRAINING CODE
* CREATING THE MODEL
* EVALUATING THE MODEL
* MAKING PREDICTIONS
* CONCLUSION

#### part2 tutorial info
* This code uses Python 3 and Tensorflow >= 0.9.
* numpy, scikit-learn, pandas
* code - https://github.com/dennybritz/chatbot-retrieval/
* data - https://drive.google.com/open?id=0B_bZck-ksdkpVEtVc1R6Y01HMWM    

## RETRIEVAL-BASED BOTS

* In this post we’ll implement a retrieval-based bot. 
* The vast majority of production systems today are retrieval-based, or a combination of retrieval-based and generative.

## THE UBUNTU DIALOG CORPUS

* Ubuntu Dialog Corpus 
    - paper : The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems - http://arxiv.org/abs/1506.08909
    - github - https://github.com/rkadlec/ubuntu-ranking-dataset-creator

#### training data

* The training data consists of 
    - 1,000,000 examples, 
        - 50% positive (label 1) and 
        - 50% negative (label 0). 
    - Each example consists of 
        - a context, 
            - the conversation up to this point, and 
        - an utterance, 
            - a response to the context. 
    - A positive label means that 
        - an utterance was an actual response to a context, and 
    - a negative label means that 
        - the utterance wasn’t an actual response; 
        - it was picked randomly from somewhere in the corpus. 

<a href="./notebooks/Data%20Exploration.ipynb">Here is some sample data</a>

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/Screen-Shot-2016-04-20-at-12.29.42-PM.png" width=600 />

#### preprocessing

Note that the dataset generation script has already done a bunch of preprocessing for us
* tokenized - http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize
* stemmed - http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.snowball
* lemmatized - http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.wordnet
* using the NLTK tool - http://www.nltk.org/

#### test / validation set

* The format of these is different from that of the training data. 
* Each record in the test/validation set consists of 
    - a context, 
    - a ground truth utterance (the real response) and 
    - 9 incorrect utterances called 
        - distractors. 
* The goal of the model is to assign 
    - the highest score to the true utterance, and 
    - lower scores to wrong utterances.

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/Screen-Shot-2016-04-20-at-12.43.09-PM.png" width=600 />

#### recall@k

* There are various ways to evaluate how well our model does. 
* A commonly used metric is recall@k. 
* Recall@k means that 
    - we let the model pick 
        - the k best responses out of 
        - the 10 possible responses 
            - (1 true and 9 distractors). 
    - If the correct one is among the picked ones we mark that test example as correct. 
    - So, a larger k 
        - means that the task becomes easier. 
    - If we set k=10 
        - we get a recall of 100% 
            - because we only have 10 responses to pick from. 
    - If we set k=1 
        - the model has only one chance to pick the right response.

## BASELINES

In [None]:
# We’ll use the following function to evaluate our recall@k metric:
def evaluate_recall(y, y_test, k=1):
    num_examples = float(len(y))
    num_correct = 0
    for predictions, label in zip(y, y_test):
        if label in predictions[:k]:
            num_correct += 1
    return num_correct/num_examples

# Here, y is a list of our predictions 
# sorted by score in descending order,
# and y_test is the actual label. 
# For example, 
# a y of [0,3,1,2,5,6,4,7,8,9] 
# would mean that the utterance number 0 got the highest score, 
# and utterance 9 got the lowest score.

# Remember that we have 10 utterances for each test example, 
# and the first one (index 0) is always the correct one 
# because the utterance column comes 
# before the distractor columns in our data.

#### baseline (Random Predictor)

Intuitively, a completely random predictor should get 
* a score of 10% for recall@1, 
* a score of 20% for recall@2, and so on. 

In [None]:
import numpy as np

# Random Predictor
def predict_random(context, utterances):
    return np.random.choice(len(utterances), 10, replace=False)

# Evaluate Random predictor
y_random = [predict_random(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
y_test = np.zeros(len(y_random))
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y_random, y_test, n)))
    
#Recall @ (1, 10): 0.0937632
#Recall @ (2, 10): 0.194503
#Recall @ (5, 10): 0.49297
#Recall @ (10, 10): 1

#### baseline (TF-IDF Predictor)

Another baseline that was discussed in the original paper is a tf-idf predictor.
* tf-idf - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
* scikit-learn - http://scikit-learn.org/

In [None]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class TFIDFPredictor:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
 
    def train(self, data):
        self.vectorizer.fit(np.append(data.Context.values,data.Utterance.values))
 
    def predict(self, context, utterances):
        # Convert context and utterances into tfidf vector
        vector_context = self.vectorizer.transform([context])
        vector_doc = self.vectorizer.transform(utterances)
        # The dot product measures the similarity of the resulting vectors
        result = np.dot(vector_doc, vector_context.T).todense()
        result = np.asarray(result).flatten()
        # Sort by top results and return the indices in descending order
        return np.argsort(result, axis=0)[::-1]

# Evaluate TFIDF predictor
pred = TFIDFPredictor()
pred.train(train_df)
y = [pred.predict(test_df.Context[x], test_df.iloc[x,1:].values) for x in range(len(test_df))]
for n in [1, 2, 5, 10]:
    print("Recall @ ({}, 10): {:g}".format(n, evaluate_recall(y, y_test, n)))


#Recall @ (1, 10): 0.495032
#Recall @ (2, 10): 0.596882
#Recall @ (5, 10): 0.766121
#Recall @ (10, 10): 1
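
For intuition, the same ranking idea can be sketched without scikit-learn. This is a simplified tf-idf (raw counts times `log(N/df) + 1`, no smoothing or normalization), so it will not reproduce `TfidfVectorizer` scores exactly. The toy corpus is made up:

```python
import math
from collections import Counter

def fit_idf(documents):
    """Inverse document frequency per word: log(N / df) + 1."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))
    n_docs = len(documents)
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse tf-idf vector as a dict; words unseen during fitting are dropped."""
    counts = Counter(text.split())
    return {word: count * idf[word] for word, count in counts.items() if word in idf}

def dot(a, b):
    return sum(value * b.get(word, 0.0) for word, value in a.items())

def rank_utterances(context, utterances, idf):
    """Indices of utterances sorted by similarity to the context, best first."""
    ctx = tfidf_vector(context, idf)
    scores = [dot(ctx, tfidf_vector(u, idf)) for u in utterances]
    return sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)

corpus = ["how do i install java", "use apt-get to install java",
          "my screen is black", "try a different video driver"]
idf = fit_idf(corpus)
ranking = rank_utterances("how do i install java",
                          ["use apt-get to install java",
                           "try a different video driver",
                           "my screen is black"], idf)
print(ranking)
```

Only the first utterance shares words with the context, so it is ranked first.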

## DUAL ENCODER LSTM

* seq2seq model - https://www.tensorflow.org/versions/r0.9/tutorials/seq2seq/index.html
* Improved Deep Learning Baselines for Ubuntu Corpus Dialogs - http://arxiv.org/abs/1510.03753
* The Dual Encoder LSTM 
    - The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems - http://arxiv.org/abs/1506.08909
* library
    - numpy - http://www.numpy.org/
    - pandas - http://pandas.pydata.org/
    - Tensorflow - http://www.tensorflow.org/
    - TF Learn - https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/learn/python/learn

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/Screen-Shot-2016-04-21-at-10.51.18-AM.png" width=600 />

#### steps

It roughly works as follows:
1. Both the context and the response text 
    - are split by words, 
    - and each word is embedded into a vector. 
        - The word embeddings
            - are initialized with Stanford’s GloVe vectors and 
            - are fine-tuned during training 
2. Both the embedded context and response 
    - are fed into the same Recurrent Neural Network word-by-word. 
    - The RNN generates a vector representation, 
        - c and r in the picture
    - vector size : 256 dimensions.
3. We multiply c with a matrix M to “predict” a response r'. 
    - If c is a 256-dimensional vector, then 
    -  M is a 256×256 dimensional matrix, and 
    - the result is another 256-dimensional vector, 
        - which we can interpret as a generated response. 
    - The matrix M is learned during training.
4. We measure the similarity of 
    - the predicted response r' and the actual response r 
        - by taking the dot product of these two vectors. 
        - A large dot product means 
            - the vectors are similar and that 
            - the response should receive a high score. 
    - We then apply a sigmoid function 
        - to convert that score into a probability. 

#### cost function

* To train the network, we also need a loss (cost) function. 
* We’ll use the binary cross-entropy loss
* True label for a context-response pair y. 
    - This can be either 
        - 1 (actual response) or 
        - 0 (incorrect response)
* predicted probability y' (from step 4)
* Then, the cross entropy loss is calculated as 
    - L = −y * ln(y') − (1 − y) * ln(1 − y')
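
A small worked example of the binary cross-entropy for a single context-response pair, L = −y ln(y') − (1 − y) ln(1 − y'). The `eps` clipping is an added safeguard against log(0), not something from the original post:

```python
import math

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """Loss for one example: y is the true label (0 or 1), y_pred the predicted probability."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # keep the logs finite
    return -y * math.log(y_pred) - (1 - y) * math.log(1.0 - y_pred)

# Confident and right: small loss. Confident and wrong: large loss.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
print(binary_cross_entropy(0, 0.1))
```

For y = 1 only the first term is active, and for y = 0 only the second, so the loss always penalizes the probability assigned to the wrong class.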

## DATA PREPROCESSING

* dataset - https://github.com/rkadlec/ubuntu-ranking-dataset-creator
* Tensorflow’s proprietary <font color="orange">Example</font> format - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto
    - There’s also tf.SequenceExample but it doesn’t seem to be supported by tf.learn yet

#### vocabulary
* As part of the preprocessing we also create a vocabulary. 
* This means we map each word to an integer number,
    - e.g. “cat” may become 2631. 
* The TFRecord files we will generate store these integer numbers instead of the word strings.
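
A minimal sketch of such a word-to-id mapping. Reserving id 0 for unknown words is an assumption made here for illustration, and the helper names are hypothetical:

```python
def build_vocabulary(sentences):
    """Assign each distinct word an integer id; id 0 is reserved for unknown words."""
    vocab = {"<UNK>": 0}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace each word by its id, falling back to 0 for unseen words."""
    return [vocab.get(word, 0) for word in sentence.split()]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(corpus)
ids = encode("the bird sat", vocab)
print(ids)  # "bird" is not in the vocabulary, so it maps to 0
```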

Each <font color="orange">Example</font> contains the following fields:
* context
    -  A sequence of word ids representing the context text, 
    - e.g. [231, 2190, 737, 0, 912]
* context_len
    - The length of the context, 
    - e.g. 5 for the above example
* utterance 
    - A sequence of word ids representing the utterance (response)
* utterance_len
    - The length of the utterance
* label 
    - Only in the training data. 
    - 0 or 1.
* distractor_[N]
    - Only in the test/validation data. 
    - N ranges from 0 to 8. 
    - A sequence of word ids representing the distractor utterance.
* distractor_[N]_len
    - Only in the test/validation data. 
    - N ranges from 0 to 8. 
    - The length of the distractor utterance
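
The field layout of a test/validation record can be pictured with a plain dictionary. This is only a stand-in for the protobuf `Example`, and `make_test_record` is a hypothetical helper, not part of prepare_data.py:

```python
def make_test_record(context_ids, utterance_ids, distractor_ids_list):
    """Build a dict with the same field names as a test/validation Example."""
    record = {
        "context": context_ids,
        "context_len": len(context_ids),
        "utterance": utterance_ids,
        "utterance_len": len(utterance_ids),
    }
    for n, distractor in enumerate(distractor_ids_list):
        record["distractor_{}".format(n)] = distractor
        record["distractor_{}_len".format(n)] = len(distractor)
    return record

# Made-up word ids; 9 distractors give N = 0..8.
distractors = [[n + 1, n + 2] for n in range(9)]
record = make_test_record([231, 2190, 737, 0, 912], [12, 7], distractors)
print(record["context_len"], sorted(k for k in record if k.endswith("_len")))
```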

#### python script

* <font color="orange">prepare_data.py</font>
    - The preprocessing is done by the prepare_data.py Python script
    - which generates 3 files:
        - train.tfrecords
        - validation.tfrecords
        - test.tfrecords  

## CREATING AN INPUT FUNCTION

<font color="orange">udc_inputs.py</font>

In [None]:
def input_fn():
  # TODO Load and preprocess data here
  return batched_features, labels

In [None]:
def create_input_fn(mode, input_files, batch_size, num_epochs=None):
  def input_fn():
    # TODO Load and preprocess data here
    return batched_features, labels
  return input_fn

The complete code can be found in <font color="orange">udc_inputs.py</font>. On a high level, the function does the following : 
1. Create a feature definition that describes the fields in our <font color="orange">Example</font> file
2. Read records from the <font color="blue">input_files</font> with <font color="blue">tf.TFRecordReader</font>
3. Parse the records according to the feature definition
4. Extract the training labels
5. Batch multiple examples and training labels
6. Return the batched examples and training labels
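
The closure pattern and steps 4 to 6 can be sketched in plain Python. This version assumes an in-memory list of dicts instead of TFRecord files, and `make_toy_input_fn` is a hypothetical name, so it illustrates the shape of the function rather than the code in udc_inputs.py:

```python
def make_toy_input_fn(records, batch_size):
    """Return a zero-argument function that yields (batched_features, labels) pairs."""
    def input_fn():
        batches = []
        for start in range(0, len(records), batch_size):
            chunk = records[start:start + batch_size]
            # Step 4: extract the labels; step 5: batch features and labels.
            features = {
                "context": [rec["context"] for rec in chunk],
                "utterance": [rec["utterance"] for rec in chunk],
            }
            labels = [rec["label"] for rec in chunk]
            batches.append((features, labels))
        # Step 6: return the batched examples and labels.
        return batches
    return input_fn

# Made-up records with alternating labels.
records = [{"context": [1, 2], "utterance": [3], "label": (i + 1) % 2} for i in range(5)]
input_fn = make_toy_input_fn(records, batch_size=2)
batches = input_fn()
print(len(batches), batches[0][1])
```

Wrapping `input_fn` in a factory lets the caller fix the parameters once and hand the estimator a function that takes no arguments.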

## DEFINING EVALUATION METRICS

recall@k metric

In [None]:
import functools
import tensorflow as tf

def create_evaluation_metrics():
  eval_metrics = {}
  for k in [1, 2, 5, 10]:
    eval_metrics["recall_at_%d" % k] = functools.partial(
        tf.contrib.metrics.streaming_sparse_recall_at_k,
        k=k)
  return eval_metrics

# functools.partial
#  - Above, we use functools.partial to convert a function 
#    that takes 3 arguments to one that only takes 2 arguments.

# streaming_sparse_recall_at_k
#  - Streaming just means that 
#    the metric is accumulated over multiple batches, 
#    and sparse refers to the format of our labels.
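
The effect of `functools.partial` is easy to see on a plain Python function. Here `recall_at_k` is a hypothetical three-argument stand-in for the TensorFlow metric:

```python
import functools

def recall_at_k(predictions, labels, k):
    """Fraction of examples whose true label is among the top-k predictions."""
    hits = sum(1 for ranked, label in zip(predictions, labels) if label in ranked[:k])
    return hits / len(labels)

# Fix k=2 so the result takes only (predictions, labels).
recall_at_2 = functools.partial(recall_at_k, k=2)

predictions = [[3, 0, 1], [0, 2, 1], [2, 1, 0]]
labels = [0, 0, 0]
print(recall_at_2(predictions, labels))
```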

#### quiz

e.g. [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]
* recall@1 ?
* recall@2 ?
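
One way to check the quiz, assuming (as in the test data) that the correct utterance sits at index 0:

```python
scores = [0.34, 0.11, 0.22, 0.45, 0.01, 0.02, 0.03, 0.08, 0.33, 0.11]

# Indices sorted by score, highest first.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def hit_at_k(ranking, true_index, k):
    """True if the correct utterance is among the k best-scored ones."""
    return true_index in ranking[:k]

print(ranking[:3])
print(hit_at_k(ranking, 0, 1), hit_at_k(ranking, 0, 2))
```

Index 3 has the highest score (0.45) and index 0 the second highest (0.34), so this example counts as a miss for recall@1 and as a hit for recall@2.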

## BOILERPLATE TRAINING CODE

<font color="orange">udc_train.py</font>

In [None]:
# Let’s assume we have a model function 
# model_fn 
# that takes as inputs our batched features, 
# labels and mode (train or evaluation) 
# and returns the predictions.

estimator = tf.contrib.learn.Estimator(
    model_fn=model_fn,
    model_dir=MODEL_DIR,
    config=tf.contrib.learn.RunConfig())
 
input_fn_train = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.TRAIN,
    input_files=[TRAIN_FILE],
    batch_size=hparams.batch_size)
 
input_fn_eval = udc_inputs.create_input_fn(
    mode=tf.contrib.learn.ModeKeys.EVAL,
    input_files=[VALIDATION_FILE],
    batch_size=hparams.eval_batch_size,
    num_epochs=1)
 
eval_metrics = udc_metrics.create_evaluation_metrics()
 
# We need to subclass this manually for now. The next TF version will
# support ValidationMonitors with metrics built-in.
# It's already on the master branch.
class EvaluationMonitor(tf.contrib.learn.monitors.EveryN):
  def every_n_step_end(self, step, outputs):
    self._estimator.evaluate(
      input_fn=input_fn_eval,
      metrics=eval_metrics,
      steps=None)
 
eval_monitor = EvaluationMonitor(every_n_steps=FLAGS.eval_every)
estimator.fit(input_fn=input_fn_train, steps=None, monitors=[eval_monitor])


<font color="orange">hparams.py</font>

<font color="blue">hparams</font> is a custom object we create in hparams.py that holds the hyperparameters of our model, i.e. the knobs we can tweak. This hparams object is given to the model when we instantiate it.
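
A rough sketch of what such a hyperparameter object could look like, using a namedtuple. The field names `batch_size`, `eval_batch_size`, `rnn_dim` and `embedding_size` appear elsewhere in these notes. `learning_rate` and all default values except `rnn_dim=256` are assumptions for illustration:

```python
from collections import namedtuple

HParams = namedtuple(
    "HParams",
    ["batch_size", "eval_batch_size", "rnn_dim", "embedding_size", "learning_rate"])

def create_hparams(**overrides):
    """Build an immutable hyperparameter object, optionally overriding defaults."""
    # Illustrative defaults, not the values used in the original code.
    defaults = dict(batch_size=128, eval_batch_size=16, rnn_dim=256,
                    embedding_size=100, learning_rate=0.001)
    defaults.update(overrides)
    return HParams(**defaults)

hparams = create_hparams(embedding_size=128)
print(hparams.rnn_dim, hparams.embedding_size)
```

Because the object is immutable, the training and test scripts have to be called with the same overrides to build the same model.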

## CREATING THE MODEL

<font color="orange"> dual_encoder.py</font>

In [None]:
def dual_encoder_model(
    hparams,
    mode,
    context,
    context_len,
    utterance,
    utterance_len,
    targets):
 
  # Initialize embeddings randomly or with pre-trained vectors if available
  embeddings_W = get_embeddings(hparams)
 
  # Embed the context and the utterance
  context_embedded = tf.nn.embedding_lookup(
      embeddings_W, context, name="embed_context")
  utterance_embedded = tf.nn.embedding_lookup(
      embeddings_W, utterance, name="embed_utterance")
 
 
  # Build the RNN
  with tf.variable_scope("rnn") as vs:
    # We use an LSTM Cell
    cell = tf.nn.rnn_cell.LSTMCell(
        hparams.rnn_dim,
        forget_bias=2.0,
        use_peepholes=True,
        state_is_tuple=True)
 
    # Run the utterance and context through the RNN
    rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
        cell,
        tf.concat(0, [context_embedded, utterance_embedded]),
        sequence_length=tf.concat(0, [context_len, utterance_len]),
        dtype=tf.float32)
    encoding_context, encoding_utterance = tf.split(0, 2, rnn_states.h)
 
  with tf.variable_scope("prediction") as vs:
    M = tf.get_variable("M",
      shape=[hparams.rnn_dim, hparams.rnn_dim],
      initializer=tf.truncated_normal_initializer())
 
    # "Predict" a  response: c * M
    generated_response = tf.matmul(encoding_context, M)
    generated_response = tf.expand_dims(generated_response, 2)
    encoding_utterance = tf.expand_dims(encoding_utterance, 2)
 
    # Dot product between generated response and actual response
    # (c * M) * r
    logits = tf.batch_matmul(generated_response, encoding_utterance, True)
    logits = tf.squeeze(logits, [2])
 
    # Apply sigmoid to convert logits to probabilities
    probs = tf.sigmoid(logits)
 
    # Calculate the binary cross-entropy loss
    losses = tf.nn.sigmoid_cross_entropy_with_logits(logits, tf.to_float(targets))
 
  # Mean loss across the batch of examples
  mean_loss = tf.reduce_mean(losses, name="mean_loss")
  return probs, mean_loss

Given this, we can now instantiate our model function in the main routine in <font color="orange">udc_train.py</font> that we defined earlier.

## EVALUATING THE MODEL

e.g. python udc_test.py --model_dir=~/github/chatbot-retrieval/runs/1467389151

Note that you must call udc_test.py with the same parameters you used during training. So, if you trained with --embedding_size=128 you need to call the test script with the same.

#### test results

After training for about 20,000 steps (around an hour on a fast GPU) our model gets the following results on the test set:

* While recall@1 is close to our TFIDF model, recall@2 and recall@5 are significantly better, suggesting that our neural network assigns higher scores to the correct answers. 
* The original paper reported 0.55, 0.72 and 0.92 for recall@1, recall@2, and recall@5 respectively, 
    - but I haven’t been able to reproduce scores quite as high.
    - Perhaps additional data preprocessing or hyperparameter optimization may bump scores up a bit more.

## MAKING PREDICTIONS

<font color="orange"> udc_predict.py</font>

e.g. python udc_predict.py --model_dir=./runs/1467576365/

You could imagine feeding in 100 potential responses to a context and then picking the one with the highest score.
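
Picking the best of many candidates is then just an argmax over the scores. In this sketch `toy_score_fn` is a hypothetical placeholder that looks up fixed scores. In practice the score would come from the trained model:

```python
def pick_best_response(context, candidates, score_fn):
    """Return (best_candidate, best_score) under the given scoring function."""
    scored = [(candidate, score_fn(context, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])

# Placeholder scores standing in for the model's output.
toy_scores = {"Response 1": 0.634551, "Response 2": 0.64461}

def toy_score_fn(context, candidate):
    return toy_scores[candidate]

best, score = pick_best_response("Example context", list(toy_scores), toy_score_fn)
print(best, score)
```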

## CONCLUSION

There is still a lot of room for improvement, however. One can imagine that other neural networks do better on this task than a dual LSTM encoder. There is also a lot of room for hyperparameter optimization, or improvements to the preprocessing step.

# Code

In [1]:
!ls

README.md  requirements.txt  udc_hparams.py  udc_model.py    udc_train.py
models	   run_catbot.ipynb  udc_inputs.py   udc_predict.py
notebooks  scripts	     udc_metrics.py  udc_test.py


# Data

Download the train/dev/test data 
* here - https://drive.google.com/open?id=0B_bZck-ksdkpVEtVc1R6Y01HMWM

and extract the archive into ./data.

In [2]:
!ls

README.md  notebooks	     scripts	     udc_inputs.py   udc_predict.py
data	   requirements.txt  udc.tar.gz      udc_metrics.py  udc_test.py
models	   run_catbot.ipynb  udc_hparams.py  udc_model.py    udc_train.py


# Training

In [4]:
# python udc_train.py

In [7]:
!python3 udc_train.py

INFO:tensorflow:No glove/vocab path specificed, starting with random embeddings.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Training steps [0,inf)
INFO:tensorflow:No glove/vocab path specificed, starting with random embeddings.
INFO:tensorflow:Restored model from /root/work/chatbot-retrieval/runs/1468168955/model.ckpt-0-?????-of-00001
INFO:tensorflow:Eval steps [0,inf) for training step 0.
INFO:tensorflow:Results after 10 steps (7.210 sec/batch): recall_at_5 = 0.46875, recall_at_1 = 0.13125, loss = 6.2582, recall_at_10 = 1.0, recall_at_2 = 0.175.
INFO:tensorflow:global_step/sec: 0.0082966
INFO:tensorflow:Results after 20 steps (6.092 sec/batch): recall_at_5 = 0.5125, recall_at_1 = 0.13125, loss = 4.45727, recall_at_10 = 1.0, recall_at_2 = 0.196875.
INFO:tensorflow:Results after 30 steps (5.952 sec/batch): recall_at_5 = 0.497916666667, recall_at_1 = 0.114583333333, loss = 6.53031, recall_at_10 = 1.0, recall_at_2 = 0.197916666667.
INFO:tensorflow:Results after 40 steps (5.805 sec

# Evaluation

In [5]:
# python udc_test.py --model_dir=...

In [10]:
!ls runs/1468168955

checkpoint				     model.ckpt-5795.meta
events.out.tfevents.1468168956.07a2d3d953a6  model.ckpt-5874-00000-of-00001
events.out.tfevents.1468200751.07a2d3d953a6  model.ckpt-5874.meta
graph.pbtxt				     model.ckpt-5957-00000-of-00001
model.ckpt-5713-00000-of-00001		     model.ckpt-5957.meta
model.ckpt-5713.meta			     model.ckpt-6001-00000-of-00001
model.ckpt-5795-00000-of-00001		     model.ckpt-6001.meta


In [11]:
!python3 udc_test.py --model_dir=runs/1468168955

INFO:tensorflow:No glove/vocab path specificed, starting with random embeddings.
INFO:tensorflow:Restored model from /root/work/chatbot-retrieval/runs/1468168955/model.ckpt-6001-?????-of-00001
INFO:tensorflow:Eval steps [0,inf) for training step 6001.
INFO:tensorflow:Results after 10 steps (5.981 sec/batch): recall_at_2 = 0.51875, recall_at_5 = 0.84375, loss = 0.572061, recall_at_1 = 0.34375, recall_at_10 = 1.0.
INFO:tensorflow:Results after 20 steps (5.865 sec/batch): recall_at_2 = 0.521875, recall_at_5 = 0.853125, loss = 0.52984, recall_at_1 = 0.346875, recall_at_10 = 1.0.
INFO:tensorflow:Results after 30 steps (5.801 sec/batch): recall_at_2 = 0.5375, recall_at_5 = 0.866666666667, loss = 0.636007, recall_at_1 = 0.3625, recall_at_10 = 1.0.
INFO:tensorflow:Results after 40 steps (5.775 sec/batch): recall_at_2 = 0.540625, recall_at_5 = 0.853125, loss = 0.548822, recall_at_1 = 0.3671875, recall_at_10 = 1.0.
INFO:tensorflow:Results after 50 steps (5.924 sec/batch): recall_at_2 = 0.535, re

# Prediction

In [6]:
# python udc_predict.py --model_dir=...

In [12]:
!python3 udc_predict.py --model_dir=runs/1468168955

Context: Example context
Response 1: 0.634551
Response 2: 0.64461


# References

* [1] DEEP LEARNING FOR CHATBOTS, PART 1 – INTRODUCTION - http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/
* [2] DEEP LEARNING FOR CHATBOTS, PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW - http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/
* [3] Chapter 8. Evaluation (Statistical Machine Translation) - http://www.statmt.org/book/slides/08-evaluation.pdf
* [4] BLEU (wikipedia) - https://en.wikipedia.org/wiki/BLEU
* [5] BLEU ("Show and Tell: A Neural Image Caption Generator (CVPR 2015) slide") - https://docs.com/hana-lee/7849/show-and-tell-a-neural-image-caption-generator
* [6] BLEU (5. blue slide) - http://www.slideshare.net/hiroshimatsumoto750/5-bleu
* [7] Source code for nltk.align.bleu - http://www.nltk.org/_modules/nltk/align/bleu.html
* [8] code - https://github.com/dennybritz/chatbot-retrieval/
* [9] data - https://drive.google.com/open?id=0B_bZck-ksdkpVEtVc1R6Y01HMWM