In [1]:
from IPython.display import Image

# Smart Reply: Automated Response Suggestion for Email

**My Notes:**

* Most of the text in this notebook is taken directly from the [paper](https://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf) with the exception of this section.
* Overview of Google Smart Reply architecture, the challenges faced and innovations used to overcome these challenges.
* One key innovation is the development of a response set using semi-supervised learning. This vastly improves the quality and utility of the suggested responses. 
    + Step 1: Canonicalize email responses. Each sentence is parsed using a dependency parser and its syntactic structure is used to generate a canonicalized representation.
        + Responses are limited to 10 or fewer tokens
    + Step 2: Semi-supervised learning with scalable graph algorithms, specifically the EXPANDER framework, is used to construct semantic clusters.
    + Step 3: The top-scoring members of each learned cluster are validated by human raters. 
* The triggering model improves the scalability of the system by removing 90% of emails where a response suggestion would not be useful. 
    

## Abstract

This paper gives an overview of the Smart Reply system developed by Google, which allows users to respond to an email with a single tap. The system presents the user with three options to choose from and is responsible for 10% of all email replies on mobile. It is designed to process hundreds of millions of emails a day. 

## System Architecture

Below is an image from the paper showing the system architecture. 

**Overview:**

* Based on a sequence-to-sequence learning architecture which uses LSTMs to predict sequences of text. 
* Consistent with the approach of the [Neural Conversational Model](https://arxiv.org/pdf/1506.05869.pdf)
* The model input sequence is an incoming message and the output is a distribution over the space of possible replies.

**Main Challenges:**

* **Response Quality**: How to ensure that the individual response options are always high quality in language and content
* **Utility**: How to select multiple options to show a user to maximize the likelihood that one is chosen
* **Scalability**: How to efficiently process millions of messages per day while remaining within the latency requirements of an email delivery system
* **Privacy**: How to develop the system without ever inspecting the data except for aggregate statistics. 

**Smart Reply Components:**

* **Response Selection:** At the core of the system an LSTM neural network processes an incoming message and uses it to predict the most likely responses. Scalability is improved by only finding the approximate best responses. 
* **Response Set Generation:** To deliver high response quality, responses are only selected from a response space which is generated offline using a semi-supervised graph learning approach.
* **Diversity:** After finding a set of most likely responses from the LSTM, a small set is chosen to show to the user that maximizes the utility of one being chosen. 
* **Triggering Model:** A feed-forward neural network decides whether or not to suggest responses. This further improves utility by not showing suggestions when they are unlikely to be used. This is broken out into a separate architecture for scalability allowing them to use a computationally cheaper architecture than that used for the scoring model. 

![Smart Reply Architecture](./img/google_smart_reply.png)

## Selecting Responses

The fundamental task of smart reply is to find the most likely response given an original message. In other words, given original message $\textbf{o}$ and the set of all possible responses $R$, we would like to find: 

$$
\textbf{r}^* = \underset{\textbf{r} \in R}{\mathrm{argmax}} \; P(\textbf{r}|\textbf{o})
$$

To find this response, they construct a model that can score responses and then find the highest scoring response. 

### LSTM Model

Scoring one sequence of tokens **r**, conditional on another sequence of tokens **o**, is a natural fit for sequence-to-sequence learning. The model itself is an LSTM, the input is the tokens of the original message $\{o_1, ..., o_n\}$, and the output is the conditional probability distribution of the sequence of response tokens given the input:

$$
P(r_1, ..., r_m | o_1, ... o_n)
$$

This distribution is factorized as 

$$
P(r_1, ..., r_m | o_1, ..., o_n) = \prod_{i=1}^{m} P(r_i | o_1, ..., o_n, r_1, ..., r_{i-1})
$$

An EOS token is included with the original input message so that the LSTM's hidden state encodes a vector representation of the whole message. Given the hidden state, a softmax output is computed and interpreted as $P(r_1|o_1, ..., o_n)$, the probability distribution of the first response token. 
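
As a rough illustration of the factorization above, the sketch below scores a response by summing per-token log probabilities. The function `next_token_probs` is a hypothetical stand-in for the LSTM softmax (it ignores the original message), and the probabilities in its table are made up for the example; they are not taken from the paper.

```python
import math

# Hypothetical stand-in for the LSTM softmax: maps a response prefix to a
# distribution over the next token. A real model would also condition on the
# original message; the numbers here are invented for illustration.
def next_token_probs(original, prefix):
    table = {
        (): {"sounds": 0.6, "no": 0.4},
        ("sounds",): {"good": 0.9, "<eos>": 0.1},
        ("sounds", "good"): {"<eos>": 1.0},
        ("no",): {"<eos>": 1.0},
    }
    return table[tuple(prefix)]

def log_prob(original, response):
    """log P(r_1..r_m | o) = sum_i log P(r_i | o, r_1..r_{i-1})."""
    total = 0.0
    for i, tok in enumerate(response):
        total += math.log(next_token_probs(original, response[:i])[tok])
    return total

score = log_prob(["lunch", "tomorrow", "?"], ["sounds", "good", "<eos>"])
print(math.exp(score))  # product of the three conditional probabilities
```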

#### Training

The training objective is to maximize the log probability of observed responses, given their respective originals:

$$
\sum_{(\textbf{o}, \textbf{r})} \log P(r_1, ..., r_m | o_1, ..., o_n)
$$

The model is optimized with Adagrad over 10 epochs. In addition to the standard LSTM formulation, a [recurrent projection layer](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf) significantly improved both the quality and the time to converge of the proposed model.

Gradient clipping (with a value of 1) was also essential in ensuring stable results. 
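
A minimal sketch of what one clipped Adagrad update might look like. The notes do not say whether the clipping value of 1 applies to the gradient norm or to each component; the code below assumes clipping by global L2 norm. The learning rate, `eps`, and the function names are illustrative choices, not values from the paper.

```python
import math

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient so its L2 norm is at most max_norm (assumed variant)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """In-place Adagrad update: per-parameter step scaled by accumulated squared gradients."""
    for i, g in enumerate(grad):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

params = [1.0, 1.0]
accum = [0.0, 0.0]
clipped = clip_by_norm([3.0, 4.0])   # norm 5 is rescaled to norm 1
adagrad_step(params, clipped, accum)
```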

#### Inference

At inference time, the original message is fed in and the output of the softmaxes is used to get a probability distribution at each time step. These probability distributions can be used in different ways:

1. To draw a random sample from the response distribution $P(r_1, ..., r_m | o_1, ..., o_n)$. This can be done by sampling one token at each timestep and feeding it back into the model. 
2. To approximate the most likely response given the original message. This can be done greedily by taking the most likely token at each time step and feeding it back in. A less greedy strategy is to use a beam search, i.e., take the top _b_ tokens and feed them in, then retain the _b_ best response prefixes and repeat. 
3. To determine the likelihood of a specific response candidate. This can be done by feeding in each token of the candidate and using the softmax output to get the likelihood of the next candidate token. 
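
The three uses listed above can be sketched against a toy next-token table. `TOY_MODEL` is a hypothetical stand-in for the LSTM softmax with invented probabilities, and it ignores the original message for brevity.

```python
import random

# Hypothetical next-token distributions keyed by response prefix.
TOY_MODEL = {
    (): {"sounds": 0.6, "no": 0.4},
    ("sounds",): {"good": 0.9, "<eos>": 0.1},
    ("sounds", "good"): {"<eos>": 1.0},
    ("no",): {"<eos>": 1.0},
}

def next_token_probs(prefix):
    return TOY_MODEL[tuple(prefix)]

def sample_response(rng, max_len=10):
    """Use 1: sample one token per step and feed it back in."""
    prefix = []
    while len(prefix) < max_len:
        probs = next_token_probs(prefix)
        tokens, weights = zip(*probs.items())
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        if tok == "<eos>":
            break
        prefix.append(tok)
    return prefix

def greedy_response(max_len=10):
    """Use 2: take the most likely token at each step."""
    prefix = []
    while len(prefix) < max_len:
        probs = next_token_probs(prefix)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        prefix.append(tok)
    return prefix

def candidate_likelihood(response):
    """Use 3: multiply the probabilities of each candidate token, including end-of-sequence."""
    prob, prefix = 1.0, []
    for tok in list(response) + ["<eos>"]:
        prob *= next_token_probs(prefix).get(tok, 0.0)
        if prob == 0.0:
            return 0.0
        prefix.append(tok)
    return prob

rng = random.Random(0)
sampled = sample_response(rng)
greedy = greedy_response()
```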

## Challenges

### Response Quality

Suggested responses must always be high quality in style, tone, diction, and content. Because the model is trained on a corpus of real messages, we need to account for the possibility that the most probable response is not necessarily a high quality response. Even a response that occurs frequently in the corpus may not be appropriate to surface back to users. For example, it could contain poor grammar, spelling, or mechanics. While restricting the vocabulary might address simple cases such as profanity and spelling errors, it would not be sufficient to capture the wide variability with which politically incorrect statements can be made. Instead, a semi-supervised learning approach is used to construct a target response space $R$ comprising only high quality responses. The model is then used to select the best response in $R$, rather than the best response from any sequence of words. 

### Utility

To improve specificity of responses, we apply some light normalization that penalizes responses which are applicable to a broad range of incoming messages. Utility is further improved by first passing incoming messages through a triggering model to determine whether smart reply suggestions should be shown at all. 

### Scalability

The model cannot introduce latency to the process of email delivery, so scalability is critical. Exhaustively scoring every response candidate $r \in R$ would require $O(|R|l)$ LSTM steps, where $l$ is the length of the longest response. $R$ is also expected to grow over time and become very large given the tremendous diversity with which people communicate. In a uniform sample of 10 million short responses (10 tokens or fewer), more than 40% occur only once. Therefore, rather than perform an exhaustive scoring of every candidate $r \in R$, we would like to efficiently search for the best responses such that complexity is not a function of $|R|$. 

First, the elements of $R$ are organized into a trie. Then a left-to-right beam search is conducted, retaining only hypotheses which appear in the trie. This search process has complexity $O(bl)$ for beam size $b$ and maximum response length $l$. Both $b$ and $l$ are typically in the range of 10-30, so this method dramatically reduces the time to find the top responses and is a critical element of making this system deployable. 
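
The trie-constrained beam search can be sketched as follows. This is a simplified illustration, not the production implementation: `toy_logprobs` is a hypothetical stand-in for the LSTM with invented probabilities, and the trie is a plain nested dictionary.

```python
import math

def build_trie(responses):
    """Nested-dict trie over token sequences; "<eos>" marks a complete response."""
    root = {}
    for resp in responses:
        node = root
        for tok in resp:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return root

def trie_beam_search(next_token_logprobs, trie, beam_size=2, max_len=10):
    beams = [(0.0, [], trie)]          # (log prob, prefix, trie node)
    finished = []
    for _ in range(max_len + 1):
        candidates = []
        for score, prefix, node in beams:
            logprobs = next_token_logprobs(prefix)
            for tok, child in node.items():   # only extensions present in the trie
                if tok not in logprobs:
                    continue
                new_score = score + logprobs[tok]
                if tok == "<eos>":
                    finished.append((new_score, prefix))
                else:
                    candidates.append((new_score, prefix + [tok], child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]        # keep the b best prefixes
    finished.sort(key=lambda f: f[0], reverse=True)
    return finished

def toy_logprobs(prefix):
    # Hypothetical model output; "lol" is likely but not in the response set.
    table = {
        (): {"sounds": 0.5, "yes": 0.3, "lol": 0.2},
        ("sounds",): {"good": 0.8, "great": 0.2},
        ("sounds", "good"): {"<eos>": 1.0},
        ("sounds", "great"): {"<eos>": 1.0},
        ("yes",): {"<eos>": 0.9, "please": 0.1},
    }
    return {t: math.log(p) for t, p in table[tuple(prefix)].items()}

response_set = [["sounds", "good"], ["yes"], ["sounds", "great"]]
results = trie_beam_search(toy_logprobs, build_trie(response_set))
```

Because each step only expands the `beam_size` retained prefixes, the number of model calls depends on the beam size and response length rather than on the size of the response set.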

## Response Set Generation

Two of the core challenges we face when building the end-to-end automated response system are response quality and utility. Response quality comes from suggesting high quality responses that deliver a positive user experience. Utility comes from ensuring that we don't suggest multiple responses that capture the same intent (for example, minor lexical variations such as "Yes, I'll be there." and "I will be there.").

We first need to define a target response space that comprises high quality messages which can be surfaced as suggestions. The goal here is to generate a structured response set that effectively captures various intents conveyed by people in natural language conversations. The target response space should capture variability in both language and intents. The result is used in two ways downstream:

1. Define a response space for scoring and selecting suggestions using the LSTM model previously described. 
2. Promote diversity among chosen suggestions

The response set is constructed using only the most frequent anonymized sentences aggregated from the preprocessed data. This process yields a few million unique sentences. 

### Canonicalizing email responses

The first step is to automatically generate a set of canonical response messages that capture the variability in language. Each sentence is parsed using a dependency parser, and its syntactic structure is used to generate a canonicalized representation. Words or phrases that are modifiers or unattached to head words are ignored. 

### Semantic intent clustering

In the next step, responses are partitioned into semantic clusters where a cluster represents a meaningful response intent. All messages within a cluster share the same semantic meaning but may appear very different. This step helps to automatically digest the entire information present in frequent responses into a coherent set of semantic clusters. If we were to build a semantic intent prediction model for this purpose, we would need access to a large corpus of sentences annotated with their corresponding semantic intents. This is neither practical nor feasible, so instead this task is modeled as a semi-supervised machine learning problem, and scalable graph algorithms are used to automatically learn this information from data and a few human-provided examples.

### Graph Construction

We start with a few manually defined clusters sampled from the top frequent messages (e.g., thanks, i love you, sounds good). A small number of example responses are added as seeds for each cluster. 

A base graph is then constructed with frequent response messages as nodes ($V_R$). For each response message, we further extract a set of lexical features (ngrams and skip-grams of length up to 3) and add these as feature nodes ($V_F$) to the same graph. Edges are created between a pair of nodes ($u,v$) where $u \in V_R$ and $v \in V_F$ if $v$ belongs to the feature set for response $u$. We follow the same process and create nodes for the manually labeled examples $V_L$. Incoming messages could also be treated as responses to another email depending on the context. Such inter-message relations can be modeled within the same framework by adding extra edges between the corresponding message nodes in the graph. 
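
A minimal sketch of the response-feature graph described above. The exact skip-gram definition is not given in these notes, so the code assumes one simple variant: the first and last token of each 3-token window, with the middle token replaced by `_`. The node tags `"R"` and `"F"` and the function names are illustrative.

```python
from collections import defaultdict

def lexical_features(tokens, max_n=3):
    feats = set()
    # Contiguous n-grams of length 1..max_n.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    # Assumed skip-gram variant: outer tokens of each 3-token window.
    for i in range(len(tokens) - 2):
        feats.add(f"{tokens[i]} _ {tokens[i + 2]}")
    return feats

def build_graph(responses):
    """Bipartite adjacency: response nodes ("R", text) <-> feature nodes ("F", feature)."""
    edges = defaultdict(set)
    for resp in responses:
        for feat in lexical_features(resp.lower().split()):
            edges[("R", resp)].add(("F", feat))
            edges[("F", feat)].add(("R", resp))
    return edges

graph = build_graph(["thanks a lot", "thanks so much"])
print(sorted(graph[("F", "thanks")]))
```

Two responses that share a feature node, such as `thanks` here, become connected through it, which is what later lets label information flow between similar responses.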

### Semi-Supervised Learning

The constructed graph captures relationships between similar canonicalized responses via the feature nodes. Next we learn a semantic labeling for all the response nodes by propagating semantic intent information from the manually labeled examples through the graph. We treat this as a semi-supervised learning problem and use the distributed [EXPANDER](http://proceedings.mlr.press/v51/ravi16.pdf) framework for optimization. The learning framework is scalable and naturally suited for semi-supervised graph propagation tasks such as the semantic clustering problem described here. The following objective function is minimized for the response nodes in the graph:

$$
s_i||\hat{C_i} - C_i||^2 + \mu_{pp}||\hat{C_i} - U||^2 + \mu_{np}(\sum_{j \in N_F(i)}w_{ij}||\hat{C_i} - \hat{C_j}||^2 + \sum_{k \in N_R(i)} w_{ik}||\hat{C_i} - \hat{C_k}||^2 )
$$

Where,

* $s_i$ is an indicator function equal to 1 if the node $i$ is a seed and 0 otherwise
* $\hat{C_i}$ is the learned semantic cluster distribution for response node $i$.
* $C_i$ is the true label distribution (i.e., for the manually provided examples)
* $N_F(i)$ and $N_R(i)$ represent the feature and message neighborhood of the node $i$
* $\mu_{np}$ is the predefined penalty for neighboring nodes with divergent label distribution
* $\hat{C_j}$ is the learned label distribution for feature neighbor j
* $w_{ij}$ is the weight of feature $j$ in response $i$
* $\mu_{pp}$ is the penalty for label distribution deviating from the prior, a uniform distribution $U$

The objective function for the feature nodes is similar, except there is no first term, as there are no seed labels for feature nodes:

$$
\mu_{np}\sum_{i \in N_F(j)}w_{ij}||\hat{C_j} - \hat{C_i} ||^2  + \mu_{pp}||\hat{C_j} - U||^2
$$

The objective function is jointly optimized for all nodes in the graph.

The output from EXPANDER is a learned distribution of semantic labels for every node in the graph. We assign the top scoring output label as the semantic intent for the node; labels with low scores are filtered out. 
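
To make the objective concrete, here is a small single-machine sketch of label propagation on a toy graph. It is not the EXPANDER implementation: it assumes a simple synchronous update in which each node's distribution is set to the minimizer of its own objective with its neighbors held fixed, i.e. $\hat{C_i} = (s_i C_i + \mu_{pp} U + \mu_{np}\sum_j w_{ij}\hat{C_j}) / (s_i + \mu_{pp} + \mu_{np}\sum_j w_{ij})$. The graph, weights, and hyperparameter values are invented for illustration.

```python
def propagate(neighbors, seeds, num_labels, mu_np=1.0, mu_pp=0.01, iterations=5):
    uniform = [1.0 / num_labels] * num_labels
    dist = {node: list(seeds.get(node, uniform)) for node in neighbors}
    for _ in range(iterations):
        new_dist = {}
        for node, nbrs in neighbors.items():
            s = 1.0 if node in seeds else 0.0        # seed indicator s_i
            denom = s + mu_pp + mu_np * sum(nbrs.values())
            new = []
            for k in range(num_labels):
                num = mu_pp * uniform[k]
                num += mu_np * sum(w * dist[j][k] for j, w in nbrs.items())
                if node in seeds:
                    num += seeds[node][k]
                new.append(num / denom)
            new_dist[node] = new
        dist = new_dist                               # synchronous update
    return dist

# Toy bipartite graph: response nodes "r:..." and feature nodes "f:...", all weights 1.
neighbors = {
    "r:thanks a lot": {"f:thanks": 1.0},
    "r:thanks so much": {"f:thanks": 1.0},
    "f:thanks": {"r:thanks a lot": 1.0, "r:thanks so much": 1.0},
    "r:sounds good": {"f:sounds": 1.0},
    "r:sounds great": {"f:sounds": 1.0},
    "f:sounds": {"r:sounds good": 1.0, "r:sounds great": 1.0},
}
labels = ["thanks", "agree"]
seeds = {"r:thanks a lot": [1.0, 0.0], "r:sounds good": [0.0, 1.0]}

dist = propagate(neighbors, seeds, num_labels=len(labels))
# Assign each node its top-scoring label.
intent = {n: labels[max(range(len(labels)), key=d.__getitem__)] for n, d in dist.items()}
```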

![Expander Algorithm](./img/google_expander.png)

To discover new clusters which are not covered by the labeled examples, we run the semi-supervised learning algorithm in repeated phases:

1. Run label propagation algorithm for 5 iterations. Then fix cluster assignment and randomly sample 100 new responses from the remaining unlabeled nodes in the graph.
    + The sampled nodes are treated as potential new clusters and labeled with their canonicalized representation. 
2. Rerun label propagation with the new labeled set of clusters and repeat this procedure until convergence (i.e., until no new clusters are discovered and members of a cluster do not change between iterations.)
3. The iterative propagation method allows us to both expand cluster membership as well as discover new clusters where each cluster has an interpretable semantic meaning.

### Cluster Validation

Finally, we extract the top k members for each semantic cluster, sorted by their label scores. The set of (response, cluster label) pairs are then validated by human raters. The raters are provided with a response $R_i$, a corresponding cluster label $C$ (e.g., thanks), as well as a few example responses belonging to the cluster (e.g., "Thanks!", "Thank you") and asked whether $R_i$ belongs to $C$. 

The result is an automatically generated and validated set of high quality response messages labeled with semantic intent. This is subsequently used by the response scoring model to search for approximate best responses to an incoming email and further to enforce diversity among the top responses shown. 

## Suggestion Diversity

As discussed in Section 3, the LSTM first processes an incoming message and then selects the approximate best responses from the target response set created using the method described in section 4. Recall that we follow this by some light normalization to penalize responses that may be too general to be valuable to the user. The effect of this normalization can be seen by comparing columns 1 and 2 of Table 2. For example, the very generic, "Yes!" falls out of the top ten responses. 

If we simply select the top N responses there is a high probability that all of these responses are very similar in meaning. The job of the diversity component is to select a more varied set of suggestions using two strategies: omitting redundant responses and enforcing negative or positive responses. 

### Omitting Redundant Responses

This strategy assumes that the user should never see two responses of the same intent. Intents are defined by the clusters generated by the EXPANDER algorithm. The actual diversity strategy is simple: the top responses are iterated over in order of decreasing score. Each response is added to the list of suggestions unless its intent is already covered by a response on the suggestion list. 
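
A short sketch of this filtering step. The scores and the `cluster_of` mapping below are invented for the example; in the real system they would come from the LSTM scoring and the semantic clustering respectively.

```python
def dedupe_by_intent(scored_responses, cluster_of, max_suggestions=3):
    """Walk responses by decreasing score, keeping at most one per intent cluster."""
    suggestions, seen = [], set()
    for response, _score in sorted(scored_responses, key=lambda x: x[1], reverse=True):
        intent = cluster_of[response]
        if intent in seen:
            continue
        seen.add(intent)
        suggestions.append(response)
        if len(suggestions) == max_suggestions:
            break
    return suggestions

scored = [("Yes, I'll be there.", -1.2), ("I will be there.", -1.3),
          ("Thanks!", -2.0), ("Thank you!", -2.1), ("What time?", -2.5)]
cluster_of = {"Yes, I'll be there.": "attend", "I will be there.": "attend",
              "Thanks!": "thanks", "Thank you!": "thanks", "What time?": "ask_time"}
print(dedupe_by_intent(scored, cluster_of))
```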

### Enforcing Negatives and Positives

We have observed that LSTMs have a strong tendency towards producing positive responses, whereas negative responses such as "I can't make it" or "I don't think so" typically receive lower scores. There is often utility in including a negative option in the list of suggestions. 

If the top two responses (after omitting redundant responses) contain at least one positive response and none of the top 3 responses are negative, the third response is replaced by a negative one. 

In order to find the negative response, a second LSTM pass is performed. In this second pass, the search is restricted to only the negative responses in the target set. This is necessary since the top responses produced in the first pass may not contain any negatives. In situations where the responses are all negative, an analogous strategy is employed for enforcing at least one positive response.
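
The rule for enforcing a negative option might be sketched like this. The `polarity` mapping and the `find_best_negative` callback are hypothetical: the callback stands in for the second LSTM pass restricted to negative responses, and how responses are tagged as positive or negative is not covered in these notes.

```python
def enforce_negative(top_responses, polarity, find_best_negative):
    """Replace the third suggestion with a negative one when the rule fires."""
    top3 = list(top_responses[:3])
    positive_in_top2 = any(polarity.get(r) == "positive" for r in top3[:2])
    negative_in_top3 = any(polarity.get(r) == "negative" for r in top3)
    if positive_in_top2 and not negative_in_top3:
        negative = find_best_negative()   # stand-in for the second, restricted pass
        if negative is not None:
            top3[2:] = [negative]
    return top3

polarity = {"Yes, I can.": "positive", "Sounds good.": "positive",
            "Sorry, I can't make it.": "negative"}
suggestions = enforce_negative(
    ["Yes, I can.", "Sounds good.", "See you then."],
    polarity,
    find_best_negative=lambda: "Sorry, I can't make it.",
)
print(suggestions)
```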

## Triggering

The triggering module is the entry point of the smart reply system. It is responsible for filtering messages that are bad candidates for suggesting responses. This includes emails for which short replies are not appropriate, as well as emails for which no reply is necessary at all. 

The module is applied to every incoming email just after the preprocessing step. If the decision is negative, execution is finished and no suggestions are shown. Smart reply currently produces responses for around 11% of all emails, so this system vastly reduces the number of useless suggestions seen by users. 

The main part of the triggering component is a feed forward neural network which produces a probability score for every incoming message. If the score is above some threshold we trigger and run the LSTM scoring. 

### Data and Features

In order to label our training corpus of emails, we use as positive examples those emails that have been responded to. More precisely, out of the data set described in Section 7.1, we create a training set that consists of pairs ($\textbf{o}, y$) where $\textbf{o}$ is an incoming message and $y \in \{true, false\}$ is a boolean label, which is true if the message had a response and false otherwise. For the positive class, we consider only messages that were replied to from a mobile device, while for the negative class we use a subset of all messages. We downsample the negative class to balance the training set. Our goal is to model $P(y = true|\textbf{o})$, the probability that message $\textbf{o}$ will have a response on mobile. 
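
A minimal sketch of building such a balanced training set. The record fields `text`, `replied_on_mobile`, and `had_reply` are hypothetical, and the sketch assumes that negatives are messages without any reply and that balancing means a 1:1 class ratio; the notes do not pin down either detail.

```python
import random

def build_training_set(messages, rng):
    positives = [(m["text"], True) for m in messages if m["replied_on_mobile"]]
    negatives = [(m["text"], False) for m in messages if not m["had_reply"]]
    # Downsample the (typically much larger) negative class to match the positives.
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))
    data = positives + negatives
    rng.shuffle(data)
    return data

messages = (
    [{"text": f"question {i}", "replied_on_mobile": True, "had_reply": True} for i in range(2)]
    + [{"text": f"newsletter {i}", "replied_on_mobile": False, "had_reply": False} for i in range(6)]
)
training_set = build_training_set(messages, random.Random(0))
```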

After preprocessing, we extract content features (e.g., unigrams, bigrams) from the message body, subject and headers. We also use various social signals, such as whether the sender is in the recipient's address book, whether the sender is in the recipient's social network, and whether the recipient responded in the past to this sender. 

### Network Architecture and Training

We use a feedforward multilayer perceptron with an embedding layer (for a vocabulary of roughly one million words) and three fully connected hidden layers. We use feature hashing to bucket rare words that are not present in the vocabulary. The embeddings are separate for each sparse feature type (e.g., unigram, bigram) and within one feature type, we aggregate embeddings by summing them up. Then, all sparse feature embeddings are concatenated with each other and with the vector of dense features. 

We use the ReLU activation function for non-linearity between the layers. Dropout is applied after each hidden layer. The model is trained using Adagrad with the logistic loss cost function. 
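
The forward pass of such a network can be sketched in plain Python. This is a simplified illustration with randomly initialized weights: all layer sizes, the bucket count, and the threshold are made-up values; every feature is hashed (rather than only rare words); and dropout is omitted because it is only active during training.

```python
import math
import random
import zlib

def hash_bucket(feature, num_buckets):
    """Feature hashing: map a string feature to a fixed embedding row."""
    return zlib.crc32(feature.encode("utf-8")) % num_buckets

def embed_sum(features, table):
    """Sum the embeddings of all features of one sparse feature type."""
    total = [0.0] * len(table[0])
    for feat in features:
        row = table[hash_bucket(feat, len(table))]
        total = [t + r for t, r in zip(total, row)]
    return total

def dense(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, buckets=50, emb_dim=4, num_dense=3, hidden=8):
    sizes = [2 * emb_dim + num_dense, hidden, hidden, hidden]
    return {
        "unigram_emb": random_matrix(rng, buckets, emb_dim),
        "bigram_emb": random_matrix(rng, buckets, emb_dim),
        "hidden": [(random_matrix(rng, sizes[i + 1], sizes[i]), [0.0] * sizes[i + 1])
                   for i in range(3)],
        "output": (random_matrix(rng, 1, hidden), [0.0]),
    }

def trigger_probability(unigrams, bigrams, dense_features, params):
    # Concatenate per-type summed embeddings with the dense (e.g., social) features.
    x = (embed_sum(unigrams, params["unigram_emb"])
         + embed_sum(bigrams, params["bigram_emb"])
         + list(dense_features))
    for weights, bias in params["hidden"]:   # three hidden layers with ReLU
        x = relu(dense(x, weights, bias))
    logit = dense(x, *params["output"])[0]
    return 1.0 / (1.0 + math.exp(-logit))    # sigmoid, matching a logistic loss

params = init_params(random.Random(0))
tokens = ["are", "you", "free", "tomorrow"]
bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
p = trigger_probability(tokens, bigrams, [1.0, 0.0, 1.0], params)
should_trigger = p > 0.5   # hypothetical threshold
```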