Almost every NLP task can be cast as a question answering (QA) task.

# Modules
Module overview
Input: in: input sentence. out: distributed representation
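Question: in: question. out: distributed representation q
Memory: in: facts c, question q, previous memory. out: last memory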
Answer: in: last memory. out: task answer

## Input
input = word_embeddings([w_1, w_2, ... , w_n])

for x in input
    h_t = GRU(x, h_{t-1})
    
output = h_T if sentences(input) == 1 else {h_t | h_t is the hidden state right after an EOS token}
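
A minimal NumPy sketch of the pseudocode above, assuming a standard GRU cell with small random weights. The helper names (`init_gru`, `gru_step`, `input_module`), the toy vocabulary, and the choice of token id 0 for EOS are all hypothetical and only there for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(input_dim, hidden_dim, rng):
    # Hypothetical helper: small random weights, zero biases.
    p = {}
    for name in ("z", "r", "h"):
        p["W" + name] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + name] = np.zeros(hidden_dim)
    return p

def gru_step(p, x, h):
    # One GRU update: h_t = GRU(x, h_{t-1}).
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + r * (p["Uh"] @ h) + p["bh"])
    return z * h_tilde + (1.0 - z) * h

def input_module(token_ids, embeddings, params, eos_id):
    # Run the GRU over all tokens and keep the hidden state after each EOS.
    h = np.zeros(params["bz"].shape[0])
    facts = []
    for tok in token_ids:
        h = gru_step(params, embeddings[tok], h)
        if tok == eos_id:
            facts.append(h)
    if not facts:  # no EOS seen: fall back to the final hidden state
        facts.append(h)
    return np.stack(facts)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))  # toy vocabulary of 6 words, dimension 4
params = init_gru(4, 5, rng)
facts = input_module([1, 2, 0, 3, 4, 0], embeddings, params, eos_id=0)
print(facts.shape)  # (2, 5): one fact per sentence
```

The question module would reuse the same embeddings and simply return the final hidden state.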

## Question
question = word_embeddings([qw_1, qw_2, ..., qw_n]) # The word embeddings are shared between input and question.

for x in question
    h_t = GRU(x, h_{t-1})
    
return h_T

## Episodic memory

$h^i_t = gate(GRU(c_t, h^i_{t-1}), h^i_{t-1})$

$e^i = h^i_{T_c}$

$m^i = GRU(e^i, m^{i - 1})$

We consume the facts $c_t$ one at a time so as to compute a single representation $e^i$ of all the data. We use the hidden state produced after consuming the last fact to update the memory.
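
A rough NumPy sketch of one pass, assuming the gate values $g_t$ are already given as numbers in [0, 1] and that facts, episode, and memory share the same dimension. The GRU helpers and all names here are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(input_dim, hidden_dim, rng):
    # Hypothetical helper: small random weights, zero biases.
    p = {}
    for name in ("z", "r", "h"):
        p["W" + name] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + name] = np.zeros(hidden_dim)
    return p

def gru_step(p, x, h):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + r * (p["Uh"] @ h) + p["bh"])
    return z * h_tilde + (1.0 - z) * h

def episode(facts, gates, params):
    # h_t = g_t * GRU(c_t, h_{t-1}) + (1 - g_t) * h_{t-1}; the episode is h_{T_c}.
    h = np.zeros(facts.shape[1])
    for c_t, g_t in zip(facts, gates):
        h = g_t * gru_step(params, c_t, h) + (1.0 - g_t) * h
    return h

def update_memory(e, m_prev, params):
    # m^i = GRU(e^i, m^{i-1})
    return gru_step(params, e, m_prev)

rng = np.random.default_rng(0)
d = 4
facts = rng.normal(size=(3, d))
gates = np.array([0.0, 1.0, 0.0])  # only the second fact is treated as relevant
attn_gru, mem_gru = init_gru(d, d, rng), init_gru(d, d, rng)
e = episode(facts, gates, attn_gru)
m_prev = rng.normal(size=d)  # stand-in for the previous memory
m = update_memory(e, m_prev, mem_gru)
```

With these gate values the episode equals `gru_step(attn_gru, facts[1], zeros)`: a gate of 0 leaves the hidden state untouched, so that fact is skipped entirely.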

{what are the advantages of having the gate for $h_t^i$?}

It's interesting that this gives more weight to the most recent sentences.

#### rnns and gates (for elements)
It seems that rnns and gates are related. We could see the gate as a simplified rnn step.

$h_t = gate(x, h_{t-1})$

$h_t = rnn(x, h_{t-1})$

What's the difference in the computation?

$gate(x, h_{t-1}) = g \odot x + (1 - g) \odot h_{t-1}$

$rnn(x, h_{t-1}) = act\_fun(Ux + Vh_{t-1} + b)$

How do we get from an rnn to a gate?

The Hadamard product $g \odot x$ can be seen as a diagonal matrix multiplying x. That is, $diag(g)x = g \odot x$ (where $diag(k)$ constructs a matrix with the values of the vector k in the diagonal entries). In other words, $g \odot x$ is a specific case of $Ux$ where U is diagonal. (Geometrically, a diagonal matrix means we are only stretching the dimensions of the vector without any rotation.)

Now, remember where the g in the gate comes from. (not sure:) almost always, it is the output of a sigmoid function, so its entries lie in (0, 1). That gives us another constraint!

The fact that we have $diag(g)$ and $diag(1 - g)$ instead of $U$ and $V$ means we are constraining U and V to sum to the identity (and thus to have the same dimension.)

Finally, if the activation function is the identity function and the bias vector is the zero vector, we arrive at the gate.
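
A quick numerical check of this reduction, as a sketch: with $U = diag(g)$, $V = diag(1 - g)$, zero bias, and identity activation, the rnn step should reproduce the gate. The function names and the random test vectors are hypothetical.

```python
import numpy as np

def gate(x, h_prev, g):
    return g * x + (1.0 - g) * h_prev

def rnn_step(x, h_prev, U, V, b, act_fun):
    return act_fun(U @ x + V @ h_prev + b)

rng = np.random.default_rng(0)
d = 4
x, h_prev = rng.normal(size=d), rng.normal(size=d)
g = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))  # sigmoid output, entries in (0, 1)

U, V = np.diag(g), np.diag(1.0 - g)  # diagonal, and U + V is the identity
b = np.zeros(d)
identity = lambda a: a

same = np.allclose(gate(x, h_prev, g), rnn_step(x, h_prev, U, V, b, identity))
print(same)  # True
```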

It's interesting: one step of an rnn is a generalization of a gate. The rnn is much more powerful than the gate, but the specificity and simplicity of the gate could make it more useful in cases where we know that we need a gate. [1]

{a gate doesn't care about the order of its inputs, but an rnn does. is there a version of the rnn where we don't care about the order, i.e., where we have either two hidden states or two input states? Having two hidden states h_1 and h_2 is the same as having the hidden state h_3 = [h_1, h_2]. What does it mean to have only one hidden state without any input? Every iteration of the rnn is just a new layer, but with the same weights shared across the layers. is that useful? it's something like Wtanh(Wtanh(Wtanh(Wx)))}

{try with grus}

[1] Say we have the space of tools and the space of problems. We can think of a tool as covering part of the space with a mantle. Generally, the more the mantle covers, the thinner it is. We can also think of a problem as another mantle with a fixed height of 1. The performance in solving the problem is then determined by the integral of the tool mantle over the region where it coincides with the problem mantle (in other words, by the overlap between the two mantles). The thicker the mantle, the better, but the mantle also needs to cover the subspace of the problem. Thus, a more specific tool applied to the right problems will yield better results than a general tool.

For instance, logistic regression has as a special case an algorithm called Gaussian discriminant analysis (GDA), which fits a Gaussian distribution to the data of each class. If the data really comes from Gaussian distributions, then GDA will perform better than logistic regression. However, if the data comes from other distributions, then logistic regression will usually perform better than GDA. {how big is that difference?}

Say we try to come up with a completely general algorithm, one whose mantle is huge {is it infinite?} and covers all the tasks. Then the No Free Lunch theorem tells us that the height of this mantle will be zero. That is, averaged over all tasks, the performance of this general algorithm will be the same as giving a random answer.


### Similarity measure
First, we define a feature vector. Notice we only care about the interactions that include c; the interactions between m and q are the same for every stored fact c.

$z(c, m, q) = [c, m, q, c \odot q, c \odot m, |c - q|, |c - m|, c^T W q, c^T W m]$

{try the generalized_hadamard(x, y) = hadamard(x, affine(y)) = x ◦ Ay}
{try removing values from the feature vector. does the performance change?}
{what would the interaction between three things look like?}

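$G(c, m, q) = nn(z(c, m, q))$

with $nn(x) = \sigma(W_2 \tanh(W_1 x + b_1) + b_2)$, i.e., two affine layers (each with its own weights and bias), the first followed by tanh and the second by a sigmoid.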
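
A small NumPy sketch of the feature vector and the scoring network, reading `wb` as an affine map (weights plus bias). The parameter names (`W`, `W1`, `b1`, `W2`, `b2`), the sizes, and the random initialization are assumptions for illustration only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feature_vector(c, m, q, W):
    # z(c, m, q): every term involves the fact c except the raw copies of m and q.
    return np.concatenate([
        c, m, q,
        c * q, c * m,
        np.abs(c - q), np.abs(c - m),
        [c @ W @ q], [c @ W @ m],  # bilinear terms, one scalar each
    ])

def gate_score(c, m, q, W, W1, b1, W2, b2):
    # G(c, m, q) = sigmoid(W2 tanh(W1 z + b1) + b2), a scalar in (0, 1).
    z = feature_vector(c, m, q, W)
    return sigmoid(W2 @ np.tanh(W1 @ z + b1) + b2)

rng = np.random.default_rng(0)
d, hidden = 3, 5
c, m, q = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
W = rng.normal(scale=0.1, size=(d, d))
W1 = rng.normal(scale=0.1, size=(hidden, 7 * d + 2))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=hidden)
b2 = 0.0

z = feature_vector(c, m, q, W)
score = gate_score(c, m, q, W, W1, b1, W2, b2)
print(z.shape)  # (23,): seven d-dimensional blocks plus two scalars
```

Dropping entries from the list inside `feature_vector` (and shrinking `W1` to match) would be one way to run the ablation asked about above.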

What's $e^i$ for?
At each timestep, we want to update the memory based on the facts that are useful. To detect whether a fact is useful, we compute the similarity between the fact and the previous memory and between the fact and the question.

The way we compute $e^i$ puts an emphasis on the most recent facts. (not sure:) it seems another valid approach to compute $e^i$ would be to do attention(c, weights=g). In general, I don't know why we would want an emphasis on the most recent facts. At least, it seems a useful prior/assumption for tasks that ask about the current state of the world: if we moved something a hundred times, we only care about the last move. I wonder if this generalizes well to all types of tasks.
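
One possible reading of attention(c, weights=g), as a sketch: take the episode to be a weighted average of the facts. Normalizing the gates so they sum to one is an assumption made here, and `attention_episode` is a hypothetical name.

```python
import numpy as np

def attention_episode(facts, gates):
    # Weighted average of the facts; assumes at least one gate is non-zero.
    weights = gates / gates.sum()
    return weights @ facts

rng = np.random.default_rng(0)
facts = rng.normal(size=(3, 4))
gates = np.array([0.2, 0.7, 0.1])

e = attention_episode(facts, gates)

# Shuffling the facts (together with their gates) does not change the result.
perm = [2, 0, 1]
e_shuffled = attention_episode(facts[perm], gates[perm])
print(np.allclose(e, e_shuffled))  # True
```

Unlike the gated GRU pass, this version ignores the order of the facts entirely, so it has no built-in preference for recent ones.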

{it doesn't seem that good that the only way to retrieve a memory that is useful for a representation of the task is by looking at the similarity between the memory and the representation of the task. it might be the case that key-value memory networks solve this problem}

recurrent vs recursive: https://stats.stackexchange.com/questions/153599/recurrent-vs-recursive-neural-networks-which-is-better-for-nlp
what does a recurrent nn look like with circles and arrows?

## Answer module
The last memory contains the answer to the question. But the answer could be several words long, so we use a GRU that decodes the information in its hidden state and produces the words of the answer one at a time.

$$
\begin{aligned}
a_0 &= m^{T_M} \\
a_t &= GRU([q, y_{t-1}], a_{t-1}) \\
y_t &= softmax(W a_t)
\end{aligned}
$$
{why do we input the question q at each timestep but not the memory m? one answer is that we want to remove from the hidden state everything that we have already processed. however, the same applies to the question.}
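
A sketch of greedy decoding with the answer equations, in NumPy. The all-zeros $y_0$, the fixed `max_len` stopping rule, and every helper name are assumptions; the notes do not say how decoding starts or ends.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def init_gru(input_dim, hidden_dim, rng):
    # Hypothetical helper: small random weights, zero biases.
    p = {}
    for name in ("z", "r", "h"):
        p["W" + name] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + name] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + name] = np.zeros(hidden_dim)
    return p

def gru_step(p, x, h):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + r * (p["Uh"] @ h) + p["bh"])
    return z * h_tilde + (1.0 - z) * h

def answer_module(m_last, q, params, W, max_len):
    a = m_last                  # a_0 = last memory
    y = np.zeros(W.shape[0])    # y_0: assumed to be all zeros
    words = []
    for _ in range(max_len):
        a = gru_step(params, np.concatenate([q, y]), a)  # a_t = GRU([q, y_{t-1}], a_{t-1})
        y = softmax(W @ a)                               # y_t = softmax(W a_t)
        words.append(int(np.argmax(y)))                  # greedy choice of word id
    return words

rng = np.random.default_rng(0)
d, vocab = 4, 6
params = init_gru(d + vocab, d, rng)
W = rng.normal(scale=0.1, size=(vocab, d))
m_last, q = rng.normal(size=d), rng.normal(size=d)
words = answer_module(m_last, q, params, W, max_len=3)
```

Here the memory enters only through the initial state `a`, while `q` is concatenated to the input at every step, which is exactly the asymmetry the question above is about.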