# Neural module networks
We can draw lessons from formal systems, e.g. conjunction, disjunction, and existence ("does it exist?").

Why is this required? We perform a particular transformation depending on the image and text we are given. In other work (Key-Value mem nn), the transformation is fixed: there is a matrix A and a feature function \phi that are the same for every question/memory. Here, depending on the question {and on the image/memory}, we use a specific program.

We model two probabilities. 
* p(z|x): given the question x, what's the probability of this layout z.
* p_z(y|w): given that we are using layout z and we're in the world w, what's the probability of this answer y.

* [[ z ]]_w: output of layout z given world w.
* p_z(y|w) = ([[ z ]]_w)_y # We assume that the output of the layout z is a prob distribution over all the answers y.


## Modules
### Lookup (-> attention)
Given a word i, we output a one-hot vector with a one in the position of i's index. 
lookup[i] = eye[i]
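
A minimal sketch of lookup, assuming the world has `n` entries and that word i has already been mapped to an integer index (both `n` and the function signature are my assumptions, not something stated above):

```python
import numpy as np

def lookup(i, n):
    """One-hot attention over n world entries, with all the mass on index i."""
    return np.eye(n)[i]

print(lookup(2, 5))  # [0. 0. 1. 0. 0.]
```

Since the output is already a valid distribution over entries, it can be fed to any module that takes an attention.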

### Find (-> attention)
We transform v_i and W into a better-suited space. Then, we compute the similarity by just adding the two transformed variables. Then, we compute softargmax to softly get the index of the vector in the world representation that is most similar to v_i. {what if we use another type of similarity measure instead of adding? Is there a reason behind using addition instead of, say, Hadamard? It seems adding could be more "stable" (because it's simple.)}

softargmax(similarity(affine(v_i), affine(W)))
similarity(x, y) = x + y + bias
affine(v_i) = Bv_i
affine(W) = CW
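
A rough NumPy sketch of find under the formulas above. The sum of the two affine maps is a vector per world entry, so something has to reduce it to one score per entry before the softargmax; here I assume a ReLU followed by a dot product with a vector `a`, borrowed from the scoring formula in the Parse section. All shapes and the random parameters are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def find(v_i, W, B, C, bias, a):
    """Attention over the rows of W for the word vector v_i.

    v_i: (d_word,), W: (n_entries, d_world),
    B: (d_hidden, d_word), C: (d_hidden, d_world),
    bias: (d_hidden,), a: (d_hidden,)  # a and the ReLU are assumptions
    """
    hidden = B @ v_i + W @ C.T + bias     # similarity(affine(v_i), affine(W)), one row per entry
    scores = np.maximum(hidden, 0.0) @ a  # assumed reduction to one score per entry
    return softmax(scores)

rng = np.random.default_rng(0)
n_entries, d_word, d_world, d_hidden = 4, 3, 5, 6
h = find(rng.normal(size=d_word),
         rng.normal(size=(n_entries, d_world)),
         rng.normal(size=(d_hidden, d_word)),
         rng.normal(size=(d_hidden, d_world)),
         rng.normal(size=d_hidden),
         rng.normal(size=d_hidden))
print(h, h.sum())  # a distribution over the 4 world entries
```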

### Relate (attention -> attention)
Same as above, but now instead of just looking for w_j's of W that are similar to v_i, we also care about w_j's that are similar to the w_j's selected by the previous attention. It's not only about a smooth transition in attention: we can also use the previous attention as information about what to do in the next one. E.g., given an attention on Georgia, $relate[in](Georgia)$ looks for the entities that have the relation "in" with Georgia. Another option is $relate[nose](face)$, where we go from general to specific. Another is to go from specific to general. Another is to slowly scan an image: we condition on the previous attention but move a little bit to the right.[0]

softargmax(similarity(affine(v_i), affine(W), affine(last attention)))
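
A sketch of relate in the same spirit. I assume the last attention enters as the attention-weighted average of the world vectors (`h_prev @ W`), passed through its own matrix `D`; the ReLU and the vector `a` that reduce to one score per entry are also assumptions, as are all shapes.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def relate(v_i, W, h_prev, B, C, D, bias, a):
    """New attention over the rows of W, conditioned on the previous attention h_prev.

    W: (n_entries, d_world), h_prev: (n_entries,), D: (d_hidden, d_world).
    """
    context = h_prev @ W                              # assumed summary of the last attention
    hidden = B @ v_i + W @ C.T + D @ context + bias   # three-way additive similarity
    scores = np.maximum(hidden, 0.0) @ a              # assumed reduction to one score per entry
    return softmax(scores)

rng = np.random.default_rng(1)
n_entries, d_word, d_world, d_hidden = 4, 3, 5, 6
h_prev = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. the output of a lookup
h_next = relate(rng.normal(size=d_word),
                rng.normal(size=(n_entries, d_world)),
                h_prev,
                rng.normal(size=(d_hidden, d_word)),
                rng.normal(size=(d_hidden, d_world)),
                rng.normal(size=(d_hidden, d_world)),
                rng.normal(size=d_hidden),
                rng.normal(size=d_hidden))
print(h_next)
```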

### And (attention* -> attention)
and(h^1, h^2, ..., h^n) = h^1 \odot h^2 ... \odot h^n
{what about the or module? A logical or is just adding the n attentions.}
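
A small sketch of and as an elementwise product, next to the additive or suggested in the note above (the `or_` function is only that suggestion written out, not a module from the source). Neither output is renormalized here.

```python
import numpy as np

def and_(*attentions):
    """Elementwise product: an entry survives only if every attention selects it."""
    return np.prod(np.stack(attentions), axis=0)

def or_(*attentions):
    """Elementwise sum: an entry survives if any attention selects it."""
    return np.sum(np.stack(attentions), axis=0)

h1 = np.array([0.5, 0.5, 0.0])
h2 = np.array([0.0, 0.5, 0.5])
print(and_(h1, h2))  # [0.   0.25 0.  ]
print(or_(h1, h2))   # [0.5 1.  0.5]
```

The product no longer sums to one, so a downstream module either has to tolerate unnormalized attentions or renormalize them.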

### Describe (attention -> labels)
Given that the model is focused on some part of the world representation, we ask it to describe some particular characteristic of that world. 

$$
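\begin{aligned}
\mathrm{describe}[\mathrm{word}](h) &= \mathrm{softargmax}(S(W, \mathrm{word}, h)) \\
S(W, \mathrm{word}, h) &= \mathrm{transform}(\mathrm{similarity}(T(W, h), \mathrm{represent}(\mathrm{word}))) \\
T(W, h) &= \mathrm{transform}(\mathrm{attention}(W, \mathrm{weights}=h))
\end{aligned}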
$$
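
One way to read the describe formulas as code, as a sketch only. I assume `attention(W, weights=h)` is the h-weighted average of the world vectors, that both transforms are linear maps (`P` for the attended vector, `A` for the final projection to labels), that similarity is additive with a ReLU, and that `represent(word)` is a linear map `R` of the word vector. All names and shapes are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def describe(v_word, W, h, P, R, bias, A):
    """Distribution over labels for the region of W selected by attention h.

    W: (n_entries, d_world), h: (n_entries,), P: (d_hidden, d_world),
    R: (d_hidden, d_word), A: (n_labels, d_hidden).
    """
    attended = h @ W                                            # attention(W, weights=h)
    hidden = np.maximum(P @ attended + R @ v_word + bias, 0.0)  # additive similarity (assumed)
    return softmax(A @ hidden)                                  # transform to labels, then softargmax

rng = np.random.default_rng(2)
n_entries, d_word, d_world, d_hidden, n_labels = 4, 3, 5, 6, 7
h = np.array([0.1, 0.7, 0.1, 0.1])
p_y = describe(rng.normal(size=d_word),
               rng.normal(size=(n_entries, d_world)),
               h,
               rng.normal(size=(d_hidden, d_world)),
               rng.normal(size=(d_hidden, d_word)),
               rng.normal(size=d_hidden),
               rng.normal(size=(n_labels, d_hidden)))
print(p_y)
```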

It seems that the generality of the input of a module varies. 

I imagine having only entities in W, and not abstract words like city. Also, if we want to attend to the portions of the world representation that are cities, the vector for the category cities has to be similar to all specific city vectors.[1]

### Exists (attention -> labels)
{why do you have a softmax here?}


## Parse
We parse the question into a tree. Then, we convert portions of the tree into calls to modules (e.g. proper nouns -> lookup[noun], general nouns -> find[noun], preposition + noun -> relate[prep](find/lookup[noun])). Then we compose the modules. Finally, on top of the composed module we add an exists/describe module, to map from attention into labels.[2]

Given a set of layouts, we want to score them. Thus, we define a function f that outputs features of a given layout z_i (specifically, the number of modules of each type that the layout has and the parameters used). Then, we transform both the question x and the result of f(z_i), and compute the similarity between them. As the similarity is elementwise and so gives a vector, we then take its dot product with a vector a to get a single real-valued score.

(x being the question)

score(z_i|x) = transform_1(similarity(transform(x), transform(z_i)))
transform_1(s) = a^T relu(s)
similarity(v, u) = v + u + bias
transform(x) = LSTM(x)
transform(z_i) = affine(features(z_i))

p(z|x) = softmax(score(z|x))
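
A sketch of the layout scorer. Layouts are written here as nested tuples `(module_name, *children)`, which is my own encoding, and the features only count module types (the parameters used are left out for brevity). The question encoding `q` is a random stand-in for the LSTM output, assumed to already live in the hidden space.

```python
import numpy as np
from collections import Counter

MODULE_TYPES = ["lookup", "find", "relate", "and", "describe", "exists"]

def module_names(layout):
    """Yield every module name in a layout given as nested tuples (name, *children)."""
    name, *children = layout
    yield name
    for child in children:
        yield from module_names(child)

def features(layout):
    """f(z): how many modules of each type the layout contains."""
    counts = Counter(module_names(layout))
    return np.array([counts[m] for m in MODULE_TYPES], dtype=float)

def layout_probs(q, layouts, C, bias, a):
    """p(z|x) over candidate layouts; q stands in for LSTM(x)."""
    F = np.stack([features(z) for z in layouts])  # (n_layouts, n_types)
    hidden = q + F @ C.T + bias                   # similarity(transform(x), transform(z_i))
    scores = np.maximum(hidden, 0.0) @ a          # a^T relu(s), one score per layout
    e = np.exp(scores - scores.max())
    return e / e.sum()

z1 = ("exists", ("and", ("find",), ("relate", ("lookup",))))
z2 = ("describe", ("find",))
rng = np.random.default_rng(3)
d_hidden = 8
probs = layout_probs(rng.normal(size=d_hidden), [z1, z2],
                     rng.normal(size=(d_hidden, len(MODULE_TYPES))),
                     rng.normal(size=d_hidden), rng.normal(size=d_hidden))
print(features(z1))  # [1. 1. 1. 1. 0. 1.]
print(probs)
```

Note that nothing here touches the world representation, which is why this step is cheap compared to executing a layout.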

Scoring actions (i.e. p(z|x)) is cheap for the model (i.e. parse the sentence and process each module).
Scoring the answers given the layout and the world representation (i.e. p_z(y|w)) takes the model much longer.

It seems this is so because the world representation could be huge, but the number of modules we have in our model can be small. An action can be easily described in the abstract. But when the action involves specific entities, it takes more time to process them. This resembles a scenario in RL [*]

{attention -> attention sounds interesting. Say we have one vector and an array of vectors. If we use attention, we generate a prob distribution over the array of vectors. Now, given the prob distribution and the vector, can you recover the array of vectors? Also, given the prob distribution and the array of vectors, can you recover the vector?}


Ex: 
A = [[1, 0], [1/sqrt(2), 1/sqrt(2)], [0, 1]]
v = [3/sqrt(5), 4/sqrt(5)]
p = [3/sqrt(5), 

we have unit vectors, so a^2+b^2 = 1
Also, a*x + b*y = c
So given x, y and c, we can recover a and b (up to the two points where the line meets the unit circle)

Now, say you have an n-dimensional vector, and you attend over k n-dimensional vectors (with k \geq n.) If the k vectors we attend over are different, then we have k equations relating the n variables. We take n of those equations (linearly independent ones) and we can solve exactly for the values of the key vector. (If we are using unit vectors, then we only need n-1 equations.) {how can we approximate the answer if we don't have that many vectors in memory?}
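
A sketch of that recovery with NumPy, assuming we observe the raw dot-product scores (not the softmax output, which only pins the scores down up to an additive constant). The memory `K` and the key `v` are illustrative values of my own. `np.linalg.lstsq` also gives one possible answer to the question above: with fewer vectors than dimensions it returns the minimum-norm solution.

```python
import numpy as np

# Memory of k = 3 vectors in n = 2 dimensions (illustrative values).
K = np.array([[1.0, 0.0],
              [1.0 / np.sqrt(2), 1.0 / np.sqrt(2)],
              [0.0, 1.0]])
v = np.array([0.6, 0.8])  # the key we pretend not to know
c = K @ v                 # observed scores: one linear equation per memory row

# k >= n and K has full column rank: least squares recovers v exactly.
v_hat, *_ = np.linalg.lstsq(K, c, rcond=None)
print(v_hat)  # approximately [0.6 0.8]

# Fewer equations than unknowns: lstsq returns the minimum-norm solution.
v_min, *_ = np.linalg.lstsq(K[:1], c[:1], rcond=None)
print(v_min)  # approximately [0.6 0. ]
```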

{what about the reverse side}

Elements: the magnitude of the sum of two unit vectors measures their similarity.
a = [1/s(4), -s(2)/s(4), 1/s(4)]
b = [-s(3)/s(5), 0, s(2)/s(5)]
a+b = [-0.27459667, -0.70710678,  1.13245553]
||a+b||=1.36
a+a=[ 1.        , -1.41421356,  1.        ]
||a+a||=2
c = [1/s(2), 1/s(2)]
d = [-1/s(2), -1/s(2)]
||c+d|| = 0
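
For unit vectors, ||a+b||^2 = 2 + 2 a.b, which is why the magnitude of the sum tracks cosine similarity: it is 2 for identical vectors and 0 for opposite ones. A quick numeric check of the examples above, reading s(.) as sqrt (my reading of the notation):

```python
import numpy as np

a = np.array([1.0, -np.sqrt(2), 1.0]) / np.sqrt(4)
b = np.array([-np.sqrt(3), 0.0, np.sqrt(2)]) / np.sqrt(5)
c = np.array([1.0, 1.0]) / np.sqrt(2)
d = -c

def sum_norm(u, w):
    """Magnitude of the sum of two vectors."""
    return np.linalg.norm(u + w)

print(sum_norm(a, b))  # about 1.36
print(sum_norm(a, a))  # 2 up to rounding (identical vectors)
print(sum_norm(c, d))  # 0.0 (opposite vectors)

# Identity for unit vectors: ||a+b||^2 = 2 + 2 * (a . b)
print(np.isclose(sum_norm(a, b) ** 2, 2 + 2 * (a @ b)))  # True
```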


Continue: REINFORCE rule on page 6. Also: read http://www.scholarpedia.org/article/Policy_gradient_methods Also, try to avoid reading more and more papers and focus on deep understanding.

## Notes
[0] This sounds super interesting.
[1] How can we get from several city vectors to the vector for the category cities? We need some type of average thing.
[2] {it seems the model can't use its own language in this process. It's always forced to use an attention over the world representation. It seems similar to humans: we can't (or it's difficult to) think about things that aren't part of our world, that can't be expressed as a combination of the world's primitives. Imagine this simplified case. We have a list of vectors that represent all the primitives in the world. Say they are n-dimensional vectors, but the whole list spans only an (n-1)-dimensional space. Then, are humans capable of having thoughts corresponding to vectors in the nullspace of this list (i.e. orthogonal to everything the primitives span)?}

## Terms
Scenario in RL: It's cheap to evaluate actions, but expensive to execute them. Humans usually consider several options before making a decision. This seems to be because we can easily score an action in our mind, but taking it could take much more time (and it may have bad/irreversible consequences).