# Elements of deep learning
Deep learning seems to reuse a small set of elements over and over, with slight modifications but the same underlying principle.

Alternatively, we can see deep learning as the effort to make familiar, non-differentiable operations differentiable through slight modifications.

## Overview of components
* Vector spaces vs graphs
* Attention
* Transformations
* RNNs and gates
* Softmax = Softargmax
* Similarity measure

## Vector spaces vs graphs
The number of nodes and edges in a graph is countable. From any given node, there are countably many directions (edges) we can take. This is useful when we want to analyze graphs. Now consider extending a graph to a vector space. We then have uncountably many "nodes" (one for each point in the space) and uncountably many paths between any two given points. This continuity is what we need to apply deep learning.

In a graph, switching from one edge to another is a jump (ie, it's not continuous.) The same applies to switching from one node to another. This doesn't happen in a vector space: we can make an arbitrarily small change to the "edge" or "node" we select.

## Attention
Attention is the differentiable equivalent to the find or lookup operations. 

For instance, a memory-augmented neural net has a memory that stores several vectors. Say we want to retrieve the stored vector that is most similar to a given key. A non-differentiable way of doing it is to compare every stored vector to the key with the equality operation. To make the operation differentiable, we replace the equality test with a similarity measure. We then have one similarity score for every vector in memory, so to softly select one, we apply softargmax to the scores and use the result to weight the stored vectors.
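A minimal sketch of the hard and the soft lookup, in plain Python. The names `hard_lookup` and `soft_lookup`, the dot product as similarity measure, and the toy memory are assumptions made for illustration; they are not taken from any particular memory-augmented architecture.

```python
import math

def softargmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hard_lookup(memory, key):
    """Non-differentiable lookup: return the first stored vector equal to the key."""
    for vec in memory:
        if vec == key:
            return vec
    return None

def soft_lookup(memory, key):
    """Differentiable lookup: similarity-weighted average of all stored vectors."""
    weights = softargmax([dot(vec, key) for vec in memory])
    dim = len(memory[0])
    read = [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]
    return read, weights

# Toy memory (assumed values) and a key that matches the second entry
memory = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
key = [0.0, 1.0]
read, weights = soft_lookup(memory, key)
```

The hard lookup returns nothing for a key like `[0.1, 0.9]`, while the soft lookup still returns a blend dominated by the most similar stored vector.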

## Transformations
It's often the case that a change in perspective makes a problem much easier. There are several ways of representing 2 (eg $2,$ $2 * 1,$ $\log_e(e^2),$ $50^0 * 3 - 1.$) If we need to add it to another number, just using the number 2 could be easier. Now, in the expression $2 + 2 * x$ we can represent the first $2$ as $2 * 1.$ Then, we can represent $2 * 1 + 2 * x$ as $2 * (x + 1).$ And this last expression can be useful to detect some features of the given expression (eg its factors.)

### Characteristics of representations
Different representations are useful for different tasks.

There's a trade-off between conciseness and descriptiveness. Often, the more concise a representation is, the less descriptive it is. However, sometimes we can make a representation more concise without big losses in descriptiveness.

Similarity measures (see the section below) assume that we have continuous vectors. However, several tasks have discrete data as input. Thus, we need a projection that transforms that data into a continuous representation.

### Defining a transformation
Generally, we operate with vectors. Thus, we can define a transformation as a mapping from one space to another. (There's also the case of transforming other types of data into vectors.)

Vector to vector
* linear: matrix times input vector
* affine (Wx+b): linear plus bias
* dense: Wx+b followed by an activation function
* mlp: n dense layers
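The list above can be sketched in a few lines of plain Python. The naming here is an assumption: `linear` is the bare matrix-vector product, `affine` adds the bias, and `tanh` is just one possible activation function. The weights are made-up numbers.

```python
import math

def linear(W, x):
    """Matrix times input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def affine(W, b, x):
    """Linear map plus bias: Wx + b."""
    return [z + bi for z, bi in zip(linear(W, x), b)]

def dense(W, b, x, act=math.tanh):
    """Affine map followed by an element-wise activation function."""
    return [act(z) for z in affine(W, b, x)]

def mlp(layers, x):
    """Apply n dense layers in sequence; `layers` is a list of (W, b) pairs."""
    for W, b in layers:
        x = dense(W, b, x)
    return x

# Made-up weights: a 2 -> 3 layer followed by a 3 -> 1 layer
W1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]
W2, b2 = [[1.0, -1.0, 0.5]], [0.1]
x = [0.5, -0.5]
y = mlp([(W1, b1), (W2, b2)], x)
```

Each function only wraps the previous one, which is the point of the list: every transformation is the one before it plus one extra ingredient.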

Other type to vector
* bag of words: given a document, we create a vector that has a counter for each word
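A small sketch of bag of words, assuming whitespace tokenization, lowercasing, and a fixed vocabulary (all three are simplifications chosen for the example):

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Map a document to a vector with one count per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document
vocabulary = ["cat", "dog", "sat", "the"]
vector = bag_of_words("The cat sat near the other cat", vocabulary)
```

Words outside the vocabulary ("near", "other") are simply dropped, and word order is lost.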

Examples. A word embedding lookup is a linear transformation: it multiplies a one-hot vector by the embedding matrix.
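To see why an embedding can be read as a matrix product, here is a sketch that compares a plain row lookup with multiplying a one-hot vector by the embedding matrix. The matrix `E` (one row per word) and its values are assumptions for the example.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(E, index):
    """Multiply the one-hot row vector by the embedding matrix E."""
    h = one_hot(index, len(E))
    dim = len(E[0])
    return [sum(h[i] * E[i][j] for i in range(len(E))) for j in range(dim)]

# Made-up embedding matrix: 3 words, 2 dimensions
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
looked_up = E[1]                  # table lookup
multiplied = embed_by_matmul(E, 1)  # same row, obtained by a matrix product
```

In practice the lookup is used because it is cheaper, but the two give the same vector.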

### Multilayer perceptrons (MLP)
An MLP is composed of several dense transformations. Say we have an MLP that correctly classifies cats. In the input image everything seems entangled, but as we go through the hidden layers, we get a better and better representation of the input data. At the last hidden layer, a successful MLP will have made the data linearly separable, since the output layer is typically just a linear classifier on top of it.

## RNNs and gates
It seems that RNNs and gates are related. We could see the gate as a simplification of an RNN.

$h_t = gate(x, h_{t-1})$

$h_t = rnn(x, h_{t-1})$

What's the difference in the computation?

$gate(x, h_{t-1}) = g \odot x + (1 - g) \odot h_{t-1}$

$rnn(x, h_{t-1}) = act\_fun(Ux + Vh_{t-1} + b)$

How do we get from an RNN to a gate?

The Hadamard product $g \odot x$ can be seen as a diagonal matrix multiplying x. That is, $diag(g)x = g \odot x$ (where diag(k) constructs a matrix with the values of the vector k in the diagonal entries.) In other words, $g \odot x$ is a specific case of $Ux$ where U is diagonal. (Geometrically, a diagonal matrix means we are only rescaling each dimension of the vector, without any rotation.)

Now, remember the origin of the parameter g in the gate. (not sure:) almost always, it comes from a sigmoid function, so every entry of g lies between 0 and 1. Thus, there we have another constraint!

The fact that we have $diag(g)$ and $diag(1 - g)$ instead of $U$ and $V$ means we are constraining U and V to be diagonal and to sum to the identity (and thus to have the same dimensions.)

Finally, if the activation function is the identity function and the bias vector is the zero vector, we arrive at the gate.
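A quick numerical check of this reduction, as a sketch: we build $U = diag(g)$ and $V = diag(1 - g)$, set the bias to zero and the activation to the identity, and compare the RNN step with the gate. The vectors are made-up values and the function names are only for this example.

```python
def gate(g, x, h_prev):
    """g * x + (1 - g) * h_prev, element-wise."""
    return [gi * xi + (1.0 - gi) * hi for gi, xi, hi in zip(g, x, h_prev)]

def diag(k):
    """Square matrix with the entries of k on the diagonal and zeros elsewhere."""
    n = len(k)
    return [[k[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def rnn_step(U, V, b, x, h_prev, act_fun):
    """act_fun(Ux + Vh_prev + b), with act_fun applied element-wise."""
    pre = [u + v + bi for u, v, bi in zip(matvec(U, x), matvec(V, h_prev), b)]
    return [act_fun(p) for p in pre]

g = [0.2, 0.9, 0.5]        # gate values, each between 0 and 1
x = [1.0, -2.0, 0.5]
h_prev = [0.0, 4.0, -1.0]

U = diag(g)
V = diag([1.0 - gi for gi in g])
b = [0.0, 0.0, 0.0]

from_gate = gate(g, x, h_prev)
from_rnn = rnn_step(U, V, b, x, h_prev, act_fun=lambda z: z)
```

Here `from_gate` and `from_rnn` agree up to floating-point rounding, and `U` plus `V` is the identity matrix.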

It's interesting: one step of an RNN is a generalization of a gate. The RNN is much more powerful than the gate, but the specificity and simplicity of the gate could make it more useful in cases where we know that we need a gate. [1]

{a gate doesn't care about the order of the inputs, but an rnn does. Is there a version of the rnn where we don't care about the order, ie, where we have either two hidden states or two input states? Having two hidden states h_1 and h_2 is the same as having the hidden state h_3 = [h_1, h_2]. What does it mean to have only one hidden state without any input? Every iteration of the rnn is just a new layer, but with the same weights shared across the layers. Is that useful? It's something like Wtanh(Wtanh(Wtanh(Wx)))}

{try with grus}

## Softmax = Softargmax
Softargmax is the differentiable equivalent to the argmax operation.

Selecting the argument that has the maximum value is a very common operation. For instance, if we have an MLP that classifies handwritten digits, we want to softly select the digit to which the MLP assigned the highest score.
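A sketch of argmax next to softargmax. The `temperature` parameter is an addition for illustration (it is not mentioned above): as it gets small, the soft selection approaches the hard one.

```python
import math

def argmax(scores):
    """Index of the largest score (a hard, non-differentiable selection)."""
    return max(range(len(scores)), key=lambda i: scores[i])

def softargmax(scores, temperature=1.0):
    """Soft selection: positive weights that sum to 1, largest on the best score."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 3.0, 0.5]  # made-up scores for three classes
hard = argmax(scores)
soft = softargmax(scores)
sharp = softargmax(scores, temperature=0.01)
```

`soft` spreads some weight over every class, while `sharp` puts almost all of it on the class that `argmax` picks.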

## Similarity measure
This could be seen as the differentiable version of the equality operation. Two continuous vectors are almost never exactly equal. So if we want to minimize the distance between two continuous vectors, the equality operation won't be useful (it will almost always be false.)

> So we need something else!

### Cosine similarity and L2-distance
(In this and the following subsections, assume that all vectors are unit vectors.) [0] ($\leftarrow$ Notes are below)

The angle between two vectors seems to be related to the similarity between them. The smaller the angle, the more similar the vectors. [2] 

We can define the angle between two vectors, v and w, as follows. Let's think of v and w as 2D vectors. Imagine we draw the unit circle. [#] Now, the vectors v and w go from the point (0, 0) to some point on the circumference of the unit circle. Let's call $arc_{vw}$ the length of the arc that goes from the tip of vector v to the tip of vector w. Also, consider that the circle has circumference $2\pi$ (because it has radius one.) Now, we define the angle as

$$\frac{arc_{vw}}{2\pi}$$

The intuition behind this formula is that an angle tells you the fraction of the full circle that our two vectors span. (Measured this way, the angle is in turns; in radians it is simply $arc_{vw}.$)

Now, let's call u the vector that goes from v's tip to w's tip. We know that $u = w - v.$ Thus, u's magnitude is $||v-w||_2.$

Finally, consider that when the radius of a circle tends to zero, 
{prove that cosine similarity is c*l_2 distance}
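One identity that might be what the note above is reaching for (this is an addition, not the author's proof): for unit vectors, the squared L2 distance and the cosine similarity determine each other. Writing $\theta$ for the angle between v and w, and using $v \cdot w = \cos\theta$ for unit vectors:

$$||v-w||_2^2 = (v-w) \cdot (v-w) = ||v||_2^2 - 2\, v \cdot w + ||w||_2^2 = 2 - 2\cos\theta$$

So $\cos\theta = 1 - \frac{1}{2}||v-w||_2^2.$ Under this identity the relation is affine in the squared distance rather than a constant multiple of the distance, but the ordering is the same: the smaller the distance, the larger the cosine similarity.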

### Sum

### (Generalized) element-wise product
Imagine we are comparing entries from two vectors. If one value is zero and the other is non-zero, then we can argue that the similarity is zero. However, an additive comparison based on the difference between entries treats the pair 0 and 0.5 the same as the pair 0.5 and 1 (both differ by 0.5), whereas their products are 0 and 0.5.

Also, a nice property of multiplicative interactions is that if we have two numbers p and q with p + q = s for some fixed constant s, the product p * q is maximized when p = q. Thus, if we have the vectors

a = [.87, .3, .4]
b = [.3, .87, .4]
c = [.65, .65, .4]

additive_similarity(a, b) = [1.17, 1.17, .8]
additive_similarity(c, c) = [1.3, 1.3, .8]

multiplicative_similarity(a, b) = [.26, .26, .16]
multiplicative_similarity(c, c) = [.42, .42, .16]

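It seems that the multiplicative similarity cares more about making the values similar: going from (a, b) to (c, c), the first two entries grow by about 11% under the additive similarity (1.17 to 1.3) but by about 62% under the multiplicative one (.26 to .42.)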
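The numbers above can be reproduced with a short sketch. The element-wise definitions follow the example; summing the entries into a single score and comparing the relative gains is an assumption added here to make the comparison concrete.

```python
def additive_similarity(u, v):
    """Element-wise sum of two vectors."""
    return [ui + vi for ui, vi in zip(u, v)]

def multiplicative_similarity(u, v):
    """Element-wise product of two vectors."""
    return [ui * vi for ui, vi in zip(u, v)]

a = [0.87, 0.3, 0.4]
b = [0.3, 0.87, 0.4]
c = [0.65, 0.65, 0.4]

# Assumed aggregation: sum the entries to get one score per pair
add_ab = sum(additive_similarity(a, b))
add_cc = sum(additive_similarity(c, c))
mul_ab = sum(multiplicative_similarity(a, b))
mul_cc = sum(multiplicative_similarity(c, c))

# Relative gain when the pair becomes identical
add_gain = add_cc / add_ab
mul_gain = mul_cc / mul_ab
```

With these values `mul_gain` is noticeably larger than `add_gain`, which matches the observation that the multiplicative version rewards matching values more strongly.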

### Additive vs multiplicative similarities
Thus, with additive similarities we are asking how far apart two vectors are additively, while with multiplicative similarities we are asking how far apart they are multiplicatively. How do we decide which interaction to use? It depends on how the vectors were generated. Consider the following MLP.

output = affine(input)
input = concat(embedding(x), embedding(y))

The output only has additive interactions between the two embeddings. Thus, if we use the representations trained with this MLP for another problem, we would want to use additive similarities.
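To see why the interaction is only additive, here is a sketch showing that an affine map of the concatenation splits into one term per embedding plus the bias. The weights and embedding values are made up, and `W_x`, `W_y` are just names for the two column blocks of `W`.

```python
def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def affine(W, b, x):
    """Wx + b."""
    return [z + bi for z, bi in zip(matvec(W, x), b)]

# Hypothetical embeddings of x and y (2 dimensions each)
emb_x = [1.0, 2.0]
emb_y = [-1.0, 0.5]

W = [[0.5, -1.0, 2.0, 0.0],
     [1.0, 1.0, 0.0, -2.0]]
b = [0.1, -0.1]

# output = affine(concat(embedding(x), embedding(y)))
output = affine(W, b, emb_x + emb_y)  # list + list concatenates

# Same result written as W_x emb_x + W_y emb_y + b
W_x = [row[:2] for row in W]
W_y = [row[2:] for row in W]
split = [px + py + bi
         for px, py, bi in zip(matvec(W_x, emb_x), matvec(W_y, emb_y), b)]
```

No entry of `emb_x` is ever multiplied by an entry of `emb_y`: the two embeddings only meet through a sum.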

Say we have some text documents, and we use bag of words to extract some features from the documents.


### Inner product


### Measuring similarity between more than two vectors


{skip connections, layernorm}

## Notes 
[0] It's important to take unit vectors. Otherwise, say we have a unit vector and a non-unit vector with very large values. Then, both additive and multiplicative similarities can be higher between the unit vector and the non-unit vector than between the unit vector and itself, and that doesn't make sense! Instead of taking unit vectors, another way to do this is to compute the similarity and then divide by the magnitudes of the vectors. For additive interactions, if we add two vectors (not necessarily unit) we need to divide the result by the average of the two magnitudes. For multiplicative interactions, if we multiply two vectors, we need to divide the result by the product of the magnitudes of the two vectors.
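A sketch of the multiplicative case, using the dot product as the similarity (an assumption for the example; the additive case is not covered here). Without normalization the large vector looks more similar to the unit vector than the unit vector is to itself; dividing by the product of the magnitudes removes that effect.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two magnitudes."""
    return dot(u, v) / (norm(u) * norm(v))

unit = [0.6, 0.8]     # magnitude 1
large = [30.0, 40.0]  # same direction, magnitude 50
other = [0.8, 0.6]    # magnitude 1, different direction

raw_self = dot(unit, unit)
raw_large = dot(unit, large)  # much larger than raw_self

cos_self = cosine_similarity(unit, unit)
cos_large = cosine_similarity(unit, large)  # same direction, so about 1
cos_other = cosine_similarity(unit, other)  # below 1
```

After normalization, no vector scores higher against `unit` than `unit` itself does.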

[1] Say we have the space of tools and the space of problems. We can think about a tool as covering part of the space with a mantle. Generally, the more the mantle covers, the thinner it is. We can also think about a problem as another mantle with a fixed height of 1. Now, the performance in solving the problem is determined by the integral of the tool mantle over the region where it coincides with the problem mantle (in other words, we take something like an inner product between the two mantles.) The thicker the mantle the better. But we need the mantle to cover the subspace of the problem. Thus, a more specific tool applied to the right problems will yield better results than a general tool.

For instance, Gaussian discriminant analysis (GDA), which fits a Gaussian distribution to each class, makes stronger assumptions than logistic regression: when the classes are Gaussian with a shared covariance, the resulting class posterior has exactly the logistic form. If the data does come from such Gaussian distributions, GDA will tend to perform better than logistic regression. However, if the data comes from other distributions, logistic regression is more robust and will often perform better than GDA. {how big is that difference?}

Say we try to come up with a completely general algorithm, one whose mantle is huge {is it infinite?} and covers all the tasks. Then, the No Free Lunch theorem tells us that the thickness of this mantle will be zero. That is, averaged over all tasks, the performance of this general algorithm will be the same as giving a random answer.

[2] It's interesting to consider how we define the angle in high-dimensional spaces. It seems we have no definition of angle between points. The definition of the angle between lines is explained above. An intuitive example of the angle between two planes is where two walls meet. Imagine a room that has three walls of the same length (the floor and the ceiling are triangles.) Consider the angle formed by the wall (plane) W and the wall (plane) V.

An intuitive way to define the angle is as follows
* call e the line where V and W intersect
* select one point p in W at random (not on e)
* draw a line l that is contained in W, is perpendicular to e, and passes through p
* call r the intersection of the line l and the line e
* draw a line n that is contained in V, is perpendicular to e, and passes through r
* define the angle between planes V and W as the angle between lines l and n

Another way to define the angle between two planes is to take a third plane U that is perpendicular to V and W. V and U intersect in a line, and W and U intersect in another line. The angle between those two lines is the angle between the two planes V and W.
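For computing, a common shortcut (a different route from the two constructions above, named here plainly as a swapped-in technique) is to take the angle between the normal vectors of the two planes. The sketch below assumes we already know the normals and models two walls of the triangular room as vertical planes that meet at 60 degrees.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def angle_between_planes(normal_v, normal_w):
    """Angle in degrees (between 0 and 90) from the normals of the two planes."""
    cos_theta = abs(dot(normal_v, normal_w)) / (norm(normal_v) * norm(normal_w))
    return math.degrees(math.acos(min(1.0, cos_theta)))  # clamp rounding errors

# Wall V: the vertical plane y = 0. Wall W: a vertical plane rotated by 60 degrees.
normal_v = [0.0, 1.0, 0.0]
normal_w = [-math.sin(math.radians(60)), math.cos(math.radians(60)), 0.0]

angle = angle_between_planes(normal_v, normal_w)
right_angle = angle_between_planes([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Taking the absolute value of the dot product means the result is always the smaller of the two angles the planes form, which is what measuring an angle between two lines gives as well.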

{How do we define the angle between cubes?}

## Terms
Hyperplane: the generalization of a plane for higher dimensions.

Inner product
(not sure:) recursive neural networks: we apply a recursive nn to the output of another recursive nn (and so on), keeping the weights fixed. Example: this could be useful for parsing, because we could have sentences inside sentences, so the output of the recursive nn goes through a recursive nn again.

Unit circle: the circle centered at the origin (0, 0) with radius 1.

~At some point, services whose performance improves over time start giving less focus to the functionality of the service and more to its aesthetic qualities (this could happen because there isn't much more to do on the functional side.)

