# Neural Networks

### Idea: create an encoder for every type of media (text, images, audio) and also a decoder for every type. Train every combination of encoder-decoders with a fixed representation in the middle, so as to force them to find a way to represent different types of data in the same way.
### Idea: in the LSTM cell we may want the vectors i and f to sum to one. 
### Introduction to learning
Let's say there is a function $g(x)$ out there that, given some input, produces some output (eg a human reading a handwritten digit and saying the number.) We are interested in creating our own function $f(x)$ and making it as close as possible to $g(x).$ Now, usually we don't know how $g(x)$ works. But we have input-output pairs that we can use to learn how it works. [a]

#### Table
We could save each input-output pair we observed from $g(x)$ in a table. Then, when we want to know the value $f(x)$ for some $x,$ we could refer to this table. 
$$
\begin{align}
x\  &| f(x) \\
--&-- \\
x_1 &| g(x_1) \\
x_2 &| g(x_2) \\
x_3 &| g(x_3)
\end{align}
$$

Now, this has two problems: first, it doesn't generalize to new cases; second, we would need a lot of storage! 
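
A minimal sketch of the table approach, to make the first problem concrete. The function `g` below is a hypothetical stand-in (in practice we don't know it), and `f_table` is just an illustrative name.

```python
# Hypothetical stand-in for the unknown function g (an assumption for illustration).
def g(x):
    return 2 * x + 1

# "Training": store every observed input-output pair.
table = {x: g(x) for x in [1, 2, 3]}

def f_table(x):
    """Look up x in the table; return None for inputs never observed."""
    return table.get(x)

seen = f_table(2)      # found in the table
unseen = f_table(2.5)  # never observed, so the table has no answer
```

The lookup works for stored inputs but has nothing to say about `2.5`, which is the generalization problem.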

#### Neural network
We don't need to store the exact value of $g(x_i)$ for every $x_i.$ In fact, our observations of the function $g$ are usually noisy. So, it's enough for us to store an approximate value of $g.$ Now, we need to deal with the problem of generalizing to new points. This can be seen as having a bunch of points in a 2D plot and looking for a line that passes near them. That line has a set of parameters $w$ (also called weights.)

This makes us wonder about the central question
> how do we tune the weights $w$ such that $f$ is as close as possible to $g$? 

### Introduction to neural nets
Let's call the set of input-output pairs a training set. Formally, we have $m$ pairs, the input vector has length $n$, and the output vector length $k.$

$$
X \in R^{n \times m} \\
Y \in R^{k \times m}
$$

Now, let's call $x^{(i)}$ and $y^{(i)}$ the input and output of the $i$-th training case. We can easily measure the error on it, because we have our prediction $f(x^{(i)}; w)$ and the ground truth $y^{(i)}$ (which is equal to $g(x^{(i)})$).

A very simple function to use is $f(x; w) = \sum_i w_i x_i.$ Notice that if $x$ is the zero vector, we may want to have an output different from $0.$ Thus, we add a bias to the function $f.$ Now, $f(x; w, b) = b + \sum_i w_i x_i.$ To make it easier, we can append a one to the input vector, so that $x \in R^{n + 1},$ and append the bias to the weight vector, so that $w \in R^{n + 1}.$ So now $f(x; w) = w^Tx.$
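
As a quick check of the bias trick, here is a small sketch with made-up numbers (the values of `x`, `w` and `b` are arbitrary). It compares the explicit form with the folded form.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # illustrative input with n = 3
w = np.array([0.1, 0.4, -0.2])   # illustrative weights
b = 0.7                          # illustrative bias

explicit = b + np.sum(w * x)     # f(x; w, b) = b + sum_i w_i x_i

x_aug = np.append(x, 1.0)        # x in R^{n+1}, with a trailing 1
w_aug = np.append(w, b)          # w in R^{n+1}, with the bias last
folded = w_aug @ x_aug           # f(x; w) = w^T x
```

Both expressions give the same number, so from here on the bias can be treated as one more weight.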

#### Measure of the error
Let's say we want to calculate how bad our prediction $f(x^{(i)}; w)$ was. 

The starting point is just the difference between the prediction and the reality. Formally, 
$$f(x^{(i)}) - y^{(i)}$$
The problem is that the differences could be both positive and negative. Thus, for instance if we have an error of -2 and an error of 2 in two different training cases, the average error should be 2, but if we add and average -2 and 2 we get 0. Thus, we can add an absolute value. Formally, 
$$|f(x^{(i)}) - y^{(i)}|$$
This has another problem. If you look at $f(x) = |x|$ you will notice that it isn't differentiable at 0. Even if you keep zooming in at $x=0,$ the function won't become smooth. The importance of having a differentiable error function will become clear below. Now, squaring the error resolves the issue.
$$(f(x^{(i)}) - y^{(i)})^2$$
Thus, the error of a training case for a set of weights is
$$E_{i}(w) = \frac{1}{2} (f(x^{(i)}; w) - y^{(i)})^2$$
Now, we can calculate the error of the entire training set as 
$$E(w) = \sum_{i=1}^m \frac{1}{2} (f(x^{(i)}; w) - y^{(i)})^2$$
(The presence of $\frac{1}{2}$ will become clear in a minute.)
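
A small sketch of the total error for a scalar output ($k = 1$), assuming the training inputs are stored as columns of `X` as described above. The data and the helper names `predict` and `total_error` are made up for illustration.

```python
import numpy as np

def predict(X, w):
    """f(x; w) = w^T x for every column of X (shape n x m)."""
    return w @ X

def total_error(X, y, w):
    """E(w) = sum_i 1/2 (f(x_i; w) - y_i)^2."""
    residual = predict(X, w) - y
    return 0.5 * np.sum(residual ** 2)

X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 1.0]])   # second row is the constant 1 for the bias
y = np.array([2.0, 4.0, 6.0])
w = np.array([1.0, 0.0])
E = total_error(X, y, w)          # residuals are -1, -2, -3
```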


#### Derivatives
Recall that $f$ was parameterized by $w,$ so to answer the central question (how do we tune the weights $w$ such that $f$ is as close as possible to $g$?) it would be great to know how changing $w$ changes the error function we just defined. 

$$\frac{dE_i}{dw} = \frac{dE_i}{df} \cdot \frac{df}{dw} = (f(x^{(i)}; w) - y^{(i)}) \cdot x^{(i)} $$
$$\nabla E = \frac{dE}{dw} = \frac{dE}{df} \cdot \frac{df}{dw} = \sum_{i=1}^m (f(x^{(i)}; w) - y^{(i)}) \cdot x^{(i)} $$
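
One way to gain confidence in this formula is to compare it with a finite-difference estimate. The sketch below does that for the linear model with a scalar output. The data is random and the function names are illustrative.

```python
import numpy as np

def error(X, y, w):
    return 0.5 * np.sum((w @ X - y) ** 2)

def gradient(X, y, w):
    """Analytic gradient: sum_i (f(x_i; w) - y_i) * x_i."""
    return X @ (w @ X - y)

def numerical_gradient(X, y, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (error(X, y, w + step) - error(X, y, w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))   # n = 3 inputs, m = 20 training cases
y = rng.normal(size=20)
w = rng.normal(size=3)
analytic = gradient(X, y, w)
numeric = numerical_gradient(X, y, w)
```

The two vectors should agree up to rounding error. The $\frac{1}{2}$ in the error is what cancels the factor 2 coming from the square.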

The gradient points in the direction in which the error grows fastest, so its negative gives us a direction that reduces the error. Let's see two methods to use that information.

### Gradient descent
One intuitive idea is that the derivative we just calculated is a good approximation of the optimal path to take (if we restrict ourselves to a small region.) This is related to manifolds [#]. Thus, we update $w$ taking small steps in the direction opposite to $\nabla E.$ Formally, 

$$w := w - \alpha \nabla E.$$

What's happening in the small scale is that for each weight $w_j$
$$w_j := w_j - \alpha \frac{dE}{dw_j}$$
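
A minimal sketch of this update rule on a noiseless linear problem. The learning rate, the number of steps and the "true" weights are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
x_raw = rng.uniform(-1, 1, size=m)
X = np.vstack([x_raw, np.ones(m)])       # shape (2, m): input plus bias row
true_w = np.array([3.0, -1.0])           # illustrative stand-in for g
y = true_w @ X

w = np.zeros(2)
alpha = 0.01                             # learning rate (illustrative value)
for _ in range(2000):
    grad = X @ (w @ X - y)               # nabla E over the whole training set
    w = w - alpha * grad                 # w := w - alpha * nabla E
```

With a learning rate that is too large the same loop diverges, so `alpha` usually needs some tuning.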

#### Stochastic gradient descent
It turns out that the method above is a little inefficient, because it uses the whole training set for every update, while a small batch of training cases already gives a very good idea of the best direction. This is more so in the beginning, when we start with random and bad weights.
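
A sketch of the stochastic variant, under the same assumptions as plain gradient descent on a noiseless linear problem. The batch size, learning rate and number of steps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = np.vstack([rng.uniform(-1, 1, size=m), np.ones(m)])   # shape (2, m)
true_w = np.array([3.0, -1.0])                            # illustrative target
y = true_w @ X

w = np.zeros(2)
alpha = 0.05
batch_size = 10
for step in range(2000):
    idx = rng.integers(0, m, size=batch_size)    # sample a small batch
    Xb, yb = X[:, idx], y[idx]
    grad = Xb @ (w @ Xb - yb) / batch_size       # average gradient over the batch
    w = w - alpha * grad
```

Each update looks at 10 cases instead of 1000, so it is far cheaper, and the noisy direction is still good enough to approach the same weights.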

### Newton's method
Taylor's theorem says that, for a sufficiently smooth (analytic) function $h$ and $x$ close enough to a point $a$ {TODO: prove this}
$$
h(x) = \sum_{n=0}^\infty \frac{h^{(n)}(a) \cdot (x - a)^n}{n!}
$$

We can use this to get a good approximation for $x$ near $a.$
$$
h(x) \approx h(a) + h'(a)(x - a) \quad\quad\quad\quad  (1)
$$

Remember that what we want is to find the minimum value of the function $E(w)$ (that would mean that $f$ is as close as possible to $g.$)

If $\frac{\partial E}{\partial w} = 0$, we know that we are either in a local minimum or maximum. For convex problems [#] we don't have local maxima, so $\frac{\partial E}{\partial w} = 0$ means we are in a local minimum.    
Now, we can reach $\frac{\partial E}{\partial w} = 0$ by iterative steps using (1) with $h = \frac{\partial E}{\partial w}.$ That is, we start at some random point $x_0$ (a set of weights), and we iteratively choose $x_{i+1}$ so that the linear approximation of $h$ around $x_i$ is zero at $x_{i+1}.$

$$
\begin{align}
h(x_{i+1}) \approx h(x_i) + h'(x_i) \cdot (x_{i + 1} - x_i) &= 0 \\
\frac{\partial E}{\partial w} + \frac{\partial}{\partial w} \frac{\partial E}{\partial w} \cdot (x_{i + 1} - x_i) &= 0 \\
H \cdot (x_{i + 1} - x_i) &= -\nabla_w E \\
x_{i + 1} - x_i  &= -H^{-1}\nabla_w E\\
x_{i + 1} &= -H^{-1}\nabla_w E + x_i \\
\end{align}
$$
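
To see the last line of the derivation at work, here is a sketch that applies one Newton step to the squared error of the linear model, where the weights play the role of $x_i.$ For that error the Hessian is $H = XX^T$ and does not depend on the weights, so a single step should already reach zero gradient. The data is random and only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 40))      # n = 3 inputs, m = 40 training cases
y = rng.normal(size=40)

def grad_E(w):
    return X @ (w @ X - y)        # nabla_w E for the squared error

H = X @ X.T                       # Hessian of E (constant for this E)

w = rng.normal(size=3)            # random starting point
# Newton step: solve H * delta = grad instead of forming H^{-1} explicitly.
w = w - np.linalg.solve(H, grad_E(w))
```

For errors that are not quadratic the Hessian changes with the weights, and the step has to be repeated.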

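The code below implements Newton's method in one dimension, where it finds a root of a function $f$ (in our setting, $f$ would be the derivative of the error.) It seems very fast. One problem it has is that we have to calculate the second-order derivatives, and that's pretty costly. If calculating the first-order derivatives is $O(n)$ for $n$ weights, then calculating the second-order derivatives is $O(n^2)$ [#]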

In [2]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import time

In [3]:
f = lambda x: -2 * (x ** 3) - x ** 2  + 3 * x
df = lambda x: -6 * (x ** 2) - 2 * x + 3
fig = plt.figure()

def plot_funs(p, error):
    plt.cla()
    x = np.linspace(-2, 2, 100, endpoint=True)
    fx = f(x)
    dfx = df(p) * x - (df(p) * p - f(p))
    plt.title(f'Error: {abs(error)}')
    plt.plot(x, fx)
    plt.plot(x, dfx)
    plt.axvline(0)
    plt.axhline(0)
    plt.ylim(-15, 15)
    fig.canvas.draw()
    plt.pause(1)

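x = 0.7
# Stop when f(x) is numerically zero; an exact `f(x) != 0` test may never fail in floating point.
for _ in range(50):
    if abs(f(x)) < 1e-10:
        break
    plot_funs(x, f(x))
    x = x - f(x) / df(x)  # Newton step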


### Linear regression
Up to this point, we have built an algorithm called linear regression. We have a bunch of training cases and we are modeling $f(x)$ as a linear function of the inputs. It turns out that there is a closed-form solution to the linear regression problem that takes us directly to the global minimum. However, we can also use gradient descent to arrive at it in small steps. (For more complex versions of $f,$ we won't have the closed-form solution.)
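
A sketch of the closed-form solution, assuming the same column-wise layout for `X` and a scalar output. Setting $\nabla E = 0$ gives the normal equations $(XX^T)w = Xy.$ The data and the noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30
X = np.vstack([rng.uniform(-1, 1, size=m), np.ones(m)])   # shape (n, m)
y = np.array([2.0, 0.5]) @ X + 0.1 * rng.normal(size=m)    # noisy targets

# Normal equations: (X X^T) w = X y.
w_closed = np.linalg.solve(X @ X.T, X @ y)

# np.linalg.lstsq solves the same least-squares problem on the (m, n) matrix.
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

grad_at_solution = X @ (w_closed @ X - y)                  # should be ~0
```

In practice `lstsq` is usually the safer choice, because forming $XX^T$ can be badly conditioned.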

### Hand-crafted features
The function $f(x; w) = w^Tx$ is pretty boring. In 2D, it corresponds to just a line. And there are many datasets that aren't linearly separable.
<div align="center">
    <img src="https://sites.google.com/site/datasciencenotebook1/_/rsrc/1471675067253/binary-classification/kernel-methods/Screen%20Shot%202016-08-20%20at%2012.06.34%20PM.png" style="width: 200px; display: inline"/>
    <img src="https://static.commonlounge.com/fp/original/FuOm0jwxAL6WfSdX9h0Lkj9ka1520492081_kc" style="width: 200px; display: inline"/>
</div>

However, the input to the function $f$ isn't limited to $x$ itself; it could also include any product of the elements of $x.$ The vector $[x_1, x_2, x_1x_2, x_1^2, x_2^2]$ is completely valid. Linear regression will still find a hyperplane, but now in the space of these features. Thanks to the quadratic features, that hyperplane may correspond to a curved boundary in the original space, so the dataset may become separable.
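
A sketch of this idea on a made-up dataset: points are labeled by whether they fall inside the unit circle, which no line in the original 2D space can separate. The weights are set by hand rather than learned, just to show that a separating hyperplane exists in feature space.

```python
import numpy as np

def quadratic_features(x):
    """Map rows [x1, x2] to [x1, x2, x1*x2, x1^2, x2^2, 1]."""
    x1, x2 = x
    return np.vstack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2, np.ones_like(x1)])

rng = np.random.default_rng(0)
points = rng.uniform(-2, 2, size=(2, 200))
labels = (points[0] ** 2 + points[1] ** 2 < 1.0)   # inside the unit circle

# A hyperplane in feature space: 1 - x1^2 - x2^2 > 0.
w = np.array([0.0, 0.0, 0.0, -1.0, -1.0, 1.0])
predicted = (w @ quadratic_features(points)) > 0
```

The rule is linear in the features even though it draws a circle in the original space.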

### Non-linearities
The problem with the hand-crafted features is that they have the word "hand" in their name. It's better if we don't need to hardcode the features, but instead our learning algorithm comes up with the optimal ones. So, the first step towards separating a dataset with something other than a line is to apply a non-linear function $h$ (also called activation function.) Now we have $f(x; w) = h(w^Tx)$ 

#### Sigmoid
$$h(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Advantage: output between 0 and 1 (useful if we want to consider the output as a probability of something happening.)

Disadvantage: the near-zero derivatives in the extremes make learning slower. 

#### Tanh
$$h(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Note that both $e^z$ and $e^{-z}$ are positive. Thus, $|e^z - e^{-z}|$ can't be larger than the denominator. And that's why tanh is bounded between $-1$ and $1.$

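Tanh is usually preferable to sigmoid for hidden units, because its output is centered around zero. Sigmoid is still useful for output units that have to produce a probability.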

Advantage: if the input has mean near zero and most of the inputs are in the linear part of the tanh, then the output will also have a mean near zero.

Disadvantage: the near-zero derivatives in the extremes make learning slower. 

#### ReLU
$$h(z) = max(0, z)$$

Advantage: the derivative is exactly 1 for z > 0, so we don't have near-zero derivatives there.

Disadvantage: we have zero derivative for z < 0. It's not that bad, because we generally have z > 0 for a lot of units.

#### Leaky ReLU
$$h(z) = \begin{cases} z & \text{if } z > 0 \\ 0.001z & \text{otherwise} \end{cases}$$

Advantage: we don't have zero derivative for z < 0.
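
To compare the four activations side by side, the sketch below evaluates their derivatives at a very negative, a zero and a very positive input. It uses the standard logistic sigmoid $1/(1 + e^{-z})$ and the leaky slope $0.001$ from the formula above. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.001):
    return np.where(z > 0, z, slope * z)

z = np.array([-10.0, 0.0, 10.0])

# Derivatives: sigma' = sigma (1 - sigma), tanh' = 1 - tanh^2.
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))
d_tanh = 1 - np.tanh(z) ** 2
d_relu = (z > 0).astype(float)
d_leaky = np.where(z > 0, 1.0, 0.001)
```

At the extremes the sigmoid and tanh derivatives are tiny, the ReLU derivative is exactly 0 on the left, and the leaky ReLU keeps a small slope there.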

#### Remark
It's interesting how every activation function tries to fix the disadvantages of the previous ones. When we solve something, we may get another problem. For instance, when we use tanh to solve the fact that sigmoid outputs aren't centered around zero, we can't interpret the output as a probability anymore.

#### Thinking of another activation function {IDEA}
What if we combine ReLU with tanh at the left?

The problem with tanh/sigmoid is that they have a near-zero derivative at both sides of the function. As ReLU works, we know that it isn't a problem to have zero derivatives at one side of the function. So, I think we could try this:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \tanh(x) & \text{if } x < 0 \end{cases}$$
or
$$f(x) = \begin{cases} .2x + .5 & \text{if } x > 0 \\ s(x) & \text{if } x < 0 \end{cases}$$


### Logistic regression
Similarly to linear regression, if we define $f(x; w) = \sigma(w^Tx)$ we get the algorithm called logistic regression. 

It's interesting that the decision boundary of logistic regression is linear even though we are using a non-linear activation function. I think that happens because we output a 1 if $\sigma(w^Tx + b) > 0.5$ and a 0 otherwise. Since $\sigma$ is increasing and $\sigma(0) = 0.5,$ that condition is the same as $w^Tx + b > 0,$ which is a hyperplane.
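
A quick numerical check of that claim, with arbitrary weights and random points: thresholding the sigmoid at 0.5 and thresholding the linear part at 0 should classify every point the same way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = np.array([1.5, -2.0])        # illustrative weights
b = 0.3                          # illustrative bias
x = rng.normal(size=(2, 1000))   # 1000 random 2D points

z = w @ x + b
by_probability = sigmoid(z) > 0.5
by_linear_rule = z > 0
```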


### Multilayers
Again, this isn't that interesting, because the activation function we use determines the shape of the data we can model. And we want to build an algorithm that works well without assuming anything about the probability distribution the data comes from. [TODO: picture with four cases gaussian/linear good/bad]

So, what we want to do is to compose functions. Note that if we compose linear functions, the output will still be a linear combination of the input. Thus, there's no point in composing linear functions without a non-linearity in between.
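
A small sketch of why stacking linear layers doesn't help: two weight matrices applied in sequence are equivalent to the single matrix $W_2W_1.$ The shapes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # first layer: 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))     # second layer: 4 units -> 2 outputs
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)               # two linear layers in sequence
one_layer = (W2 @ W1) @ x                # a single equivalent linear layer
with_nonlinearity = W2 @ np.tanh(W1 @ x) # no longer reducible to one matrix
```

Only the version with the activation in between can represent something a single layer cannot.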

### Normalization
#### Idea of normalizing
We can picture the error surface as a bowl. This bowl can be elongated or stretched. In those cases, the gradient doesn't point towards the center of the bowl (the minimum.) Instead, it mostly points across the narrow, steep direction of the bowl. We can see this in the diagram below. The lines represent the level curves [#]. The x represents a set of weights and the arrow the gradient.

We don't like shapes like this. We want a circle instead of an ellipse, so that the gradient points directly to the minimum. 

Let's think about what produces these ellipses. Say a neuron has two inputs and the following training cases
$$
x_1 = .1, x_2 = -10, y = 1 \\
x_1 = .1, x_2 = 10, y = -1
$$
This neuron is much more sensitive to the second input than to the first one. Specifically, changing $w_2$ will affect the output (and thus the gradient of the cost) $100$ times more than changing $w_1.$ Thus, we get a shape like the one in the diagram above.

[TODO: say that the same happens when we have x1 and x2 as axes]

When we see the diagram above and notice that the component of the gradient in the w1-axis is pretty low, we could say "easy, let's increase the learning rate!" The problem there is that this would make things worse, because the component of the gradient in the w2-axis will be bigger and we will end up with a lot of oscillations. From this perspective, it makes sense to have a different learning rate for each axis.
https://distill.pub/2017/momentum/



#### Weights
The set of weights that connect layer $l$ to layer $l + 1$ have to be random. Otherwise, if all the weights are the same (in particular, if all the weights are zero) the gradient will be the same for every neuron in layer $l + 1$ and thus all those neurons will compute the same function forever. An alternative approach would be to add noise to the gradient [idea: try this]

What happens if we have a neuron that has 1000 incoming weights? Even if the input is normalized (ie mean=0 variance=1) we will need to make the weights small, otherwise the variance of $\sum_i^n w_ix_i$ will be too large. Specifically, the variance of that sum is directly proportional to (a) the variance of the weights, (b) the variance of the inputs, and (c) the number of connections. Let's say we have (b) and (c) fixed. Then, we want to find the optimal value for (a). 

$$
\begin{align}
Var(\sum_iw_ix_i) &= \sum_iVar(w_ix_i) \tag {1} \\
&= \sum_iVar(w_i)Var(x_i) \tag {2} \\
&= nVar(w)Var(x) \tag {3} \\
\end{align}
$$

In (1) we used the fact that the terms $w_ix_i$ are independent of each other. In (2) we used the fact that $w_i$ and $x_i$ are independent and have zero mean [link to prob theory]. In (3) we used the fact that all variables $w_i$ are identically distributed (and the same applies to $x_i.$)

Thus, if we want $Var(\sum_iw_ix_i) = 1,$ then the best value for $Var(w)$ is $1/(n \cdot Var(x)).$ If we know that the input has unit variance, then the best value is $1/n.$

In [32]:
n = 100
trials = []
for _ in range(5000):
    x = np.random.randn(n) * 2 #Var(x) = 4
    w = np.random.randn(n) / 20 #Var(w) = 1/400
    trials.append(sum(w * x))
np.var(trials) #Var(wx) = n * var(x) * var(w) = 100 * 4 * 1/400 = 1

1.0448629656936872

Notice that when we multiply a random variable by $k$, its variance gets multiplied by $k^2.$ To understand this, let's say the mean was $a$ and there was another point at $b.$ Now, those points land at $ka$ and $kb.$ The distance between them now is $kb - ka = k \cdot (b - a).$ Before that, the distance was $b - a.$ So, the distance is multiplied by $k.$ But the variance measures the squared distance. Thus, the contribution of that point to the variance is $k^2$ times its contribution before. 
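
The scaling rule is easy to check numerically. The distribution and the value of `k` below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=100_000)
k = 3.0

# Ratio of the variance after and before scaling; the rule predicts k^2.
ratio = np.var(k * samples) / np.var(samples)
```

The ratio equals $k^2$ exactly (up to rounding), whatever the sample, because the rule holds for the empirical variance too.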

#### Inputs
Normalize. Decorrelate.

### Cost functions revisited
{TODO: cross entropy}
#### Softmax function
Question: does the softmax change the order of the probabilities?
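
A small experiment for this question, assuming the usual definition $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}.$ Since $e^z$ is increasing and every entry is divided by the same sum, the ranking of the scores should be preserved. The scores are arbitrary.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, -1.0, 0.5, 3.0])
probs = softmax(scores)

order_before = np.argsort(scores)
order_after = np.argsort(probs)
```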

#### Locally weighted algorithm
The training data is noisy. Thus, we don't want to fit outliers, because they may not represent the real distribution of the data we are trying to model. One definition for the error is
$$E(w) = \sum_i(y_i - w^Tx_i)^2$$
Now, we want to find a way to give less importance to points that are outliers and more importance to points near the mean.
$$E(w) = \sum_i \theta_i(y_i - w^Tx_i)^2$$
For $\theta$ we want an always-positive value. Also, we want a high value if $x_i - mean(x)$ is small, and a small value if $x_i - mean(x)$ is large. Note that we only care about the magnitude of $x_i - mean(x),$ not its sign; that's why we have the squared term.
$$\theta = exp(-(x_i - mean(x))^2)$$
Now, the shape of this weighting is fixed. That is, with the expression above for $\theta,$ we can't move from caring a lot about the values that are near the mean to caring the same about every value. That's why we add a parameter $\tau.$ Note that as $\tau$ tends to infinity, $\theta$ tends to 1 and we recover our original error term.
$$\theta = exp\bigg(-\frac{(x_i - mean(x))^2}{2\tau^2}\bigg)$$
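
A sketch of how $\tau$ changes the weighting, for one-dimensional inputs. The data (with one far-away point) and the two values of `tau` are made up.

```python
import numpy as np

def theta(x, tau):
    """exp(-(x_i - mean(x))^2 / (2 tau^2)) for a 1-D array of inputs."""
    return np.exp(-((x - np.mean(x)) ** 2) / (2 * tau ** 2))

x = np.array([0.0, 1.0, 2.0, 3.0, 14.0])   # the last point is far from the mean (4.0)
narrow = theta(x, tau=1.0)                 # outlier gets almost zero weight
wide = theta(x, tau=1e6)                   # every point gets weight ~1
```

With a small `tau` the outlier is effectively ignored. With a very large `tau` all weights are close to 1 and the weighted error matches the unweighted one.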

### Underfitting/overfitting/bias/variance
Overfitting: when the neural net captures regularities in the training data that are not present in the test data.

## Maximum a posteriori
In Maximum Likelihood Estimation, we want to get the parameter $\theta$ that makes the observed data most likely, that is, the one that maximizes $p(y|X;\theta).$ 

MLE: $argmax_\theta \  p(y|X;\theta)$

However, we can also get the parameter that is most likely given the data 

MAP: $argmax_\theta\ p(\theta|X, y)$

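They are not the same in general. By Bayes' rule, $p(\theta|X, y) \propto p(y|X;\theta)\,p(\theta),$ so MAP maximizes the likelihood times a prior over $\theta.$ The two coincide when the prior $p(\theta)$ is uniform.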

page 12 how from \epsilon to p(y|x;thta)
page 11 why trace?

## Margins
https://people.eecs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf
We want to know how far we are from the data.

* Functional margin: for a good fit, if $y = 1$ we want $h_\theta(x) \gg 0,$ and if $y = 0$ we want $h_\theta(x) \ll 0.$
* Geometric margin: the distance between the points and the boundary.

### Functional margin 

* Mathematical programming: optimizing with constraints.
* Linear programming: the objective is a linear function of the variables and the constraints are linear.
* Quadratic programming: the objective is a quadratic function of the variables and the constraints are linear.
* Affine: linear function + intercept term.

Lagrange multipliers: R. T. Rockafellar (1970), Convex Analysis, Princeton University Press

3:30 without lights

### Terms
Manifolds: An n-dimensional manifold is a space that, when you zoom in enough, looks like n-dimensional Euclidean space. For instance, the surface of the Earth is a 2D manifold: although it's a sphere embedded in 3D at the big scale, it resembles 2D Euclidean space in a small region.

Convex optimization: convex problems are the nicest because there is only one local minimum and no local maximum. In a 2D plot, we can picture it as a U. In a 3D plot, we can picture it as a bowl. It doesn't need to be a symmetric bowl. Also, it can be elongated in some axis. Note that this doesn't mean it's a trivial problem. In an elongated bowl, the gradient doesn't point to the global minimum, so we could take a long time to reach the global minimum. What's true is that we would reach the minimum at some point.

$O(\cdot)$: how the number of operations a computer has to perform grows with respect to some input variable. For instance, if we have a function of $n$ variables and we want to calculate all its third-order partial derivatives, there are $n^3$ of them, so we have a complexity of at least $O(n^3).$

Level curves: level curves are a way to represent 3D plots in 2D. In these plots, for every value $k$ in a set of values, we draw a curve that goes through every point $x, y$ in the function where $f(x, y) = k.$ It's good if the values in the set have the same distance between them (eg set = {0, 10, 20, 30}) so we can get an idea of the 3D shape. 

Types of function from set A to set B:
* bijection: every point in A is paired with exactly one point in B, and every point in B is paired with exactly one point in A.
* injection: every point in A is paired with exactly one point in B, and no two points in A are paired with the same point in B.
* surjection: every point in A is paired with exactly one point in B, and every point in B is paired with at least one point in A.

### Notes
[a]: In the past, a human wrote specific code to solve a task (eg to recognize a face, we built an eye recognizer, a nose recognizer, and so on.) In contrast, now we build a neural net that takes input-output pairs and learns the function. (https://medium.com/@karpathy/software-2-0-a64152b37c35)

# To do
Lagrange multipliers


# To process
[TODO: maximum likelihood interpretation]

[TODO: GLM]

try a module that receives two word embeddings and returns one word embedding
Try word embeddings with one/two/three numbers. It should have enough memory (2^32, I think)
In bottou 2011 we use the same space for individual words and for higher concepts. Why don't we use two/three different spaces?
http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

### https://cs.stanford.edu/people/dorarad/mac/blog.html
A neural network learns a direct mapping from input to output, but lacks a good understanding of that mapping. If we change some details in the input, it doesn't work anymore.

Three units
* input unit: create a good distributed representation. For words: first we translate to word embeddings, then we add context to each word.
* recurrent unit:
    * Control: 
        * we want to extract the desired operation. We need to keep track of the past operation, because that seems to help in doing long-term processing. Also, we want to see the global task we are trying to solve. In a QA task, we want to see the whole text of a question. So the first operation is to combine the past operation c_{i-1} with the whole text of the question, biased towards step i. This tells us what operation we are trying to do.
        * Then, we change the basis of the result from above to a linear combination of the words in the question. This acts as a regularizer (and it's also helpful for interpretability.) (A valid output from this unit could be: current object := object that is to the right of the current object. Another could be to create a new identity, object2 := object that has the same color as object1.)
    * Read:
        * we want to retrieve selectively from our knowledge base and the memory from the last iteration. Thus, we attend the knowledge base based on (a) the knowledge base, (b) control output, (c) previous memory output.
        * steps
            * To do this, we compute the interaction between the knowledge base and the previous memory output to see if there are things that are important from the knowledge base in light of the previous memory output.
            * Then, we concatenate the result of that interaction with the knowledge base, to have a fresh version of the kb if the task requires it.
            * Then, we concatenate the (interaction + kb) with the control output. Then we pass that information through a neural net to decide the attention weights on the kb. 
        * in other words, we feed a neural net (a) the kb, (b) the interaction of kb and previous memory, and (c) the control output. Then, that nn decides what to look at in the kb.
    * Write
        * we already computed the new bit of memory r_i in the read unit. However, we would lose information if we just keep r_i. That's why we use the write unit. We want to integrate the new r_i bit of memory with the previous memory, and remain in the same space. That means we need at least one gate 
        * steps
            * we know what we want to combine: the whole memory so far and the new bit of memory. Thus, we concatenate both things and pass the result through a nn.
            * self-attention: we compute the similarity between the current control bit and all the previous control bits. We take that distribution and then we create the weighted average of the memory bits (thus, we will give more importance to the memory bits that had a control bit similar to our current control bit.) Then, we concatenate that average to m_i^{prev} and pass it through a nn. (Why isn't it enough to concatenate them? Possible answer: we want a better representation of the data.)
            * what if we don't need so many steps? We have a gate that, conditioned on the control unit, can just let the previous memory go through without interacting with everything computed so far. (useful if our task is simple and to avoid vanishing gradients.)
          
* output unit

{TODO: try visualizing the value of the gate in the write unit. does it saturate to 0 or 1, or it tends to be between 0.3 and 0.7?}
{TODO: think of/try models that are recursive, so we can invoke (poke) a new set of reasoning units while we are in the original sequence of processing. }
{TODO: I think that the kb here is small. It gets interesting when you have much more information. The problem there seems to be accessing it (attention is inefficient.) So we can try recursion or other methods.}

We have four basic operations
* Concatenation(x, y): it doesn't add interactions. 
* Feedforward_NN(x): computes a better representation of x  
* Elementwise product(x, y): measures the interaction between x and y
* Attention(a, x): computes a weighted version of x. 
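
To make the four operations concrete, here is a rough NumPy sketch. The sizes, the random weights `W`, the single tanh layer and the dot-product scoring inside the attention are all assumptions for illustration, not the exact form used in the MAC paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
y = rng.normal(size=d)

# Concatenation(x, y): puts both vectors side by side, shape (2d,).
concatenated = np.concatenate([x, y])

# Feedforward_NN: one tanh layer with hypothetical weights W.
W = rng.normal(size=(d, 2 * d))
hidden = np.tanh(W @ concatenated)

# Elementwise product(x, y): large where both x and y are large.
interaction = x * y

# Attention(a, x): softmax scores over five items, then a weighted average.
items = rng.normal(size=(5, d))
scores = items @ x                       # similarity of each item to the query x
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attended = weights @ items               # shape (d,)
```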


### https://arxiv.org/pdf/1410.5401.pdf
Addressing mechanisms
* Content-based memories (like hopfield networks): the controller produces an approximation of the memory and compares it with the storage to recover the exact memory. 
* Location-based memories: the content of the memory can be arbitrary. For instance, if we want to calculate the multiplying operation, the two numbers we multiply are arbitrary.
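
A rough sketch of content-based addressing with cosine similarity: a noisy key is compared with every memory row, and a softmax turns the similarities into a soft address. The memory contents, the key and the sharpness parameter `beta` are made up for illustration.

```python
import numpy as np

def cosine_similarity(key, memory):
    """Cosine similarity between a key (d,) and each row of memory (N, d)."""
    return memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key))

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
key = np.array([0.9, 0.1, 0.0])          # noisy approximation of row 0

similarity = cosine_similarity(key, memory)
beta = 10.0                              # hypothetical sharpness parameter
weights = np.exp(beta * similarity)
weights /= weights.sum()                 # soft address over memory rows
recalled = weights @ memory              # weighted read from memory
```

A larger `beta` makes the address closer to a hard lookup of the most similar row.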


Next: Graves et al. (2014; 2016) cosine similarity

### https://arxiv.org/pdf/1102.1808.pdf
Symbolic reasoning leads to a combinatorial explosion, which we don't like. However, it's difficult to achieve causality with probabilistic models. An advantage of probabilistic models over the symbolic approach is that the prob. model is continuous, and the distance between elements is meaningful: if two things are near, they are similar. This helps in generalization, because we can assign meaning to dog using a lot of training cases and transfer some of that meaning to cat even though we didn't see cat being used (the only thing we need is to see a sentence that tells us that cat and dog are similar words.)
If we have only a few training cases for a given task, we can use modules trained for other tasks. In this way, we can create nets that are compositions of modules.

#### Sentence bracketing
There is a problem: how do we concatenate the data? For instance, if we are using words, and modules that take two words and output a higher concept in the same space, then do we go from right to left? Do we build the meaning of the sentences and then concatenate sentence by sentence?
One option is to use a module that, given a vector in the representation space, tells us how meaningful it is. Then, we sum all the intermediate terms. Another option is to apply this module only to the output concept, after all the 2 module concepts. 
Two approaches: beam search (limited memory best-first search) and greedy (take the two with the highest saliency score) 

#### Relation to humans
Short-term: stack of concepts (either lower or higher concepts)
Long-term: parameters of A and R (A decides how to convert lower concepts into higher ones. R tells us how good a given union of concepts is.)


https://arxiv.org/abs/1808.09772


## Notes from CS230
Buckets
First: FC
Second: CNN/images
Third: sequence: rnn, attention
Fourth: rl, everything else.

Types of data
Structured: a table with values (eg predicting house prices)
Unstructured: raw data (eg images or audio)

Why is deep learning working
Old methods didn't improve past some large amount of data. 

Model = architecture + parameters
We can use human performance to compare

anchor positive negative

x-means/k-means

Neural style transfer:
style: I extract non-localized information
content: local information (edges)
We use the following loss |style_s - style_g| + |content_s - content_g| and we use a fixed deep network already trained. Then we backpropagate to the image. (starting with white noise/the content image.) What if we try this on text?

We can create data synthetically by getting three buckets: positive words, negative words, background noise and then mixing them together.

Fourier transform: convert a signal to a weighted sum of sines and cosines.
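
A small sketch with `np.fft.rfft`: we build a signal from one sine and one cosine and recover their frequencies and amplitudes. The signal length, frequencies and amplitudes are arbitrary.

```python
import numpy as np

n = 64
t = np.arange(n) / n
# A sine at frequency 3 (amplitude 2) plus a cosine at frequency 10 (amplitude 0.5).
signal = 2.0 * np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 10 * t)

spectrum = np.fft.rfft(signal)
amplitudes = 2 * np.abs(spectrum) / n     # amplitude of each frequency bin
dominant = np.argsort(amplitudes)[-2:]    # the two strongest frequencies
```

The phase of each complex coefficient tells how the weight is split between the sine and the cosine at that frequency.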

### Logistic regression
Good way to store data
$X \in R^{n \times m}$

$Y \in R^{1 \times m}$

First idea (linear regression): $\sum_i w_i \cdot x_i$ You have a real number, not a probability between 0 and 1. 

Second idea (logistic regression): $\sigma(\sum_i w_i \cdot x_i)$

Now to correct this, we need to know how bad we are, so that's why we need to define a cost function. Now, for logistic regression, if we use the squared cost function we get a non-convex optimization problem (why?) Thus, we use the cross entropy
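
A sketch of the cross-entropy cost for logistic regression with binary labels, assuming the column-wise layout of `X` from above. The tiny dataset and the three weight vectors are made up to show a good, a bad and an uninformed model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, y, w):
    """Mean of -[y log p + (1 - y) log(1 - p)], with p = sigma(w^T x)."""
    p = sigmoid(w @ X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[2.0, -2.0],
              [1.0, 1.0]])        # shape (n, m) = (2, 2), second row is the bias input
y = np.array([1.0, 0.0])

good = cross_entropy(X, y, np.array([3.0, 0.0]))     # confident and right
bad = cross_entropy(X, y, np.array([-3.0, 0.0]))     # confident and wrong
uninformed = cross_entropy(X, y, np.zeros(2))        # always predicts 0.5
```

Confident wrong predictions are punished heavily, while the uninformed model sits at $\log 2.$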

### Calculus
An interesting way to think about derivatives is to think about computation graphs. In these graphs, we have different units, and each unit can be calculated individually and then used as building blocks.
Blog entry about calculus and derivatives.

Why sometimes we have * and sometimes dot product?

### Bias
It's interesting that gradient descent doesn't take into account the value of the biases (at least it doesn't do that directly.) Are we wasting information by not using that value?


### Python recommendations
Don't use data structures that have shape (n,). Instead use (n, 1)
You can use asserts to be sure about the shape of an array.
We can think of every activation function or every layer as a module.
Forward: we get the input from the previous layer, we compute the output, and cache some things
Backward: we get the gradient flow from the next layer, we compute the gradient for the weights and the gradient flow for this layer, and use the cache.
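
A rough sketch of these recommendations: two modules with `forward` and `backward`, a cache, `(n, 1)`-style arrays and shape asserts. The class names and the tiny network are illustrative assumptions, not a reference implementation.

```python
import numpy as np

class Sigmoid:
    """An activation as a module with forward and backward passes."""

    def forward(self, z):
        assert z.ndim == 2, "use (n, 1) or (n, m) arrays, not (n,)"
        self.out = 1.0 / (1.0 + np.exp(-z))   # cache the output for backward
        return self.out

    def backward(self, grad_out):
        assert grad_out.shape == self.out.shape
        return grad_out * self.out * (1.0 - self.out)   # chain rule

class Linear:
    """A layer computing W x, caching its input for the backward pass."""

    def __init__(self, W):
        self.W = W

    def forward(self, x):
        assert x.ndim == 2
        self.x = x                            # cache the input
        return self.W @ x

    def backward(self, grad_out):
        self.grad_W = grad_out @ self.x.T     # gradient for the weights
        return self.W.T @ grad_out            # gradient flow to the previous layer

rng = np.random.default_rng(0)
linear, act = Linear(rng.normal(size=(2, 3))), Sigmoid()
x = rng.normal(size=(3, 1))
out = act.forward(linear.forward(x))
# Backpropagate the gradient of sum(out) with respect to x.
grad_x = linear.backward(act.backward(np.ones_like(out)))
```

Because every module exposes the same two methods, layers can be chained in any order without changing the training loop.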

### Perceptron algorithm
Let's see an example of what we learned, before moving on to a full neural network. Let's build a classifier that outputs either a 0 or 1 and receives n different inputs (eg the pixels of an image.) Formally, our classifier is a function $f: R^n \to \{0, 1\}.$ 

In the plot we see that there are two classes of dots, the blue ones and the red ones. We want a linear classifier that separates them. We can see how the line starts classifying every dot in the same class, and step by step it corrects itself. The green dot is the current dot, which we use to update the line. If the green dot is in the right class, then the orange line won't move. If it isn't, the line will move towards a position that correctly classifies it.

#### Bias
It's interesting to see what happens if we don't add the bias. (To do that, you can assign bias=False in the beginning of the code.) Even if a dataset is linearly separable, it may stop being linearly separable if we add the requirement that the line has to pass through the origin (like the dataset above.) 

In [3]:
fig = plt.figure()
bias = True

def f(x, w):
    return np.dot(w, x)

def plot_data(w, training, x_tr):
    plt.cla()
    for x, y in training:
        color = 'red' if y else 'blue'
        color = 'green' if x == x_tr else color
        plt.scatter(x[0], x[1], color=color)

    x = np.linspace(-1, 2, 100)
    b = w[2] if len(w) == 3 else 0  # without the bias term the line passes through the origin
    y = -(w[0] * x + b) / (w[1] + 1e-8)
    plt.plot(x, y, color='orange')
    plt.ylim(-3, 3)
    fig.canvas.draw()
    plt.pause(1)

w = np.array([1, 1, 1]) if bias else np.array([1, 1])
training = [
    ((1, 1), 1),
    ((1, -2), 0),
    ((-1, 2), 0),
    ((1, 0), 1),
    ((2, 1), 1),
    ((0, 2), 0),
]

for i in range(5):
    for x, y in training:
        plot_data(w, training, x)
        if bias:
            x = np.concatenate((x, [1])) #We add the bias
        if f(x, w) >= 0 and y == 0:
            w -= x
        elif f(x, w) < 0 and y == 1:
            w += x
