## Neural nets 2
These notes cover more advanced concepts about neural nets.

### Normalization
#### Idea of normalizing
We can picture the error surface as a bowl. This bowl can be elongated or stretched. In those cases, the gradient doesn't point towards the center of the bowl (the minimum). Instead, it points mostly across the narrow, steep direction of the bowl. We can see this in the diagram below. The lines represent the level curves [#]. The x represents a set of weights and the arrow the gradient.

We don't like shapes like this. We want a circle instead of an ellipse, so that the gradient points directly to the minimum. 

Let's think about what produces these ellipses. Say a neuron has two inputs and the following training cases:
$$
x_1 = .1, x_2 = -10, y = 1 \\
x_1 = .1, x_2 = 10, y = -1
$$
This neuron is much more sensitive to the second input than to the first one. Specifically, changing $w_2$ changes the neuron's output (and, to first order, the cost) $100$ times more than changing $w_1$ by the same amount, because $|x_2| / |x_1| = 10 / 0.1 = 100.$ Thus, we get a shape like the one in the diagram above.

[TODO: say that the same happens when we have x1 and x2 as axes]

When we see the diagram above and notice that the component of the gradient along the $w_1$-axis is pretty small, we could say "easy, let's increase the learning rate!" The problem is that this would make things worse: the step along the $w_2$-axis also gets bigger, and we end up with a lot of oscillations. From this perspective, it makes sense to have a different learning rate for each axis.
https://distill.pub/2017/momentum/
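
To make the learning-rate argument concrete, here is a minimal sketch that runs plain gradient descent on an elongated quadratic bowl. The bowl $E(w) = \frac{1}{2}(c_1 w_1^2 + c_2 w_2^2),$ the curvatures `0.01` and `100` (assumed here to be the squared input scales from the example above), the starting point, and the name `gradient_descent` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical elongated bowl: E(w) = 0.5 * (c1 * w1**2 + c2 * w2**2).
# The curvatures are assumed values: 0.1**2 and 10**2.
curvature = np.array([0.01, 100.0])

def gradient_descent(learning_rate, steps=50, start=(1.0, 1.0)):
    """Plain gradient descent; learning_rate is a scalar or one value per axis."""
    w = np.array(start, dtype=float)
    for _ in range(steps):
        grad = curvature * w              # dE/dw for the quadratic bowl
        w = w - learning_rate * grad
    return w

small = gradient_descent(0.015)               # stable, but w1 barely moves
large = gradient_descent(0.021)               # above 2 / 100: w2 oscillates and grows
per_axis = gradient_descent(1.0 / curvature)  # one learning rate per axis

print(small, large, per_axis)
```

With these assumed values, the small rate leaves $w_1$ almost where it started, the slightly larger rate makes $w_2$ blow up, and one rate per axis reaches the minimum.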



#### Weights
The set of weights that connect layer $l$ to layer $l + 1$ has to be initialized randomly. Otherwise, if all the weights are the same (in particular, if all the weights are zero), the gradient will be the same for every neuron in layer $l$ and thus all those neurons will compute the same function forever. An alternative approach would be to add noise to the gradient [idea: try this]
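
A small numerical check of the symmetry problem, as a sketch only. The 3-2-1 network, the tanh activation, the squared error, the random data, and the name `hidden_weight_gradient` are assumptions chosen for illustration; they are not taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))   # 8 hypothetical training cases with 3 inputs
y = rng.normal(size=(8, 1))

def hidden_weight_gradient(W1, W2):
    """Gradient of 0.5 * sum(err**2) w.r.t. W1 for a 3 -> 2 -> 1 tanh network."""
    h = np.tanh(x @ W1)                 # hidden activations, shape (8, 2)
    err = h @ W2 - y                    # output error, shape (8, 1)
    dh = (err @ W2.T) * (1 - h ** 2)    # backprop through tanh
    return x.T @ dh                     # shape (3, 2): one column per hidden neuron

# All weights equal: both hidden neurons receive exactly the same gradient.
same = hidden_weight_gradient(np.full((3, 2), 0.5), np.full((2, 1), 0.5))
# Random weights: the columns differ, so the neurons can specialize.
random_init = hidden_weight_gradient(rng.normal(size=(3, 2)), rng.normal(size=(2, 1)))

print(np.allclose(same[:, 0], same[:, 1]))
print(np.allclose(random_init[:, 0], random_init[:, 1]))
```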

What happens if we have a neuron that has 1000 incoming weights? Even if the input is normalized (i.e. mean 0 and variance 1) we will need to make the weights small, otherwise the variance of $\sum_i^n w_ix_i$ will be too large. Specifically, the variance of that sum is directly proportional to (a) the variance of the weights, (b) the variance of the inputs, and (c) the number of connections. Let's say we have (b) and (c) fixed. Then, we want to find the optimal value for (a).

$$
\begin{align}
Var(\sum_iw_ix_i) &= \sum_iVar(w_ix_i) \tag {1} \\
&= \sum_iVar(w_i)Var(x_i) \tag {2} \\
&= nVar(w)Var(x) \tag {3} \\
\end{align}
$$

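In (1) we assumed that the terms $w_ix_i$ are independent of each other, so the variance of the sum is the sum of the variances. In (2) we used the fact that $w_i$ and $x_i$ are independent and have zero mean [link to prob theory]. In (3) we used the fact that all variables $w_i$ are identically distributed (and the same applies to $x_i$).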

Thus, if we want $Var(\sum_iw_ix_i) = 1,$ then the best value for $Var(w)$ is $1/(n \cdot Var(x)).$ If we know that the input has unit variance, then the best value is $1/n.$

```python
import numpy as np

n = 100
trials = []
for _ in range(5000):
    x = np.random.randn(n) * 2   # Var(x) = 4
    w = np.random.randn(n) / 20  # Var(w) = 1/400
    trials.append(np.sum(w * x))
# Var(sum of w*x) = n * Var(x) * Var(w) = 100 * 4 * 1/400 = 1
print(np.var(trials))  # one run printed 1.0448629656936872
```

Notice that when we multiply a random variable by $k$, its variance gets multiplied by $k^2.$ To understand this, let's say the mean was $a$ and there was another point at $b.$ Now, those points land at $ka$ and $kb.$ The distance between them now is $kb - ka = k \cdot (b - a).$ Before that, the distance was $b - a.$ So, the distance increases by a factor of $k.$ But the variance measures the squared distance to the mean. Thus, the contribution of that point to the variance is $k^2$ times its contribution before.
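
The scaling rule can be checked numerically with a few lines. This is only a sketch: the exponential distribution, the sample size, and $k = 3$ are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)   # any distribution works; this one is an arbitrary choice
k = 3.0

ratio = np.var(k * x) / np.var(x)  # should equal k**2 = 9 up to rounding
print(ratio)
```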

#### Inputs
Normalize. Decorrelate.

### Cost functions revisited
{TODO: cross entropy}
#### Softmax function
Question: does the softmax change the order of the probabilities?
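
One way to probe this question, reading it as "does softmax preserve the ranking of its inputs?". Since $exp$ is strictly increasing and the normalization divides every entry by the same positive number, the ranking should be preserved. The sketch below checks this on one hypothetical vector of scores; the function name `softmax` and the values in `logits` are assumptions.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of scores."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # subtracting the max avoids overflow and leaves the result unchanged
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 3.0])   # hypothetical scores
probs = softmax(logits)

# Sorting positions are identical before and after the softmax.
same_order = np.array_equal(np.argsort(logits), np.argsort(probs))
print(probs, same_order)
```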

#### Locally weighted algorithm
The training data is noisy. Thus, we don't want to fit outliers, because they may not represent the real distribution of the data we are trying to model. One definition for the error is
$$E(w) = \sum_i(y_i - w^Tx_i)^2$$
Now, we want to find a way to give less importance to points that are outliers and more importance to points near the mean.
$$E(w) = \sum_i \theta_i(y_i - w^Tx_i)^2$$
For $\theta_i$ we want an always-positive value. Also, we want a high value if $x_i - mean(x)$ is small, and a small value if $x_i - mean(x)$ is large. Note that we only care about the magnitude of $x_i - mean(x),$ not its sign; that's why we have the squared term.
$$\theta_i = exp(-(x_i - mean(x))^2)$$
Now, the shape of this weighting is fixed. That is, with the expression above for $\theta_i,$ we can't move from caring a lot about the values that are near the mean to caring the same about every value. That's why we add a parameter $\tau.$ Note that as $\tau$ tends to infinity, $\theta_i$ tends to 1 and we recover our original error term.
$$\theta_i = exp\bigg(-\frac{(x_i - mean(x))^2}{2\tau^2}\bigg)$$
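
The weighted error above can be minimized in closed form, as in ordinary weighted least squares. Below is a minimal sketch under a few assumptions: the inputs may be vectors, in which case the squared term is read as a squared Euclidean distance to the mean; the data is synthetic; and the name `locally_weighted_fit` is made up for this example.

```python
import numpy as np

def locally_weighted_fit(X, y, tau):
    """Minimize sum_i theta_i * (y_i - w^T x_i)^2 with Gaussian weights around mean(x)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared distance of every x_i to the mean (squared Euclidean norm for vector inputs).
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    theta = np.exp(-sq_dist / (2 * tau ** 2))
    # Setting the gradient to zero gives the weighted normal equations
    # (X^T Theta X) w = X^T Theta y, with Theta = diag(theta).
    Xt_theta = X.T * theta
    w = np.linalg.solve(Xt_theta @ X, Xt_theta @ y)
    return w, theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                                # synthetic inputs
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)   # synthetic noisy targets

w_local, theta_local = locally_weighted_fit(X, y, tau=1.0)
w_flat, theta_flat = locally_weighted_fit(X, y, tau=1e6)    # theta close to 1 everywhere
print(w_local, w_flat)
```

With a very large $\tau$ every $\theta_i$ is close to 1, so `w_flat` should agree with an ordinary least-squares fit on the same data.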

### Underfitting/overfitting/bias/variance
Overfitting: when the neural net captures regularities in the training data that are not present in the test data.

Natural gradient: when we tune our parameters with gradient descent (the changes being given by backprop), the distribution of the output of our net before and after the gradient descent step could change a lot. That is, the KL divergence between the net's output distribution before and after the step could be large or very small. With the natural gradient, we consider every change in the parameters that changes the KL divergence by a given constant, and over all those sets of parameters, we select the one that minimizes the loss function.

### Terms
Level curves: level curves are a way to represent 3D plots in 2D. In these plots, for every value $k$ in a set of values, we draw a curve that goes through every point $(x, y)$ where $f(x, y) = k.$ It's good if the values in the set are evenly spaced (e.g. set = {0, 10, 20, 30}) so we can get an idea of the 3D shape.

Types of function from set A to set B:
* bijection: every point in A is paired with exactly one point in B, and every point in B is paired with exactly one point in A.
* injection: every point in A is paired with exactly one point in B, and no two points in A are paired with the same point in B.
* surjection: every point in B is paired with at least one point in A.
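
For finite sets, the three definitions above can be checked directly. In this sketch a function from A to B is represented as a Python `dict` whose keys are the points of A; the helper names and the example mappings are hypothetical.

```python
def is_injective(mapping):
    """No two points of A are sent to the same point of B."""
    values = list(mapping.values())
    return len(set(values)) == len(values)

def is_surjective(mapping, codomain):
    """Every point of B is hit by at least one point of A."""
    return set(codomain) <= set(mapping.values())

def is_bijective(mapping, codomain):
    """Both injective and surjective."""
    return is_injective(mapping) and is_surjective(mapping, codomain)

B = {"x", "y", "z"}
f = {1: "x", 2: "y", 3: "z"}           # bijection onto B
g = {1: "x", 2: "y"}                   # injective, but "z" is never reached
h = {1: "x", 2: "y", 3: "z", 4: "z"}   # surjective, but 3 and 4 collide

print(is_bijective(f, B), is_injective(g), is_surjective(g, B), is_injective(h))
```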

Fourier transform: convert a signal to a weighted sum of sines and cosines.
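
A small sketch of this idea using NumPy's discrete Fourier transform. The signal (a 5 Hz sine plus a 12 Hz cosine), the sample rate, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

sample_rate = 100                            # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate     # one second of samples
signal = 2.0 * np.sin(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 12 * t)

spectrum = np.fft.rfft(signal)                            # complex weight per frequency
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)   # frequency of each bin, in Hz
amplitudes = 2 * np.abs(spectrum) / len(signal)           # amplitude of each sinusoid

strongest = freqs[np.argsort(amplitudes)[-2:]]            # the two dominant frequencies
print(strongest, amplitudes[5], amplitudes[12])
```

The two dominant frequencies recovered from the spectrum should be the 5 Hz and 12 Hz components that were mixed into the signal, with their original amplitudes.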

When we have derivative(integral(f)) and the limits of the integral don't depend on the parameter we differentiate with respect to, then we can swap integral and derivative: derivative(integral(f)) = integral(derivative(f)). If the limits depend on that parameter, then we have to add extra terms. It makes sense because int(deriv(f)) seems to be losing the constant term in comparison with deriv(int(f)).
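
The extra terms can be written out explicitly. This is the Leibniz integral rule; the symbols $t,$ $a(t)$ and $b(t)$ are introduced here for illustration, and $f$ is assumed to be smooth enough for the rule to apply.

$$
\frac{d}{dt}\int_{a(t)}^{b(t)} f(x, t)\,dx = \int_{a(t)}^{b(t)} \frac{\partial f}{\partial t}(x, t)\,dx + f(b(t), t)\,b'(t) - f(a(t), t)\,a'(t)
$$

When $a$ and $b$ are constants, $a'(t) = b'(t) = 0,$ the last two terms vanish, and we are left with the plain swap of integral and derivative.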