# Deep Learning
* Neural network with more than one layer
* E.g. images of cats, dogs, trucks, etc.
 * need to reliably identify
* We are dealing with **supervised** learning
 * learning using labeled data
* $x \to f(\theta, x) \to \hat{y}$
 * all of the variables are vectors/matrices
 * $x$ is input data
   * can even be images in form [$R_1, G_1, B_1, R_2, G_2, B_2, \dots$]
 * $\theta$ is all of the learnable parameters for the network layer
   * weights
   * biases
 * $\hat{y}$ is predicted output
   * 90% cat, 5% dog, etc.
   * does not actually contain class names, only class 0, class 1, etc.
   * vector of k classes
* $f_{net} = f_3(\theta_3, f_2(\theta_2, f_1(\theta_1, x))) \to$ a "3 layer" neural network
* $f = \sum_i v_i\phi(\vec{w_i}\cdot \vec{x} + b_i)$

### Gradient Descent
* Used to find the needed $\theta$ weights
* Done with backpropagation
* We will be looking at Stochastic GD, aka SGD
* Process is also called **learning**
 * Thus, **machine learning**
$\newcommand{\Lagr}{\mathcal{L}}$
* Need to define learning with a **loss function** $\Lagr(\hat{y}, y)$
 * good prediction $\to$ low $\Lagr$
 * bad prediction $\to$ high $\Lagr$
 * $\Lagr$ **must be** continuous/differentiable (if we want to successfully learn)
* $x \to f(\theta, x) \to \hat{y} \to \Lagr(\hat{y}, y)$
* Can plot the relationship of $\theta$ and $\Lagr$ (make both of them scalars for simplicity)
 * Slope at any $\theta$ is $\dfrac{d\Lagr}{d\theta}\biggr\rvert_{\theta}$
 * Can go down slope
 * $\theta_{new} = \theta_{old} - \mu\dfrac{d\Lagr}{d\theta}\biggr\rvert_{\theta=\theta_{old}}$
* SGD finds the laziest way to optimize $\to$ good at shortcuts
* Only as good as your data
* Can't directly compute minimum
 * hard - many parameters
 * useless - only good for the specific input point
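
The update rule $\theta_{new} = \theta_{old} - \mu\,d\mathcal{L}/d\theta$ can be sketched for a single scalar $\theta$. The quadratic toy loss, the step size `mu = 0.1`, and the function names are illustrative assumptions, not part of the original notes.

```python
# Minimal sketch: gradient descent on one scalar parameter.
# Assumed toy loss L(theta) = (theta - 3)^2, whose minimum is at theta = 3.

def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # dL/dtheta, derived by hand for the toy loss above
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, mu=0.1, steps=100):
    # Repeatedly step down the slope: theta_new = theta_old - mu * dL/dtheta
    for _ in range(steps):
        theta = theta - mu * grad(theta)
    return theta

theta_final = gradient_descent(theta=-5.0)
```

Each step shrinks the distance to the minimum by a constant factor here, which is a property of this toy loss rather than of gradient descent in general.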

## Regression
* Outputs are numbers (presumably between 0 and 1)
* Represent a value
* Examples:
 * x-position of the face of a person
 * value of a sin(x) function
 * expected lifespan

### Universal Function Approximator
* Regression
* $x \to f_{NN}(\theta, x) \to \hat y$
* 2-layer "NN"
 * 1-N-1 network
 * Multiplications:
   * (M, 1) x (1, N) x (N, 1) = (M, 1)
   * (M, 1) $\to$ (M, N) $\to$ (M, 1)

## Classification
* Outputs represent classes
 * Higher number = more likelihood
* Use softmax and cross_entropy
* Examples:
 * Cats/dogs
 * Survived/died in the Titanic incident

### Tendril classifier
* Classification
* 2-layer "NN"
 * 2-N-3
 * Multiplications:
   * (M, 2) x (2, N) x (N, 3) = (M, 3)
   * (M, 2) $\to$ (M, N) $\to$ (M, 3)
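
A forward pass through a 2-N-3 network might look like the sketch below. The batch size, hidden width, random weights, and ReLU hidden activation are all assumptions; the point is only the shapes and the "highest score wins" rule.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 8  # M points, N hidden neurons (arbitrary choices)

x = rng.normal(size=(M, 2))    # (M, 2): each row is one 2-D point
W1 = rng.normal(size=(2, N))   # (2, N) first-layer weights
b1 = np.zeros(N)
W2 = rng.normal(size=(N, 3))   # (N, 3) second-layer weights
b2 = np.zeros(3)

hidden = np.maximum(x @ W1 + b1, 0.0)  # (M, 2) -> (M, N), ReLU activation
scores = hidden @ W2 + b2              # (M, N) -> (M, 3), one score per class

# The class with the highest score is the prediction for each point.
predicted_class = scores.argmax(axis=1)  # (M,) integers in {0, 1, 2}
```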

## Activation functions
* Sigmoid
* Tanh
* ReLU
* Softmax
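
For reference, the four activations listed above can be written in a few lines of NumPy. This is a plain sketch, not a library implementation; subtracting the maximum inside `softmax` is a common numerical-stability trick added here as an assumption.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive values through, clamps negatives to 0
    return np.maximum(z, 0.0)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1 (last axis)
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 2.0])
probs = softmax(z)
```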

## Loss functions
* L1 loss: $\sum \limits_{i=1}^N|\hat {y_i} - y_i|$
* L2 loss: $\sum \limits_{i=1}^N(\hat {y_i} - y_i)^2$ (similar to MSE)
* cross-entropy: $-\sum \limits_{i=1}^N y_i\log(\hat {y_i})$ (goes well with softmax)
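
The three losses translate directly into NumPy. In this sketch `y` is assumed to be a one-hot label vector and `y_hat` a vector of predicted probabilities; the small `eps` guarding against `log(0)` is an added assumption, not part of the formula.

```python
import numpy as np

def l1_loss(y_hat, y):
    return np.sum(np.abs(y_hat - y))

def l2_loss(y_hat, y):
    return np.sum((y_hat - y) ** 2)

def cross_entropy(y_hat, y, eps=1e-12):
    # y is one-hot, y_hat holds probabilities; eps avoids log(0)
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([1.0, 0.0, 0.0])         # true class is class 0
good = np.array([0.90, 0.05, 0.05])   # confident and correct
bad = np.array([0.05, 0.90, 0.05])    # confident and wrong

good_ce = cross_entropy(good, y)
bad_ce = cross_entropy(bad, y)
```

As expected from "good prediction $\to$ low loss", `good_ce` comes out smaller than `bad_ce`.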

# Neural Networks
* Can think of previous examples as 2-layer neural networks
* Each neuron performs $\phi(\vec x \cdot \vec{w_i} + b_i)$
* Dense layers
 * all neurons see all neurons from their preceding layer
* Can have way more than 2 layers!
 * Example with 2-12-6-3
   * (M, 2) x (2, 12) x (12, 6) x (6, 3) = (M, 3)
   *   $x$  x  $W_1$  x  $W_2$  x  $W_3$ = $\hat y$
* Can have various activation functions
 * using only linear activation functions causes the network to collapse into a single linear layer
 * ReLU allows layers to ignore some inputs
   * can still learn fast even when the prediction is strongly wrong
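
The collapse of a purely linear network can be checked numerically with the 2-12-6-3 shapes from above. The random weights and the batch size of 5 are assumptions for illustration; biases are omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))     # (M, 2) with M = 5
W1 = rng.normal(size=(2, 12))
W2 = rng.normal(size=(12, 6))
W3 = rng.normal(size=(6, 3))

# Three layers with linear (identity) activations...
deep = x @ W1 @ W2 @ W3         # (5, 3)

# ...equal one layer whose weight matrix is the product of the three.
W_single = W1 @ W2 @ W3         # (2, 3)
shallow = x @ W_single          # (5, 3)

# A nonlinearity between layers prevents this merging of weight matrices.
nonlinear = np.maximum(np.maximum(x @ W1, 0.0) @ W2, 0.0) @ W3
```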

### Image classification
* We aren't bound to a small dimensionality of data
* Can use enough dimensions for an entire image!!!
 * cifar-10: 32x32x3 = 3072 dimensions!!!
 * can flatten the image to make each a vector in $\mathbb{R}^{3072}$ space
* **But**, there's a caveat...
 * Having high dimensionality causes many connections
 * Many connections cause the network to learn worse
 * So, what do we do?..
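
To make the dimensionality concrete, here is a sketch that flattens a batch of CIFAR-10-sized images. The random pixel values stand in for real data, and the 10 output classes are taken from CIFAR-10 itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of 4 CIFAR-10 images: 32 x 32 pixels, 3 color channels
images = rng.integers(0, 256, size=(4, 32, 32, 3))

# Flatten each image into one long vector
flat = images.reshape(len(images), -1)   # (4, 3072)

# A single dense layer mapping to 10 classes already needs this many weights
n_weights = flat.shape[1] * 10
```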

# Convolutional Neural Networks

### Before:
 * $x \cdot W \to$ (1, 3072) x (3072, 10) = (1, 10)
 * equiv. to $(\vec x \cdot \vec{w_0}, \vec x \cdot \vec{w_1}, \vec x \cdot \vec{w_2}, \dots)$
   * Dot product describes similarity of vectors
   * $\vec x \cdot \vec y = \lVert\vec x\rVert\lVert\vec y\rVert\cos\theta$
   * In a 1-layer, weights would represent a class (i.e. cat, dog) mask
   * How much of it is a cat/dog?
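
The "dot product as similarity" idea can be illustrated by solving $\vec x \cdot \vec y = \lVert\vec x\rVert\lVert\vec y\rVert\cos\theta$ for $\cos\theta$. The example vectors are made up for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * x                  # parallel to x
orthogonal = np.array([3.0, 0.0, -1.0])   # x . orthogonal = 3 + 0 - 3 = 0
opposite = -x                             # anti-parallel to x

sims = [cosine_similarity(x, other)
        for other in (same_direction, orthogonal, opposite)]
```

The three similarities come out as 1, 0, and -1 (up to rounding), which is why a weight vector that "looks like" a class produces a large score for inputs of that class.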

## Convolutions
* technically, cross-correlations, but eh..
* Move **kernel**
$$
\begin{bmatrix}
    3&8 \\
    1&0 \\
\end{bmatrix}
$$
* Over **image**
$$
\begin{bmatrix}
    0&2&3&6&9 \\
    8&4&1&6&2 \\
    1&0&2&2&8 \\
    8&6&0&6&3 \\
    1&5&4&9&1 \\
\end{bmatrix}
$$
* Kernel at (0,0)
 * $3\cdot0 + 8\cdot2 + 1\cdot8 + 0\cdot4 = 24$
* Kernel at (0,1)
 * $3\cdot2 + 8\cdot3 + 1\cdot4 + 0\cdot1 = 34$ 
* Kernel at (1,0)
 * $3\cdot8 + 8\cdot4 + 1\cdot1 + 0\cdot0 = 57$ 
* Kernel at (1,1)
 * $3\cdot4 + 8\cdot1 + 1\cdot0 + 0\cdot2 = 20$
* $\dots$

```python
import numpy as np
from mygrad.nnet.layers.utils import sliding_window_view

img = np.array([[0,2,3,6,9],
                [8,4,1,6,2],
                [1,0,2,2,8],
                [8,6,0,6,3],
                [1,5,4,9,1]])
kernel = np.array([[3, 8],
                   [1, 0]])
# views has shape (4, 4, 2, 2): one 2x2 window per kernel position
views = sliding_window_view(img, window_shape=(2, 2), step=(1, 1))
# multiply every window by the kernel, then sum over the two window axes
np.sum(views * kernel, axis=(2,3))
```

```
array([[24, 34, 58, 96],
       [57, 20, 53, 36],
       [11, 22, 22, 76],
       [73, 23, 52, 51]])
```

* Can also skip over some with stride
* Output shape: $\dfrac{x-w}{s}+1$ for input size $x$, kernel size $w$, and stride $s$
 * has to be an integer
* Stride and kernel can also be rectangular
* MyGrad can perform convolutions in **N dimensions!**
* $(N, C, H, W) \circledast (F, C, H_f, W_f) \to (N, F, H', W')$
 * $N$ is number of images
 * $C$ is number of channels
 * $F$ is number of different kernels
 * $H$ is image height
 * $W$ is image width
 * $H_f$ is kernel height
 * $W_f$ is kernel width
 * $H' = \dfrac{H-H_f}{s}+1$
 * $W' = \dfrac{W-W_f}{s}+1$
 * Each kernel has C channels $\to$ channel dimension collapses
 * Replaced by F dimension
* F dimension essentially replaces the number of neurons
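
The shape rule $(N, C, H, W) \circledast (F, C, H_f, W_f) \to (N, F, H', W')$ can be sketched with a deliberately naive, loop-based NumPy version. This is not how MyGrad implements it; `out_size` and `conv2d` are hypothetical helper names, and a single stride is assumed for both spatial axes.

```python
import numpy as np

def out_size(x, w, s):
    """Output length (x - w) / s + 1; must be a whole number."""
    if (x - w) % s != 0:
        raise ValueError("kernel and stride do not tile the input evenly")
    return (x - w) // s + 1

def conv2d(images, kernels, stride=1):
    """Naive sliding-window product: (N, C, H, W) with (F, C, Hf, Wf)."""
    N, C, H, W = images.shape
    F, Ck, Hf, Wf = kernels.shape
    assert C == Ck, "kernels must have as many channels as the images"
    Ho, Wo = out_size(H, Hf, stride), out_size(W, Wf, stride)
    out = np.zeros((N, F, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            r, c = i * stride, j * stride
            patch = images[:, :, r:r + Hf, c:c + Wf]   # (N, C, Hf, Wf)
            # Sum over channel and window axes for every (image, kernel) pair,
            # which is where the channel dimension collapses.
            out[:, :, i, j] = np.einsum("nchw,fchw->nf", patch, kernels)
    return out

# The 5x5 image and 2x2 kernel from the worked example, as 1 image, 1 channel
img = np.array([[0, 2, 3, 6, 9],
                [8, 4, 1, 6, 2],
                [1, 0, 2, 2, 8],
                [8, 6, 0, 6, 3],
                [1, 5, 4, 9, 1]])
kernel = np.array([[3, 8],
                   [1, 0]])
result = conv2d(img[None, None], kernel[None, None])   # (1, 1, 4, 4)

# A CIFAR-like batch: 2 RGB images, 10 kernels of size 5x5, stride 3
batch_out = conv2d(np.zeros((2, 3, 32, 32)), np.zeros((10, 3, 5, 5)), stride=3)
```

For the batch, $H' = W' = (32-5)/3+1 = 10$, so `batch_out` has shape `(2, 10, 10, 10)`.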

### Backpropagation
* Mathematically, dependent on convolved parameters
* Practically, slightly wiggle parameters and observe changes
* Results line up!
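
The "wiggle and observe" check can be sketched as a finite-difference comparison. The toy loss (sum of squared outputs), the step `h`, and all function names are assumptions for illustration; the analytic gradient is derived by hand for this toy loss only.

```python
import numpy as np

def correlate2d_valid(img, kernel):
    # Slide the kernel over the image with stride 1 and no padding
    Hf, Wf = kernel.shape
    Ho, Wo = img.shape[0] - Hf + 1, img.shape[1] - Wf + 1
    out = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[i, j] = np.sum(img[i:i + Hf, j:j + Wf] * kernel)
    return out

def loss(img, kernel):
    # Assumed toy loss: sum of squared outputs
    return np.sum(correlate2d_valid(img, kernel) ** 2)

def analytic_grad(img, kernel):
    # dL/dk[a, b] = sum_ij 2 * out[i, j] * img[i + a, j + b]
    out = correlate2d_valid(img, kernel)
    Ho, Wo = out.shape
    grad = np.zeros_like(kernel, dtype=float)
    for a in range(kernel.shape[0]):
        for b in range(kernel.shape[1]):
            grad[a, b] = np.sum(2.0 * out * img[a:a + Ho, b:b + Wo])
    return grad

def numeric_grad(img, kernel, h=1e-5):
    # Wiggle each kernel entry by +/- h and measure the change in the loss
    grad = np.zeros_like(kernel, dtype=float)
    for idx in np.ndindex(*kernel.shape):
        up = kernel.copy()
        up[idx] += h
        down = kernel.copy()
        down[idx] -= h
        grad[idx] = (loss(img, up) - loss(img, down)) / (2.0 * h)
    return grad

img = np.array([[0, 2, 3, 6, 9],
                [8, 4, 1, 6, 2],
                [1, 0, 2, 2, 8],
                [8, 6, 0, 6, 3],
                [1, 5, 4, 9, 1]])
kernel = np.array([[3.0, 8.0],
                   [1.0, 0.0]])

grads_match = np.allclose(analytic_grad(img, kernel),
                          numeric_grad(img, kernel), rtol=1e-4)
```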

## Maxpooling
* Same as convolutions, but returns max
* One value over each channel
* Typically done with strides, so dimensionality is reduced
 * e.g. a 2x2 maxpooling layer with stride 2 will output $\dfrac{x-2}{2}+1 = \dfrac{x}{2}$
* Typically follow convolutions
 * don't have to, can be entirely replaced with strided convolutions
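
A 2x2 max pool with stride 2 can be sketched on a single channel as follows. `maxpool2d` is a hypothetical helper written for this example, and the input is the 4x4 convolution output from earlier.

```python
import numpy as np

def maxpool2d(x, pool=2, stride=2):
    """Take the max over each pool x pool window of a 2-D array."""
    H, W = x.shape
    Ho = (H - pool) // stride + 1
    Wo = (W - pool) // stride + 1
    out = np.empty((Ho, Wo), dtype=x.dtype)
    for i in range(Ho):
        for j in range(Wo):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + pool, c:c + pool].max()
    return out

feature_map = np.array([[24, 34, 58, 96],
                        [57, 20, 53, 36],
                        [11, 22, 22, 76],
                        [73, 23, 52, 51]])
pooled = maxpool2d(feature_map)
# pooled is [[57, 96],
#            [73, 76]]
```

The 4x4 input shrinks to 2x2, matching $\dfrac{x}{2}$ for $x = 4$.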