- **Remember:** A model is a mapping from the feature space to the target space.
- NNs dominate in tasks with structured data, i.e. data that exhibits some structure. For example:
    - Images (there is an order among pixels).
    - Natural language processing (there is an order among words).
    - Text (there is an order).
    
    
- Example of non-structured data: Titanic. In this case it may be better and faster to train an RF or Gradient Boosting model instead of an NN.

# 0. Network Structure

![alt text](https://cs231n.github.io/assets/nn1/neural_net2.jpeg)

All connection strengths for a layer can be stored in a single matrix. For example, the first hidden layer’s weights `W1` would be of size `[4x3]`, and the biases for all units would be in the vector `b1`, of size `[4x1]`. Here, every single neuron has its weights in a row of `W1`, so the matrix vector multiplication `np.dot(W1,x)` evaluates the activations of all neurons in that layer. Similarly, `W2` would be a `[4x4]` matrix that stores the connections of the second hidden layer, and `W3` a `[1x4]` matrix for the last (output) layer.
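The layer-by-layer computation described above can be sketched in NumPy. This is only an illustration, not code from the course: the random weights, the input `x`, and the choice of sigmoid as the activation `f` are assumptions made so that the shapes can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    # Sigmoid activation (an assumption; any nonlinearity would do here).
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((3, 1))                          # input column vector, 3 features

W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))   # first hidden layer
W2, b2 = rng.standard_normal((4, 4)), np.zeros((4, 1))   # second hidden layer
W3, b3 = rng.standard_normal((1, 4)), np.zeros((1, 1))   # output layer

h1 = f(np.dot(W1, x) + b1)    # (4, 1): activations of the first hidden layer
h2 = f(np.dot(W2, h1) + b2)   # (4, 1): activations of the second hidden layer
out = np.dot(W3, h2) + b3     # (1, 1): output, no activation on the last layer

print(h1.shape, h2.shape, out.shape)
```

Each row of a weight matrix holds the weights of one neuron, so a single matrix-vector product evaluates a whole layer.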

For **each neuron** in the $i$-th fully connected layer, the input size is $\text{n_input}_i$ and the output size is $1$ (a neuron outputs a single scalar: the activation of its weighted sum).

In practice, however, the forward pass is computed for the whole $i$-th layer at once using matrix-vector operations.


$\implies$ for the **$i$-th layer as a whole**, the input size is $\text{n_input}_i$ and the output size is $\text{n_output}_i$, where $\text{n_output}_i =$ number of neurons in the $i$-th layer.


- That is, for a given $i$-th layer $$\text{W}_i\text{.shape} = (\text{n_output}_i, \text{n_input}_i) = (\text{n_input}_{i+1}, \text{n_output}_{i-1})$$.
    - Of course, for a fully connected layer, each neuron gets the same number of inputs.


- The original data set $X = X^{(0)}$ (the input layer is not counted) has shape $(\text{n_instances}, \text{n_features}) = (\text{n_instances}, \text{n_input}_1)$.


- If $X^{(i-1)}$ is the dataset transformed after passing through the $(i-1)$-th layer, then $$\text{X}^{(i-1)}\text{.shape} = (\text{n_instances}, \text{n_input}_i) = (\text{n_instances}, \text{n_output}_{i-1})$$


- Therefore the transformation at the $i$-th layer has to be of the form $$X^{(i)} = X^{(i-1)}W_i^{\top} + b_i^{\top}$$ where $b_i^{\top}$ is broadcast over the rows (instances) and $$X^{(i-1)}W_i^{\top}\text{.shape} = (\text{n_instances}, \text{n_input}_i)(\text{n_input}_i, \text{n_output}_i) = (\text{n_instances}, \text{n_output}_i)$$


- $b_{ij}$ is the bias term of the $j$-th neuron in the $i$-th layer. It is just a number $\implies b_i\text{.shape} = (\text{number of neurons in the }i\text{-th layer}, 1) = (\text{n_output}_i, 1)$.
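
The shape bookkeeping above can be checked numerically. This is a minimal sketch with hypothetical sizes (`n_instances = 5`, `n_features = 3`, two layers with 4 and 2 neurons) and random values. The activation is left out so that only the shapes are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances, n_features = 5, 3
layer_sizes = [4, 2]                                  # n_output_1, n_output_2 (hypothetical)

X = rng.standard_normal((n_instances, n_features))    # X^(0)
n_input = n_features
shapes = []
for n_output in layer_sizes:
    W = rng.standard_normal((n_output, n_input))      # W_i.shape = (n_output_i, n_input_i)
    b = rng.standard_normal((n_output, 1))            # one bias per neuron
    # b.T has shape (1, n_output_i) and is broadcast over all instances (rows).
    X = X @ W.T + b.T                                 # X^(i).shape = (n_instances, n_output_i)
    shapes.append((W.shape, b.shape, X.shape))
    n_input = n_output                                # n_input_{i+1} = n_output_i

for s in shapes:
    print(s)
```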

# 1. Until now.

- Until now, we have been the ones who had to come up with feature engineering and feature extraction:

![alt text](https://i.ibb.co/CzWWhkk/Screen-Shot-2020-11-02-at-18-24-14.png)

---

# 2. What we want.

- But we would like feature extraction to be done automatically. This means that feature extraction has to be parameterized:

![alt text](https://i.ibb.co/KWqvZQN/Screen-Shot-2020-11-02-at-18-26-39.png)

---

# 3. How to parameterize the feature extraction?
- This can be done by using a linear model + sigmoid function:
    ![alt text](https://i.ibb.co/tJ9Q4vC/Screen-Shot-2020-11-02-at-18-28-23.png)

- That is, we apply a nonlinear transformation between the first and the second model.
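
One way to picture this parameterized feature extraction is as two stacked linear models with a sigmoid in between. The sketch below is only an illustration: the sizes, the random parameters, and the names `extract_features` and `predict` are hypothetical and not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(X, W1, b1):
    # First linear model followed by a sigmoid: learned, parameterized features.
    return sigmoid(X @ W1.T + b1)

def predict(X, W1, b1, w2, b2):
    # Second linear model applied on top of the extracted features.
    H = extract_features(X, W1, b1)
    return H @ w2 + b2

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                      # 6 objects, 3 raw features
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)    # 4 extracted features
w2, b2 = rng.standard_normal(4), 0.0

H = extract_features(X, W1, b1)
y_hat = predict(X, W1, b1, w2, b2)
print(H.shape, y_hat.shape)
```

Because `W1` and `b1` are ordinary parameters, they can be fitted together with `w2` and `b2` instead of being designed by hand.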

---

# 4. How can we perform a nonlinear transformation?

- So-called activation functions are used for this.

![alt text](https://i.ibb.co/PNWh3Tg/Screen-Shot-2020-11-02-at-18-33-07.png)

---

# 5. Notation.

- An NN is a sequence of transformations that maps an object from the original feature space into some target space and then into a class label.


- Layer – a building block for NNs:
    - Dense/Linear/FC layer: $f(x) = Wx+b$.
    - Nonlinearity layer: $f(x) = \sigma(x)$.
    - Input layer - the representation of the data in the original features.
    - Output layer - the representation of the data in the target form. For example, the output layer produces the class label.
    - A few more we will cover later.


- Activation function – function applied to layer output
    - sigmoid.
    - $\tanh$.
    - $\mathrm{ReLU}$.
    - Any other function to get nonlinear intermediate signal in NN.


- Backpropagation – a fancy word for the chain rule.

# 6. Backpropagation.

- Each subsequent layer is obtained by applying some transformation to the output of the previous layers. That is, between layers there is a transformation function that depends on some parameters $\implies$ there is a chance that we can learn these parameters.
- To learn them, we use the error backpropagation method.
- It lets us compute, step by step, the gradients for each layer of our network.

![alt text](https://i.ibb.co/gVwftcR/Screen-Shot-2020-11-02-at-23-26-08.png)
![alt text](https://i.ibb.co/bPSm7Yc/Screen-Shot-2020-11-02-at-22-08-50.png)



Backpropagation in the case of linear regression:

![alt text](https://i.ibb.co/fDhFR2r/Screen-Shot-2020-11-02-at-23-31-16.png)
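
As a rough sketch of what the chain rule gives for linear regression, assume the model $\hat{y} = Xw + b$ with a mean squared error loss (the exact loss on the slide may differ). The analytic gradients are compared with a finite-difference estimate. All data and names below are made up for illustration.

```python
import numpy as np

def loss(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2)

def gradients(X, y, w, b):
    # Forward pass, keeping the intermediate value.
    residual = X @ w + b - y
    # Backward pass: dL/d(residual) = 2 * residual / n, then the chain rule
    # through y_hat = X w + b.
    d_residual = 2.0 * residual / len(y)
    grad_w = X.T @ d_residual
    grad_b = d_residual.sum()
    return grad_w, grad_b

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
w, b = rng.standard_normal(3), 0.5

grad_w, grad_b = gradients(X, y, w, b)

# Numerical check of dL/dw[0] with a central difference.
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numeric = (loss(X, y, w + e0, b) - loss(X, y, w - e0, b)) / (2 * eps)
print(grad_w[0], numeric)
```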


---


# 7. Backpropagation Example 1.

![alt text](https://i.ibb.co/BPZx2dH/Screen-Shot-2020-11-09-at-22-53-37.png)
![alt text](https://i.ibb.co/bbH5NgJ/Screen-Shot-2020-11-09-at-22-53-53.png)

![alt text](https://i.ibb.co/X3p6v8h/Screen-Shot-2020-11-02-at-23-45-30.png)
![alt text](https://i.ibb.co/9rn8t4x/Screen-Shot-2020-11-02-at-23-46-41.png)

- Why does it matter that we know the numeric values of the gradients?
    - Because our function is differentiable, we know the analytic derivative of each of the transformations performed.
    
    
- What happens if $x$, $y$ and $z$ are vectors?
    - We do the same thing.
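
A minimal sketch of such a computation, assuming the example in the figures is the common textbook function $f(x, y, z) = (x + y)z$. The figures are not reproduced here, so both the function and the input values are assumptions. The same code works unchanged when `x`, `y` and `z` are NumPy arrays of equal shape, with the operations applied elementwise.

```python
import numpy as np

def forward_backward(x, y, z):
    # Forward pass.
    q = x + y          # intermediate node
    f = q * z          # output node
    # Backward pass (chain rule), starting from df/df = 1.
    # For vectors, a seed of ones corresponds to differentiating sum(f).
    df = np.ones_like(f, dtype=float)
    dq = z * df        # d(q*z)/dq = z
    dz = q * df        # d(q*z)/dz = q
    dx = 1.0 * dq      # d(x+y)/dx = 1
    dy = 1.0 * dq      # d(x+y)/dy = 1
    return f, dx, dy, dz

# Scalars
print(forward_backward(-2.0, 5.0, -4.0))

# Vectors
xv, yv, zv = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
print(forward_backward(xv, yv, zv))
```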
    
---


# 8. Backpropagation Example 2.

![alt text](https://i.ibb.co/8Psr8qy/Screen-Shot-2020-11-09-at-22-34-59.png)

- Backward red arrows: the local gradient of the node being passed.
- Let the $k$-th node perform the operation $f_k$, producing the value $v_k$. Then by construction $f_k(v_{k-1}) = v_k$.
- Let $d_j$ be the derivative value after node $j$ (traversing from left to right).
    - If $n$ is the last node, then $d_n = \frac{df}{df} = 1$.
- Backpropagation from node $i$ to node $i-1$ (for $i = n, n-1, n-2, \ldots, 1$):
    - We know the derivative value after node $i$: $d_i$.
    - The backward red arrow passing node $i$ is just $f'_i$.
    - Calculate $f'_i(v_{i-1})$.
    - Then $d_{i-1} = f'_i(v_{i-1})\cdot d_i$.
    
    - **Example:**
        - $d_n = 1$.
        - $f'_n = -\frac{1}{x^2}$.
        - $v_{n-1} = 1.37$.
        - $f'_n(v_{n-1}) = -\frac{1}{1.37^2} = -0.53$.
        - $d_{n-1} = f'_n(v_{n-1})\cdot d_n = -0.53 \cdot 1 = -0.53$.
        
    - **Notes:**
        - If node $i$ is $\cdot (-1) \implies$ node $i$ is $-x \implies f'_i = -1$.
        - If node $i$ is $+ \implies$ node $i$ is $x + a \implies f'_i = 1$.
        - If node $i$ is $* \implies$ node $i$ is $x\cdot a \implies f'_i = a$.
            - $*$ is a binary operation, i.e. $\text{left} * \text{right}$ so:
                - $d_{\text{left}} = f'_i(v_{\text{left}})\cdot d_i = v_{\text{right}}\cdot d_i$.
                - $d_{\text{right}} = f'_i(v_{\text{right}})\cdot d_i = v_{\text{left}}\cdot d_i$.
                - We swap the values (green arrows).

![alt text](https://i.ibb.co/DWM0kQq/Screen-Shot-2020-11-03-at-01-25-24.png)
![alt text](https://i.ibb.co/yfVsPsS/Screen-Shot-2020-11-03-at-01-25-30.png)
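
The procedure can be traced in code. This sketch assumes the figure shows the standard CS231n circuit $f(w, x) = 1 / (1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)})$ with inputs $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$. These values are an assumption, chosen because they reproduce the $1.37$ and $-0.53$ quoted in the example.

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed inputs

# Forward pass: store the value v_k produced by every node.
a = w0 * x0          # *
b = w1 * x1          # *
c = a + b            # +
d = c + w2           # +
e = -d               # *(-1)
g = math.exp(e)      # exp
h = g + 1.0          # +1
f = 1.0 / h          # 1/x

# Backward pass: d_{i-1} = f'_i(v_{i-1}) * d_i, starting from d_n = 1.
df = 1.0
dh = (-1.0 / h**2) * df        # 1/x node
dg = 1.0 * dh                  # +1 node
de = math.exp(e) * dg          # exp node
dd = -1.0 * de                 # *(-1) node
dc, dw2 = 1.0 * dd, 1.0 * dd   # + node passes the gradient to both inputs
da, db = 1.0 * dc, 1.0 * dc
dw0, dx0 = x0 * da, w0 * da    # * node: swap the input values
dw1, dx1 = x1 * db, w1 * db

print(round(h, 2), round(dh, 2))
print([round(v, 2) for v in (dw0, dx0, dw1, dx1, dw2)])
```

As a sanity check, the gradient reaching `w2` equals $f(1 - f)$, the well-known derivative of the sigmoid.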

### <font color=green>WHY IS DOING BACKPROPAGATION USEFUL?</font>

---


# 9. Why do we need sequential transformations?

Because:

1) Each preceding transformation produces a more complex feature space, since we took the original space and transformed it in a nonlinear way.

2) If we choose the parameters of these transformations well, we can find a more informative feature space.

---

# 10. Activation Functions: Recap.

- Activation functions are applied after a linear or quasi-linear transformation of the input data.
- The activation function gives us nonlinearity. **<font color=red>Why don't we parameterize them?</font>**

---

## 10.1. Sigmoid Function.

![alt text](https://i.ibb.co/f8x4Lqg/Screen-Shot-2020-11-03-at-01-44-45.png)


**Other Cons:**

1) It kills the gradients: if $z$ takes large positive or large negative values, then $\sigma(z)$ tends to $1$ or $0 \implies \sigma'(z) \to 0$.

2) NNs are basically linear models with nonlinear activations between layers. As such, they are sensitive to feature scaling, and it is recommended to normalize the data before training, that is, to center it. Since sigmoid is a nonnegative function, its outputs are not zero-centered: it tends to shift the mean towards positive values.
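
Both drawbacks are easy to see numerically. This is a small sketch using the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The sample points and the random inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(z))   # largest at z = 0 (0.25), nearly 0 at |z| = 10

# Zero-centered inputs produce outputs whose mean is not zero.
rng = np.random.default_rng(0)
inputs = rng.standard_normal(10_000)
print(inputs.mean(), sigmoid(inputs).mean())
```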

---

## 10.2. $\tanh$.

![alt text](https://i.ibb.co/PxBz9BG/Screen-Shot-2020-11-03-at-01-59-03.png)

---

## 10.3. ReLU.

![alt text](https://i.ibb.co/tcZNKW0/Screen-Shot-2020-11-03-at-02-00-53.png)


**Other Cons:**

- For negative values of the argument it kills the gradient, since the gradient for $z \leq 0$ is $0$.
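
A short sketch of ReLU and its gradient, using the convention stated above that the gradient at $z = 0$ is taken to be $0$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 where z > 0, 0 elsewhere: no gradient flows through inactive units.
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))
print(relu_grad(z))
```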

---

## 10.4. Leaky ReLU.

![alt text](https://i.ibb.co/m5f538N/Screen-Shot-2020-11-03-at-02-04-36.png)

---

## 10.5. Parametric ReLU.

![alt text](https://i.ibb.co/z54SwMD/Screen-Shot-2020-11-03-at-02-05-44.png)

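- $\alpha$ is a learnable parameter: it is trained together with the weights, unlike the fixed slope of Leaky ReLU, which is a hyperparameter.
- It is not used very often.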
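
A sketch covering both Leaky ReLU and Parametric ReLU, assuming the usual definition $f(z) = z$ for $z > 0$ and $\alpha z$ otherwise. With a fixed `alpha` (for example `0.01`, a commonly used value) this is Leaky ReLU. In the original PReLU formulation the slope is instead learned by gradient descent, which requires the derivative with respect to `alpha`. The function names are illustrative.

```python
import numpy as np

def prelu(z, alpha):
    return np.where(z > 0, z, alpha * z)

def prelu_grad_z(z, alpha):
    # Gradient with respect to the input: never exactly zero when alpha != 0.
    return np.where(z > 0, 1.0, alpha)

def prelu_grad_alpha(z):
    # Gradient with respect to the slope: this is what makes alpha learnable.
    return np.where(z > 0, 0.0, z)

z = np.array([-2.0, -0.5, 1.0, 3.0])
print(prelu(z, 0.01))
print(prelu_grad_z(z, 0.01))
print(prelu_grad_alpha(z))
```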

---

## 10.6. Exponential Linear Units (ELU).

![alt text](https://i.ibb.co/HBJHhbJ/Screen-Shot-2020-11-03-at-02-07-00.png)
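
A minimal sketch of ELU, assuming the standard definition $f(z) = z$ for $z > 0$ and $\alpha(e^{z} - 1)$ otherwise, with $\alpha = 1$ as a default chosen here for illustration.

```python
import numpy as np

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def elu_grad(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))        # negative inputs saturate smoothly towards -alpha
print(elu_grad(z))   # gradient stays positive for negative inputs
```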

---

# 11. Activation Functions: Conclusions.

- Use ReLU as a baseline approach.
- Be careful with the learning rates.
- Try out Leaky ReLU or ELU.
- Try out tanh but do not expect much from it.
- Do not use Sigmoid.