# Weights

$$y = Wx$$

$$W = \begin{bmatrix}
\cdots & W_{1}^T & \cdots \\
\cdots & W_{2}^T & \cdots \\
\vdots & \vdots & \vdots \\
\cdots & W_{n}^T & \cdots \\
\end{bmatrix}$$

- **W** has one row per unit in the layer, so the number of rows equals the number of outputs of this layer.
- Each row is the weight vector **w** of one unit. Its length equals the number of inputs **x** to this layer, so the number of columns of **W** equals the number of inputs.
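
The shape rule above can be checked directly in PyTorch. This is a minimal sketch; the layer sizes (`n_inputs`, `n_units`) are made up for illustration and are not taken from the linked notebook.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes, chosen only for illustration.
n_inputs, n_units = 5, 3

layer = nn.Linear(n_inputs, n_units)

# PyTorch stores the weights as (out_features, in_features):
# one row per unit, one column per input.
print(layer.weight.shape)  # torch.Size([3, 5])

x = torch.randn(n_inputs)
y = layer.weight @ x + layer.bias  # y = Wx, plus the bias term
print(y.shape)  # torch.Size([3])
```

Note that `nn.Linear` takes the number of inputs first, while the stored weight matrix lists the number of outputs (rows) first.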


<img src="https://github.com/Sayan-Roy-729/images/blob/main/deep_learning/11_weights/images/image-1.png?raw=true" style="border-radius: 10px"/>

**Code:**
- [Part 1 - Explanation of Weight Matrix Sizes](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%201%20-%20Explanation%20of%20Weight%20Matrix%20Sizes.ipynb)

## A Surprising Demo of Weight Initializations

**Take-Home Message:**
- Models cannot learn when all trainable parameters are initialized to the same value.
- Models can learn as long as some trainable parameters are initialized to different numbers.
- Why is this??!!??!!

**Code:**
- [Part 2 - A Surprising Demonstration of Weight Initializations](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%202%20-%20A%20Surprising%20Demonstration%20of%20Weight%20Initializations.ipynb)

## Theory: Why and how to initialize weights
>**Weight Initializations:**
- Complex, multidimensional landscapes have few local minima.
- Initializing the weights to random numbers means that we are exceedingly unlikely to start in or near a local minimum (for DL).
- Random initialization provides the rough texture that gradient descent needs in order to move.
- Analogy: Walking downhill on the Bonneville Salt Flats vs. in Badlands National Park.
- Random weights allow for any direction, even if it's (initially) the wrong one.
- From engineering: "Stochastic facilitation"

>**Weight initialization: the algebraic view**
- If all weights are the same, there is no diversity.
- With random weights, some will be strengthened, some weakened, some by a lot and some by a little.
- Equal weights is called "weight symmetry" and thus randomizing weights is "breaking symmetry".
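
Weight symmetry can be made visible by inspecting gradients. The sketch below assumes a tiny made-up network and random toy data (neither comes from the linked notebooks): when every parameter starts at the same constant, every hidden unit receives exactly the same gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))

# Set every trainable parameter to the same (arbitrary) constant.
with torch.no_grad():
    for p in net.parameters():
        p.fill_(0.5)

# Toy data, made up for this example.
X = torch.randn(20, 2)
y = torch.randn(20, 1)

loss = nn.functional.mse_loss(net(X), y)
loss.backward()

# Each row is the gradient for one hidden unit.
hidden_grad = net[0].weight.grad
print(hidden_grad)

# All rows match, so the units receive identical updates and stay identical.
rows_identical = (torch.allclose(hidden_grad[0], hidden_grad[1])
                  and torch.allclose(hidden_grad[1], hidden_grad[2]))
print(rows_identical)
```

Because the updates are identical, the hidden units remain copies of each other after every step. Replacing the constant fill with random values breaks the symmetry.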

>**So, how to initialize the weights?**
- Random numbers drawn from a normal (Gaussian) distribution.
- Standard deviation should be relatively small.
- Small weights (close to zero) increase risk of vanishing gradients.
- Large weights increase risk of exploding gradients.
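- Solution: Scale the variance of the weights to the size of the layer. Xavier and Kaiming initialization both shrink the variance as the number of inputs (and, for Xavier, outputs) grows.
- Initializing biases is less important than initializing weights.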

![Methods](https://github.com/Sayan-Roy-729/images/blob/main/deep_learning/11_weights/images/image-2.png?raw=true)

![Methods](https://github.com/Sayan-Roy-729/images/blob/main/deep_learning/11_weights/images/image-3.png?raw=true)
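
As a rough illustration, the two methods named in the notebooks below (Xavier and Kaiming) are available in `torch.nn.init`. The sketch assumes their normal-distribution variants and a hypothetical 512-input, 256-unit layer, and compares the empirical standard deviation of the weights with the value each formula predicts.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer size, chosen for illustration.
layer = nn.Linear(512, 256)
fan_in, fan_out = layer.in_features, layer.out_features

# Xavier (Glorot) normal: std = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_normal_(layer.weight)
xavier_std = layer.weight.std().item()
xavier_expected = math.sqrt(2.0 / (fan_in + fan_out))

# Kaiming (He) normal for ReLU: std = sqrt(2 / fan_in)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
kaiming_std = layer.weight.std().item()
kaiming_expected = math.sqrt(2.0 / fan_in)

# Biases matter less; zero is one common choice.
nn.init.zeros_(layer.bias)

print(f"Xavier : empirical {xavier_std:.4f}, expected {xavier_expected:.4f}")
print(f"Kaiming: empirical {kaiming_std:.4f}, expected {kaiming_expected:.4f}")
```

In both cases the standard deviation gets smaller as the layer gets wider, which keeps the scale of the activations roughly stable from layer to layer.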

>**Do weight initializations matter?**
- For relatively simple models that are easy to train: No, not really. Just break symmetry and the model will be fine.
- For very large, complex models with billions (or more!) of parameters, weight initialization can be important.
- Optimal weight initialization strategy is an active area of research in DL.

>**Code:**
- [Part 3 - CodeChallenge Weight Variance Inits](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%203%20-%20CodeChallenge%20Weight%20Variance%20Inits.ipynb)
- [Part 4 - Xavier and Kaiming Initializations](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%204%20-%20Xavier%20and%20Kaiming%20Initializations.ipynb)
- [Part 5 - CodeChallenge: Xavier vs. Kaiming](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%205%20-%20CodeChallenge%20CodeChallenge%20Xavier%20vs.%20Kaiming.ipynb)
- [Part 6 - CodeChallenge Identically random weights](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%206%20-%20CodeChallenge%20Identically%20random%20weights.ipynb)

## Freezing Weights During Learning

**Freezing Weights:**
- Freezing a layer means switching off gradient descent for that layer.
- This means that the weights will not change, and thus the layer will no longer learn.
- Think of the parameter `requires_grad` as a toggle that switches learning on (`True`) or off (`False`) for that weight matrix (or bias).
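
A minimal sketch of the toggle in action, assuming a small made-up network and random toy data (not the model from the linked notebook): the first layer is frozen, one training step is taken, and only the unfrozen layer changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: no gradients are computed for its parameters.
for p in net[0].parameters():
    p.requires_grad = False

frozen_before = net[0].weight.detach().clone()
head_before = net[2].weight.detach().clone()

# Toy data, made up for this example.
X, y = torch.randn(16, 4), torch.randn(16, 2)

# Pass only the trainable parameters to the optimizer.
trainable = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(net(X), y)
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(net[0].weight, frozen_before)
head_changed = not torch.equal(net[2].weight, head_before)
print(frozen_unchanged, head_changed)
```

Setting `requires_grad = True` again on the same parameters switches learning back on, provided they are also handed to the optimizer.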

**Why would you want to switch off learning?!!**
- The main application is when working with downloaded (pretrained) networks.
- The network is already trained, but you need to fine-tune it for your specific dataset.
- This is called `transfer learning`.

**Code:**
- [Part 7 - Freezing Weights During Learning](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%207%20-%20Freezing%20Weights%20During%20Learning.ipynb)

## Learning-Related Changes in Weights

>**Changes in weight matrices over time**
- Obviously, the weights change over time (that's the whole point!)
- How to quantify those changes?
- Shape and width of distribution (histograms)
- General changes over all weights per layer
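
One way to look at the shape and width of a layer's weight distribution is sketched below. The layer is hypothetical and untrained; in practice the same lines would be run on a layer of the model at several points during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(100, 50)  # hypothetical layer, for illustration only

# Flatten the weight matrix into one long vector, detached from autograd.
w = layer.weight.detach().flatten()

# Histogram with 20 bins; the bin range defaults to [w.min(), w.max()].
counts = torch.histc(w, bins=20)

# The standard deviation is a simple summary of the distribution's width.
width = w.std().item()

print(counts)
print(f"std = {width:.4f}")
```

Storing `counts` and `width` once per epoch gives a picture of how the distribution changes as the model learns.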

>**Metric 1: Euclidean distance to previous**

$$d = \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(W_{ij}^{(t)} - W_{ij}^{(t-1)}\right)^2}$$

- **Explanation in words:** Subtract two matrices, square each element, sum the entire matrix, take the square root.
- **What it means:** Larger distances mean the weights are changing rapidly (a lot of learning). Small distances (close to zero) mean the weights change very little (little learning).
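
The description in words translates almost line for line into PyTorch. The helper name `weight_change` and the two small matrices are assumptions made for this sketch.

```python
import torch

def weight_change(W_now, W_prev):
    """Euclidean (Frobenius) distance between two weight matrices."""
    # Subtract, square each element, sum everything, take the square root.
    return torch.sqrt(((W_now - W_prev) ** 2).sum())

# Made-up matrices standing in for the weights at steps t-1 and t.
W_prev = torch.zeros(2, 2)
W_now = torch.tensor([[3.0, 0.0], [0.0, 4.0]])

d = weight_change(W_now, W_prev)
print(d.item())  # 5.0, i.e. sqrt(9 + 16)

# The built-in matrix norm of the difference gives the same number.
d_builtin = torch.linalg.norm(W_now - W_prev)
```

During training, `W_prev` would be a detached copy of the layer's weights saved before the update, for example `layer.weight.detach().clone()`.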

>**Metric 2: Condition number**

$$\kappa = \frac{\sigma_{max}}{\sigma_{min}}$$
- **Explanation in words:** Compute the SVD, take the ratio of largest to smallest singular values.
- **What it means:** A large condition number means the matrix stretches some directions far more than others: some directions are spacious while others are thin. This suggests the network has learned specific features. It can indicate a sparse representation or possible overfitting.

![Condition number of a matrix](https://github.com/Sayan-Roy-729/images/blob/main/deep_learning/11_weights/images/image-4.png?raw=true)
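
The condition number can be computed from the singular values as described. This is a sketch with a made-up diagonal matrix whose singular values are easy to read off; the helper name `condition_number` is an assumption.

```python
import torch

def condition_number(W):
    """Ratio of the largest to the smallest singular value of W."""
    s = torch.linalg.svdvals(W)  # singular values in descending order
    return s[0] / s[-1]

# Made-up matrix with singular values 10 and 1.
W = torch.tensor([[10.0, 0.0], [0.0, 1.0]])

kappa = condition_number(W)
print(kappa.item())  # approximately 10

# torch.linalg.cond uses the same 2-norm definition by default.
kappa_builtin = torch.linalg.cond(W)
```

For a layer in a model, pass `layer.weight.detach()` so the computation stays outside the autograd graph.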

>**Code:**
- [Part 8 - Weight Characteristics During Learning](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%2011%20-%20Weights/Part%208%20-%20Weight%20Characteristics%20During%20Learning.ipynb)