# Hidden Unit Dynamics

### Encoder Networks:

![](images/encoder-identity.png)

The goal of this network is to reproduce its input vector at the output after passing it through the bottleneck hidden layer.


#### Hidden Unit Space:

We're trying to take the weights of the neural network and plot them in a graphical way.

Consider a 5-2-5 encoder network. The input-to-hidden layer has 10 weights plus 2 biases, and the hidden-to-output layer has 10 weights plus 5 biases. That's 27 parameters in this network (20 weights and 7 biases).
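
The count above can be checked with a short sketch. The helper name `count_parameters` is made up for illustration, and it assumes a fully connected network with one bias per non-input node:

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    # One weight for every (node, node) pair between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every node that is not an input node.
    biases = sum(layer_sizes[1:])
    return weights, biases


weights, biases = count_parameters([5, 2, 5])
print(weights, biases, weights + biases)  # 20 7 27
```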

We could write them out as a list, but we get a better understanding by plotting them in hidden unit space, where each axis is the activation of one hidden node.

The range of each axis depends on the activation function, e.g. with sigmoid every hidden unit activation lies between 0 and 1, and with tanh between -1 and 1.

Each dot is the hidden unit activation produced by one input pattern. In the usual encoder task the inputs are one-hot, so each dot is determined by the input-to-hidden weights coming from that input node (plus the hidden biases).

Each line corresponds to one output node and represents its hidden-to-output weights. It is the boundary where that node's weighted sum of hidden activations (plus bias) is zero, so the sum is positive on one side and negative on the other.

After the network is trained, each line should separate one dot from all the other dots, and the dots end up as far away from each other as possible.

The same picture generalises: with an 8-3-8 encoder, hidden unit space is 3-dimensional, with 8 dots for the inputs and 8 planes (one per output node) separating them.
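
As a rough sketch of how the dots and lines could be computed for a 5-2-5 encoder, assuming one-hot inputs and tanh hidden units. The parameters here are random and untrained, and the names `W1`, `b1`, `W2`, `b2`, `hidden_points` and `output_lines` are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 5, 2

# Hypothetical (untrained) parameters of a 5-2-5 encoder.
W1 = rng.normal(size=(n_hid, n_in))  # input-to-hidden weights
b1 = rng.normal(size=n_hid)          # hidden biases
W2 = rng.normal(size=(n_in, n_hid))  # hidden-to-output weights
b2 = rng.normal(size=n_in)           # output biases


def hidden_points(W1, b1):
    """One dot per one-hot input: its coordinates are the hidden activations."""
    inputs = np.eye(W1.shape[1])          # each row is a one-hot input
    return np.tanh(inputs @ W1.T + b1)    # shape (n_in, n_hid)


def output_lines(W2, b2):
    """Row k holds (a, b, c) for output k's boundary a*h1 + b*h2 + c = 0."""
    return np.column_stack([W2, b2])      # shape (n_out, 3)


points = hidden_points(W1, b1)
lines = output_lines(W2, b2)
```

Plotting `points` as a scatter plot and each row of `lines` as a straight line over the square from -1 to 1 gives the hidden unit space picture described above.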


### Weight Space Symmetry:

- Swapping any pair of hidden nodes (together with their incoming and outgoing weights) does not affect the overall network performance
- Reversing the sign of all incoming weights (including the bias) and all outgoing weights of any hidden node won't affect overall network performance (assuming the use of an odd transfer function like $\tanh$, where $\tanh(-x) = -\tanh(x)$)
- Hidden nodes tend to do a similar job initially, then gradually specialise

The first two points above give rise to a very large number of other possible weight configurations that all perform equally well. With $H$ hidden nodes in a layer, there are $H! \cdot 2^H$ equivalent configurations.
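
Both symmetries can be checked numerically. This is a minimal sketch, assuming a single hidden layer of tanh nodes with a linear output. The network sizes and all names are arbitrary choices for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 3, 2
W1, b1 = rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid)
W2, b2 = rng.normal(size=(n_out, n_hid)), rng.normal(size=n_out)


def forward(x, W1, b1, W2, b2):
    """Single hidden layer with tanh activation and linear output."""
    return W2 @ np.tanh(W1 @ x + b1) + b2


x = rng.normal(size=n_in)
y = forward(x, W1, b1, W2, b2)

# Symmetry 1: swap hidden nodes 0 and 1 (rows of W1 and b1, columns of W2).
perm = [1, 0, 2]
y_swapped = forward(x, W1[perm], b1[perm], W2[:, perm], b2)

# Symmetry 2: flip the sign of hidden node 2's incoming weights, bias,
# and outgoing weights. This relies on tanh(-z) = -tanh(z).
s = np.array([1.0, 1.0, -1.0])
y_flipped = forward(x, W1 * s[:, None], b1 * s, W2 * s, b2)

print(np.allclose(y, y_swapped), np.allclose(y, y_flipped))  # True True

# Number of equivalent configurations for this hidden layer: 3! * 2**3 = 48.
n_equivalent = math.factorial(n_hid) * 2 ** n_hid
```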


### Controlled Nonlinearity:

- For small weights, each layer implements an approximately linear function, so multiple layers together also implement an approximately linear function
    - This means that stacking multiple layers would be almost equivalent to a single-layer network (no matter how many layers we have, if every layer is linear, the final output is nothing more than a linear function of the first layer's input)
- For large weights, the activation function approximates a step function, which makes learning very slow: the gradient is approximately 0 almost everywhere, so gradient descent makes only very small weight updates
    - Just switching from a sigmoid activation function to the ReLU activation function, which does not saturate for positive inputs, can make gradient descent work much faster
- For typical weights, each layer is close to linear, but still takes advantage of a limited degree of nonlinearity

<img src="images/tanhx-and-x.png" width="50%"/>
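
The three regimes can be illustrated with a small sketch that compares $\tanh(x)$ with $x$ and looks at the slope of $\tanh$. The helper names and the sample values of `x` are arbitrary choices for illustration:

```python
import numpy as np


def tanh_relative_error(x):
    """How far tanh(x) is from the linear function x, relative to |x|."""
    return abs(np.tanh(x) - x) / abs(x)


def tanh_slope(x):
    """Derivative of tanh, which is what backpropagation multiplies by."""
    return 1.0 - np.tanh(x) ** 2


# Small, moderate and large weighted sums.
for x in (0.1, 1.0, 5.0):
    print(f"x={x}: relative error={tanh_relative_error(x):.4f}, slope={tanh_slope(x):.4f}")
```

For the small input the relative error is well under 1% and the slope is close to 1 (nearly linear). For the large input the slope is close to 0, which is the saturated regime where learning slows down.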



### Vanishing/Exploding Gradient: 

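- During backpropagation, the gradient is multiplied at each layer by that layer's weights and by the derivative of its activation function. When this factor is typically smaller than 1 (small weights, or saturated sigmoid/tanh nodes), the gradients diminish exponentially across each layer from output to input
    - Stacking more layers means the weight updates are concentrated on the layers closer to the output layer rather than the layers closer to the input layer.
- When this factor is typically larger than 1 (e.g. large weights with an activation function that does not saturate), the problem reverses and the gradients grow exponentially across each layer from output to input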
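
The shrinking of gradients from output to input can be illustrated numerically. This sketch covers only the vanishing case. It assumes a stack of sigmoid layers with random weights and an arbitrary unit-norm error signal at the output, and the function name and default sizes are made up for illustration:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer_gradient_norms(n_layers=10, width=20, weight_scale=1.0, seed=0):
    """Norm of the backpropagated error at each layer, input side first."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=weight_scale / np.sqrt(width), size=(width, width))
          for _ in range(n_layers)]

    # Forward pass, storing each layer's activations.
    a = rng.normal(size=width)
    activations = []
    for W in Ws:
        a = sigmoid(W @ a)
        activations.append(a)

    # Backward pass from an arbitrary unit-norm error signal at the output.
    grad_a = np.ones(width) / np.sqrt(width)
    norms = []
    for W, a in zip(reversed(Ws), reversed(activations)):
        delta = grad_a * a * (1.0 - a)   # multiply by the sigmoid derivative
        norms.append(np.linalg.norm(delta))
        grad_a = W.T @ delta             # pass the error back through the weights
    return norms[::-1]


norms = layer_gradient_norms()
for layer, norm in enumerate(norms, start=1):
    print(f"layer {layer}: gradient norm {norm:.2e}")
```

Since the sigmoid derivative is at most 0.25, the printed norms drop by a roughly constant factor per layer, so the layers nearest the input receive far smaller updates than those nearest the output.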

#### Avoiding vanishing/exploding gradient:
- Using different activation functions like $\texttt{relu}$ or $\texttt{selu}$:
    <table>
      <tr>
        <td> <img src="images/relu.png"></td>
        <td><img src="images/selu.png"></td>
       </tr>
    </table>

- Weight initialisation
- Batch normalisation
- Long short-term memory (LSTM), for recurrent networks
- Layerwise unsupervised pre-training 
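
For reference, the two activation functions named above can be sketched in NumPy. The SELU constants are the standard published values, and the function names are just illustrative:

```python
import numpy as np

# Standard SELU constants (chosen so activations self-normalise).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """max(0, x): slope 1 for positive inputs, so no saturation there."""
    return np.maximum(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (np.asarray(x) > 0).astype(float)


def selu(x):
    """Scaled exponential linear unit."""
    x = np.asarray(x, dtype=float)
    # expm1 is applied only to the non-positive part to avoid overflow.
    negative_part = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative_part)
```

Unlike sigmoid or tanh, the derivative of both functions stays at a constant nonzero value for all positive inputs, which is one reason they help gradients survive many layers.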




### Avoiding Overfitting:

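- Early stopping: stopping the training when the neural network has reached its minimum error on a validation dataset, even if the training error is still decreasing. The test set should not be used for this decision, otherwise it no longer gives an unbiased estimate of performance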
    <table>
        <tr>
            <td> 
                <img src="images/error-vs-weight-updates-1.png"/>
            </td>
            <td>
                <img src="images/error-vs-weight-updates-2.png"/>
            </td>
        </tr>
    </table>
    Experimenting with the number of hidden nodes and the number of weight updates, and plotting graphs like these, helps us decide on a good network structure and number of training epochs.
- Weight decay
- Dropout: for each mini-batch, a random subset of nodes is left out of training. Each node is 'dropped' independently with a fixed probability $\pi$ that you choose
    - When the network is deployed, all nodes are active, but each node's activation is multiplied by a factor $1 - \pi$. This is so that the expected input each node passes on to the next layer matches what it was during training, when only a fraction $1 - \pi$ of the nodes were active
    - Inspired by biological neural networks where neurons may 'drop out' and other neurons have to 'fill' in for their contribution to the output
    
- Ensembling: training multiple different networks on the same task and then combining their predictions, e.g. by a majority vote across the ensemble of networks.
    - Bagging: each network is trained on its own 'bag' of datapoints, sampled with replacement from the same training set. For any one network, some items will be trained on multiple times and others not at all, and this will be different for each network, giving rise to diversity across the ensemble
    - Dropout is implicitly a form of ensembling because on deployment, it's like the combination of multiple different architectures
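
Three of the techniques above can be sketched in a few lines each. These are illustrative helpers under simple assumptions, not a full training loop: the function names, the `patience` parameter, and the example activations are all made up for illustration.

```python
import numpy as np


def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch whose weights we would keep.

    Stops once the validation error has not improved for `patience` epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


def dropout_train(h, drop_prob, rng):
    """Training: zero each activation independently with probability drop_prob."""
    keep_mask = rng.random(h.shape) >= drop_prob
    return h * keep_mask


def dropout_deploy(h, drop_prob):
    """Deployment: keep every node but scale activations by (1 - drop_prob)."""
    return h * (1.0 - drop_prob)


def bootstrap_indices(n_items, rng):
    """Bagging: draw n_items indices with replacement for one ensemble member."""
    return rng.integers(0, n_items, size=n_items)


rng = np.random.default_rng(0)
h = np.array([0.2, 0.9, 0.5, 0.7])   # example hidden activations
pi = 0.5                             # drop probability

# Averaged over many random masks, the training-time activations
# approach the scaled deployment-time activations.
avg_train = np.mean([dropout_train(h, pi, rng) for _ in range(20000)], axis=0)
print(avg_train, dropout_deploy(h, pi))
```

Many libraries implement the equivalent "inverted" form of dropout, which scales the surviving activations by $1/(1 - \pi)$ during training so that nothing needs rescaling at deployment.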