# 1. Modern Regularization Techniques
We are now going to look at modern regularization techniques such as dropout, a technique developed by Geoff Hinton and his team. Up until now we have worked with the following types of regularization:

<br>
## 1.1 L1 and L2 Regularization
**L1 regularization** encourages weights to be sparse, meaning most weights end up exactly equal to 0. **L2 regularization** encourages most weights to be small, meaning approximately equal to 0.

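#### $$L_{regularized} = L_{likelihood} + \lambda \lVert \theta \rVert_p^p$$

Here $\lambda$ controls the strength of the penalty, $\theta$ is the vector of weights, and $p = 1$ gives L1 regularization while $p = 2$ gives L2 regularization.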
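
As a minimal sketch of the penalty term, assuming it is the sum of $|\theta_i|^p$ over all weights, the following computes it for $p = 1$ and $p = 2$. The function names, the weight values, and `lam` are made up for illustration.

```python
def penalty(theta, lam, p):
    """Return lam * sum(|theta_i| ** p); p=1 is L1, p=2 is L2."""
    return lam * sum(abs(w) ** p for w in theta)


def regularized_loss(data_loss, theta, lam, p):
    """Add the penalty to an already-computed likelihood-based loss."""
    return data_loss + penalty(theta, lam, p)


theta = [0.5, -2.0, 0.0, 1.0]       # hypothetical weights
l1 = penalty(theta, lam=0.1, p=1)   # 0.1 * (0.5 + 2.0 + 0.0 + 1.0)
l2 = penalty(theta, lam=0.1, p=2)   # 0.1 * (0.25 + 4.0 + 0.0 + 1.0)
```

Note how the large weight (-2.0) dominates the L2 penalty far more than the L1 penalty, since it is squared.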

<br>
## 1.2 Dropout 
In contrast to L1 and L2 regularization, **dropout** does not add any penalty term to the cost. Instead, it works in a totally different way: it drops random nodes in the neural network during training. As a result, no hidden layer unit can rely on just 1 input feature, because at any time that node could be dropped.

<img src="images/dropout.png">

We will see that dropout emulates an ensemble of neural networks. What exactly is meant by ensemble? Well, remember how we mentioned we would be dropping some nodes in the neural network. We can imagine that instead of just randomly dropping nodes during training, we could actually create several instances of neural networks with these different structures and train them all. Then, to calculate the final prediction, we could average the predictions of each individual neural network.

<br>
**Pseudocode**
```
prediction1 = neuralNetwork1.predict(X)
prediction2 = neuralNetwork2.predict(X)
prediction3 = neuralNetwork3.predict(X)

finalPrediction = mean(prediction1, prediction2, prediction3)
```
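
A runnable version of this idea might look like the sketch below. It assumes each network outputs class probabilities, and the three `prediction` arrays are hypothetical values standing in for the outputs of three trained networks.

```python
import numpy as np

# Hypothetical class-probability outputs of three trained networks
# for the same 2 samples and 3 classes (each row sums to 1).
prediction1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
prediction2 = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
prediction3 = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])

# Average across the model axis, then pick the most probable class.
final_prediction = np.mean([prediction1, prediction2, prediction3], axis=0)
final_labels = final_prediction.argmax(axis=1)
```

For the second sample, the first network on its own would choose class 1, but the averaged prediction chooses class 2.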

---

<br>
# 2. Dropout 
We are now going to dig further to see exactly how dropping nodes randomly in a neural network performs regularization, and how it emulates an ensemble of neural networks. 

<br>
## 2.0 Ensembles
First, let's quickly discuss ensembles. The basic idea is that by using a group of prediction models that are different, and then taking the average or a majority vote, we can end up with better accuracy than if we had just used 1 prediction model. So, what do we mean by different?

<br>
**Method 1**
<br>
One easy way is to just train on different random subsets of the data. This is also good if your algorithm doesn't scale. As an example, say you have 1 million training points, but you train 10 different versions of a decision tree, each on only 100,000 points. That would be an ensemble of decision trees. Then, to get a prediction, you take a majority vote from these 10 different decision trees.

<br>
**Method 2**
<br>
Another method is to not use all of the features. So, if we have 100 features, maybe each of the 10 decision trees will only look at 10 of them. The result is that instead of training 1 decision tree on a 1 million x 100 matrix X, we train 10 decision trees on 100k x 10 matrices, which are all sampled from the original matrix. Miraculously, this results in better performance than just training 1 decision tree. Dropout is more like this method.
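
The two sampling methods can be sketched together as follows. This is only an illustration: the matrix is scaled down from the sizes in the text, and sampling rows and columns independently and without replacement for each model is an assumption, not something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-in for the 1 million x 100 matrix in the text.
N, D = 1000, 100
X = rng.normal(size=(N, D))

n_models, n_rows, n_cols = 10, N // 10, 10

subsets = []
for _ in range(n_models):
    rows = rng.choice(N, size=n_rows, replace=False)  # Method 1: subset of samples
    cols = rng.choice(D, size=n_cols, replace=False)  # Method 2: subset of features
    subsets.append((rows, cols, X[np.ix_(rows, cols)]))

# Each ensemble member would be trained on its own (n_rows x n_cols) view of X.
```

Keeping `cols` alongside each submatrix matters at prediction time, because each model must be fed the same feature columns it was trained on.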

---

<br>
## 2.1 Dropout
So, how exactly does dropout work? Well, as we said, we are only going to use a subset of features. However, we are not only going to do this at the input layer, we are going to do this at every layer. At every layer, we are going to choose randomly which nodes to drop. We use a probability $p(drop)$ or $p(keep)$ to tell us the probability of dropping or keeping a node. Typical values of $p(keep)$ are 0.8 for the input layer, and 0.5 for the hidden layers. 

Note, we only drop nodes during training. Also, notice that when we discussed ensembling there were 10 decision trees. However, with dropout there is still only 1 neural network. This is because we have only talked about training up until now. The other part is of course prediction.
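
A minimal NumPy sketch of a training-time forward pass with dropout at every layer is shown below. The layer sizes, the ReLU activation, and the function names are assumptions made for illustration; only the $p(keep)$ values of 0.8 and 0.5 come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout(activations, p_keep, rng):
    """Zero each entry independently with probability 1 - p_keep."""
    mask = rng.binomial(n=1, p=p_keep, size=activations.shape)
    return activations * mask


def forward_train(X, W, b, V, c, rng, p_keep_input=0.8, p_keep_hidden=0.5):
    """One training-time forward pass: X -> hidden Z -> class scores."""
    X_drop = dropout(X, p_keep_input, rng)
    Z = np.maximum(0.0, X_drop.dot(W) + b)   # ReLU chosen for illustration
    Z_drop = dropout(Z, p_keep_hidden, rng)
    return Z_drop.dot(V) + c                 # pre-softmax scores


# Hypothetical sizes: batch of 4, 6 inputs, 5 hidden units, 3 classes.
X = rng.normal(size=(4, 6))
W, b = rng.normal(size=(6, 5)), np.zeros(5)
V, c = rng.normal(size=(5, 3)), np.zeros(3)
scores = forward_train(X, W, b, V, c, rng)
```

Because a fresh mask is sampled on every call, two calls to `forward_train` with the same inputs will generally give different scores.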

### 2.1.1 Dropout - Prediction
The way that we do prediction is that instead of dropping nodes, we multiply the output of a layer by its $p(keep)$. This effectively shrinks all of the values at that layer. In a way, that is similar to what L2 regularization does. It makes all of the weights smaller so that effectively all of the values are smaller. 

<br>
**Pseudocode**
<br>
```
# Prediction
# If we have 1 hidden layer: X -> Z -> Y
X_drop = p(keep | layer1) * X
Z = f(X_drop.dot(W) + b)
Z_drop = p(keep | layer2) * Z
Y = softmax(Z_drop.dot(V) + c)

# So we can see that we are multiplying by the p(keep) at that layer
# This shrinks the value. L2 regularization also encourages weights to be small,
# Leading to shrunken values
```
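
To see why scaling by $p(keep)$ is a reasonable stand-in for averaging over many dropped-out networks, the sketch below compares the two for a single linear layer. The input `x` and weights `W` are hypothetical values. The match holds in expectation for the linear part only; once a nonlinearity `f` is applied, it becomes an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5

x = np.array([1.0, -2.0, 0.5, 3.0, -1.5])   # one hypothetical input row
W = np.arange(15).reshape(5, 3) / 10.0      # hypothetical weights

# Training-style: sample many masks and average the resulting pre-activations.
masks = rng.binomial(n=1, p=p_keep, size=(50_000, x.size))
ensemble_average = ((masks * x) @ W).mean(axis=0)

# Prediction-style: a single pass with the input scaled by p(keep).
scaled_pass = (p_keep * x) @ W
```

With enough sampled masks, `ensemble_average` and `scaled_pass` agree closely, which is the sense in which a single scaled pass approximates the ensemble.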

Let's think about what this ensemble represents. If we have in total $N$ nodes in the neural network, that is just:

#### $$N = \text{number of input nodes} + \text{number of hidden nodes}$$

Each of these nodes can have 2 states: **On** or **Off**, **Drop** or **Keep**. So, that means in total we have:

#### $$\text{possible neural networks} = 2^N$$

Therefore, we are approximating an ensemble of $2^N$ different neural networks. Now, imagine the case where you were not doing an approximation. Let's take a very small neural network with only 100 nodes ($N = 100$). Keep in mind that is a very small neural net. For comparison, a network for MNIST would have roughly 1000 nodes in total: 784 for the input (28 x 28 pixels) and 300 for the hidden layer. Anyway, if we only had 100 nodes, we would still have:

#### $$2^{100} \approx 1.3 \times 10^{30}$$

Imagine training that many neural nets. It would clearly be infeasible. So, we can't actually build that ensemble; however, multiplying by $p(keep)$ allows us to approximate it.
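
The counting argument can be checked directly. The sketch below enumerates every keep/drop pattern for a toy network with 3 nodes and then computes the count for 100 nodes. The function name is made up for illustration.

```python
from itertools import product


def count_subnetworks(n_nodes):
    """Each node is either kept or dropped, so there are 2 ** n_nodes masks."""
    return 2 ** n_nodes


# Enumerate every keep/drop pattern for a toy network with 3 nodes.
all_masks = list(product([0, 1], repeat=3))
assert len(all_masks) == count_subnetworks(3)  # 8 patterns

# For 100 nodes the count is already astronomically large.
n_100 = count_subnetworks(100)
```

Enumeration is only feasible for toy sizes like this one, which is why the ensemble has to be approximated rather than built.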

---
<br>
## 2.2 Dropout - Implementation Theano
The basic approach to implementing dropout in Theano is to multiply by 0 rather than actually drop nodes out of the neural network, since dropping nodes would result in a different computational graph, which Theano wouldn't be able to handle. Multiplying by 0 has the same effect as dropping a node, because anything that comes after it will be multiplied by 0, which is still 0. Since each layer takes an $N \times D$ matrix as input, where $N$ is equal to the batch size, we need to create an $N \times D$ matrix of 0s and 1s to multiply that layer by. We call this matrix a **mask**.
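
The claim that multiplying by 0 is equivalent to removing a node can be checked numerically. The sketch below uses plain NumPy, evaluated eagerly rather than inside a Theano graph, purely to illustrate the arithmetic. The sizes and the $p(keep)$ of 0.5 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, M = 4, 6, 3                      # hypothetical batch, input, and output sizes
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))

# One 0/1 entry per sample and per node, matching the N x D layer input.
mask = rng.binomial(n=1, p=0.5, size=(N, D))
masked_out = (X * mask).dot(W)

# For a single sample, masking equals physically deleting the dropped nodes.
i = 0
kept = mask[i].astype(bool)
removed_out = X[i, kept].dot(W[kept, :])
```

Here `masked_out[i]` matches `removed_out`, even though the masked version never changes the shapes of `X` or `W`. Each row of the mask is sampled independently, so every sample in the batch sees a different set of dropped nodes.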

Now, recall that when you define the computational graph in Theano you are not using real values. You are just specifying which nodes are connected to which other nodes. This means that we cannot multiply by a random NumPy matrix in there, because it would be sampled once, when the graph is built, and is effectively a constant from then on. If we randomly generate a 1 while building the Theano graph, and it is a NumPy 1, then it is always going to be a 1 whenever we call the Theano function to get the output of that graph. So, this would not work:

<br>
**Incorrect Pseudocode**
<br>
```
mask = np.random.binomial(...)
Z = f((X * mask).dot(W) + b) 
```

Instead, what we want to do