# 1. Hyperparameter Optimization 
We have just introduced a lot of new hyperparameters, and the list will only grow as our neural networks become more complex. This leads us into a topic called **hyperparameter optimization**. So, what are the hyperparameters that we have learned about so far? 

>* **Learning rate** or, if it is adaptive, **initial learning rate** and **decay rate** <br>
* **Momentum** <br>
* **Regularization weight**<br>
* **Hidden layer size**<br>
* **Number of hidden layers**<br>

So, what are some approaches to choosing these hyperparameters? 

## 1.1 K-Fold Cross-Validation
Let's review **K-Fold Cross-Validation**. The idea is simple:
> 1. Split the data into K parts
2. Loop K times
3. On each iteration, hold out one part as the validation set and use the rest of the data as the training set

So in the first iteration of the loop, we validate with the first part, and train on the rest. 

<img src="images/k-folds-validation.png">

Here is some pseudocode that can do this:
```
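import numpy as np
from sklearn.utils import shuffle  # assumed source of shuffle(); it permutes X and Y together

def crossValidation(model, X, Y, K=5):
    X, Y = shuffle(X, Y)
    # Fold size: integer division so it can be used as a slice index.
    # If len(Y) is not a multiple of K, the leftover rows are only ever used for training.
    sz = len(Y) // K
    scores = []
    for k in range(K):
        # Train on everything outside fold k
        xtr = np.concatenate([ X[:k*sz, :], X[(k*sz + sz):, :] ])
        ytr = np.concatenate([ Y[:k*sz], Y[(k*sz + sz):] ])
        # Validate on fold k
        xte = X[k*sz:(k*sz + sz), :]
        yte = Y[k*sz:(k*sz + sz)]

        model.fit(xtr, ytr)
        score = model.score(xte, yte)
        scores.append(score)
    return np.mean(scores), np.std(scores)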
```

Now, we can see that this procedure produces **K** different scores. We can simply use the mean of these scores as a measure of how good a particular hyperparameter setting is. We could also run a statistical test to determine whether the difference between two hyperparameter settings is statistically significant.
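One way to make that comparison concrete is a paired t-test on the per-fold scores, assuming both settings were scored on the same K folds. The sketch below only computes the t statistic, using the standard library. The function name `paired_t_statistic` and the fold scores are made up for illustration, and turning the statistic into a p-value would need a t-distribution table or a statistics library.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the per-fold score differences between two settings."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both settings must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation of the differences (K - 1 in the denominator)
    sd_diff = statistics.stdev(diffs)
    if sd_diff == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Hypothetical fold scores for two hyperparameter settings
scores_a = [0.91, 0.89, 0.92, 0.90, 0.93]
scores_b = [0.88, 0.87, 0.90, 0.89, 0.90]
t = paired_t_statistic(scores_a, scores_b)
```

Treat the result as a rough guide: the folds share training data, so the scores are not fully independent.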

## 1.2 scikit-learn K-Folds
scikit-learn has its own K-folds implementation that is great to use:

```
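# cross_val_score lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, Y, cv=K)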
```

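Note that the scikit-learn implementation requires your model to conform to parts of the scikit-learn estimator API. For example, you must provide a class with the methods `fit()`, `predict()`, and `score()`. Because the model is cloned for each fold, it also needs `get_params()`, which you can get by inheriting from `sklearn.base.BaseEstimator`. 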

## 1.3 Leave-One-Out Cross-Validation
One special case of K-fold cross-validation sets K = N, where N is the number of samples. We will talk more about this later, but the basic idea is:
> 1. We do a loop N times 
2. Every iteration of the loop we train on everything but one point 
3. We test on the one point that was left out
4. We do this N times for all points 
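
The steps above can be sketched in plain Python. To keep it self-contained, the "model" here is an assumption: it just predicts the mean of the training targets, and each held-out point is scored by squared error. `loo_mean_squared_error` is a made-up name.

```python
def loo_mean_squared_error(y):
    """Leave-one-out CV for a toy model that predicts the training mean.

    y is a plain Python list of target values.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points")
    errors = []
    for i in range(n):
        # Train on everything except point i
        train = y[:i] + y[i + 1:]
        prediction = sum(train) / len(train)
        # Test on the single point that was left out
        errors.append((y[i] - prediction) ** 2)
    # Average the N per-point errors
    return sum(errors) / n

y = [1.0, 2.0, 3.0, 4.0]
loo_error = loo_mean_squared_error(y)
```

Notice that the model is fit N times, which is why this gets expensive for large datasets.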

Now with all of that discussed, what are some of the different approaches to hyperparameter optimization? 

---

<br>

## 1.4 Grid Search
Grid search is an exhaustive search. This means that you can choose a set of learning rates that you want to try, choose a set of momentums you want to try, and choose a set of regularizations that you want to try, at which point you try every combination of them. In code that may look like: 

```
learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001]
momentums = [1, 0.1, 0.01, 0.001]
regularizations = [1, 0.1, 0.01]

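best_score, best_params = None, None
for lr in learning_rates: 
    for mu in momentums:
        for reg in regularizations:
            # cross_validation is a placeholder: train and score a model with these settings
            score = cross_validation(lr, mu, reg, data)
            # keep the best setting seen so far (assuming a higher score is better)
            if best_score is None or score > best_score:
                best_score, best_params = score, (lr, mu, reg)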
```

As you can imagine, this is **very** slow, because the number of models to train is the product of the list lengths (5 x 4 x 3 = 60 here). But, since each model is independent of the others, there is a great opportunity here for **parallelization**! Frameworks like **Hadoop** and **Spark** are well suited to this type of problem. 
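
To show why independence makes this easy to parallelize, here is a minimal single-machine sketch using only the standard library. `toy_score` is a made-up stand-in for cross-validation, and the shortened value lists are just for illustration. A thread pool keeps the example simple; for real, CPU-bound training you would more likely use separate processes or a cluster framework.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def toy_score(params):
    """Stand-in for cross-validation: higher is better, peak at (0.01, 0.1, 0.1)."""
    lr, mu, reg = params
    return -((lr - 0.01) ** 2 + (mu - 0.1) ** 2 + (reg - 0.1) ** 2)

learning_rates = [0.1, 0.01, 0.001]
momentums = [1, 0.1, 0.01]
regularizations = [1, 0.1, 0.01]

# Every combination of the three lists
grid = list(itertools.product(learning_rates, momentums, regularizations))

# Each combination is scored independently, so the work can be handed to a pool
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(toy_score, grid))

# pool.map preserves input order, so scores[i] belongs to grid[i]
best_score, best_params = max(zip(scores, grid))
```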

---

<br>

## 1.5 Random Search
On the other hand, we have **random search**, which, instead of trying every possibility, moves in random directions and keeps a move whenever the score improves. A basic algorithm could look like: 

```
theta = random position in hyperparameter space
score1 = cross_validation(theta, data)
for i in range(max_iterations):
    next_theta = sample from hypersphere around theta
    score2 = cross_validation(next_theta, data)
    if score2 is better than score1:
        theta = next_theta
        score1 = score2
```
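
Here is a runnable version of that idea, under a few assumptions: `toy_loss` is a made-up objective standing in for a cross-validation loss (lower is better), the starting point is drawn from the unit hypercube, and candidates are sampled from a box around `theta` rather than a true hypersphere, just to keep the sampling simple.

```python
import random

def toy_loss(theta):
    """Stand-in for a cross-validation loss: minimum at (0.3, 0.7)."""
    return (theta[0] - 0.3) ** 2 + (theta[1] - 0.7) ** 2

def random_search(loss, dim, max_iterations=200, radius=0.1, seed=0):
    rng = random.Random(seed)
    # Random starting position in the unit hypercube
    theta = [rng.random() for _ in range(dim)]
    best = loss(theta)
    for _ in range(max_iterations):
        # Perturb every coordinate by at most `radius`
        candidate = [t + rng.uniform(-radius, radius) for t in theta]
        score = loss(candidate)
        # Only move when the candidate is an improvement
        if score < best:
            theta, best = candidate, score
    return theta, best

theta, best = random_search(toy_loss, dim=2)
```

The `radius` controls how far each step can move: too small and the search crawls, too large and most candidates are rejected.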

---
<br>

# 2. Sampling Logarithmically 
Let's now talk about how to sample random numbers when performing random search. It is not quite as straightforward as you may assume. 

## 2.1 Main Problem
Suppose we want to randomly sample the learning rate. We know that the difference between 0.001 and 0.0011 is not that significant. In general, we want to try different numbers on a log scale, such as $10^{-2}$, $10^{-3}$, $10^{-4}$, etc. So if you sample between $10^{-7}$ and $10^{-1}$ uniformly, what is going to happen? 

Well, if we look at the image below, we can see that most of what we would sample is on the same scale as $10^{-1}$, while everything else is underrepresented! In fact, about 90% of uniform samples from that range land between $10^{-2}$ and $10^{-1}$.

<img src="images/sampling-scale.png">
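
We can check this with a quick experiment. The sketch below draws uniform samples from the range above and measures how many land in the largest decade versus below $10^{-4}$. The sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(0)
samples = [rng.uniform(1e-7, 1e-1) for _ in range(100_000)]

# Fraction of samples in the top decade, [1e-2, 1e-1]
top_decade = sum(1 for s in samples if s >= 1e-2) / len(samples)

# Fraction of samples below 1e-4, a region covering three of the six decades
small = sum(1 for s in samples if s < 1e-4) / len(samples)
```

Roughly nine out of ten samples end up in the top decade, while half of the decades we care about together receive only about 0.1% of the samples.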

So, how can we fix this problem? Well, we will sample on a **log scale**! That way we will get an even distribution across each power of 10, which is exactly what we want! Algorithmically this may look like:

```
Sample R uniformly from (-7, -1) # Or whatever limits you want 
R ~ U(-7, -1)
Set your learning rate to be 10^R
```
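
In Python this could look like the sketch below. The helper name `sample_log_uniform` is made up, and the decade counts are only there to show that each power of 10 now gets a similar share of the samples.

```python
import math
import random

def sample_log_uniform(low_exp, high_exp, rng):
    """Return 10**R with R drawn uniformly from [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

rng = random.Random(0)
learning_rates = [sample_log_uniform(-7, -1, rng) for _ in range(60_000)]

# Count how many samples land in each decade [10^e, 10^(e+1))
counts = {e: 0 for e in range(-7, -1)}
for lr in learning_rates:
    # Clamp to guard against floating-point rounding at the range edges
    e = min(max(math.floor(math.log10(lr)), -7), -2)
    counts[e] += 1
```

With six decades and 60,000 samples, each decade should receive roughly 10,000 of them.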

This can also be used for hyperparameters like decay, where we want to try numbers like 0.9, 0.99, 0.999, etc! 

It may not seem intuitive that these numbers are still on a log scale, but we can rewrite them as:
#### $$0.9 = 1 - 10^{-1}$$
#### $$0.99 = 1 - 10^{-2}$$
#### $$0.999 = 1 - 10^{-3}$$

Written this way, we can see that they are indeed still on a log scale! These values will give very different results, so being able to sample them effectively is very important. Algorithmically it may look like:

```
R ~ U(lower, upper)
Decay Rate = 1 - 10^R
```
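
As a small sketch in Python, with the made-up helper name `sample_decay_rate` and the limits -3 and -1 chosen to match the 0.9 to 0.999 range from the examples above:

```python
import random

def sample_decay_rate(lower, upper, rng):
    """Return 1 - 10**R with R drawn uniformly from [lower, upper]."""
    r = rng.uniform(lower, upper)
    return 1 - 10 ** r

rng = random.Random(0)
# R in [-3, -1] gives decay rates between 0.9 and 0.999
decay_rates = [sample_decay_rate(-3, -1, rng) for _ in range(5)]
```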