# 1. Hyperparameter Tuning

## 1.1 Tuning Process

**Tuning priority (most to least important):**
1. Learning rate
2. $\beta$ (momentum), #hidden units, mini-batch size
3. #layers, learning rate decay
4. Adam parameters (the defaults usually suffice)

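Search tips:
- Don't use a grid search; sample hyperparameter combinations at random instead, so that every trial tests a new value of each hyperparameter.
- Coarse to fine
    - Search a wide region coarsely first, then sample more densely inside the sub-region that performed best.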
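The two tips above can be sketched as follows. This is a minimal illustration, not a recipe from the course: `toy_score` is a hypothetical stand-in for training a model and measuring validation performance, and the search ranges and trial counts are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(lr, hidden_units):
    """Hypothetical validation score; it peaks near lr=1e-2 and 64 units."""
    return -((np.log10(lr) + 2.0) ** 2) - ((hidden_units - 64) / 64.0) ** 2

def random_search(lr_exp_range, units_range, n_trials):
    """Try n_trials random points and return the best (score, lr, units)."""
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(*lr_exp_range)          # log-scale sample
        units = int(rng.integers(units_range[0], units_range[1] + 1))
        score = toy_score(lr, units)
        if best is None or score > best[0]:
            best = (score, lr, units)
    return best

# Coarse pass: random points over a wide region.
coarse = random_search((-5.0, 0.0), (8, 256), n_trials=30)

# Fine pass: a narrower box centred on the coarse winner.
_, lr0, units0 = coarse
fine = random_search(
    (np.log10(lr0) - 0.5, np.log10(lr0) + 0.5),
    (max(8, units0 - 32), min(256, units0 + 32)),
    n_trials=30,
)

best = max(coarse, fine)  # tuples compare by score first
print("best score, lr, units:", best)
```

In a real search, each call to `toy_score` would be a full training run, which is why spending the second batch of trials inside the promising region matters.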


## 1.2 Using an Appropriate Scale to Pick Hyperparameters

**How random is it?**
- Uniformly random on a linear scale: not recommended
- Uniformly random on a log scale: better

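```python
import numpy as np

# Sample the learning rate on a log scale: r in (-4, 0], so the rate is in (1e-4, 1]
r = -4 * np.random.rand()
learning_rate = 10**r

# Sample beta for an exponentially weighted average (EWA).
# Sample 1 - beta on a log scale so that beta lies between 0.9 and 0.999.
r = np.random.uniform(-3, -1)
beta = 1 - 10**r
```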

* The goal is to sample the hyperparameters more efficiently. Uniform sampling on a linear scale is not too bad, but a log scale spreads the trials evenly across orders of magnitude, which makes the search faster.
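To see why the scale matters, the sketch below compares the two sampling schemes over the range $[10^{-4}, 1]$. The range, sample count and threshold are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Linear scale: sample the value directly in [1e-4, 1].
uniform = rng.uniform(1e-4, 1.0, size=n)

# Log scale: sample the exponent uniformly in [-4, 0].
log_uniform = 10 ** rng.uniform(-4.0, 0.0, size=n)

def fraction_below(samples, threshold):
    """Share of samples that fall below the threshold."""
    return float(np.mean(samples < threshold))

frac_uniform = fraction_below(uniform, 1e-1)
frac_log = fraction_below(log_uniform, 1e-1)
print(f"linear scale: {frac_uniform:.3f} of samples below 0.1")
print(f"log scale:    {frac_log:.3f} of samples below 0.1")
```

On the linear scale only about a tenth of the trials land below 0.1, even though that interval covers three of the four decades. On the log scale about three quarters do, so each decade gets a similar share of the trials.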


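## 1.3 Hyperparameter Tuning in Practice: Pandas vs. Caviar

- Re-test hyperparameters occasionally.
- Babysitting one model (the "panda" approach)
    - Train a single model and keep adjusting its hyperparameters day by day while watching its performance.
- Training many models in parallel (the "caviar" approach)
    - Train many models with different settings at once and pick the one that performs best.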

![](./imgs/pandas-caviar.jpg)


# 2. Batch Normalization

## 2.1 Normalizing Activations in a Network 

**Batch Norm:**
- Normalize the activation inputs ($Z^{[l]}$) so that the parameters of the next layer train faster.

- For a given layer $l$, over the $m$ examples of a mini-batch:

$$ u = \frac{1}{m} \; \sum_{i} \; z^{(i)} $$
$$ \sigma^{2} = \frac{1}{m} \; \sum_{i} \; (z^{(i)} - u)^{2} $$
$$ z_{norm}^{(i)} = \frac{z^{(i)} - u}{\sqrt{\sigma^2 + \epsilon}} $$

$$ \tilde{z}^{(i)} = \gamma \; z_{norm}^{(i)} + \beta $$

where $\gamma, \beta$ are learnable parameters of the model.
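The four equations above translate almost line by line into NumPy. This is a minimal sketch of the forward pass only; the function name, the shapes and the value of `eps` are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch Norm forward pass for one layer.

    Z has shape (n_units, m); gamma and beta have shape (n_units, 1).
    """
    u = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)       # per-unit variance (divides by m)
    Z_norm = (Z - u) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta          # learnable scale and shift
    return Z_tilde, Z_norm, u, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))   # 4 units, 64 examples
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)

Z_tilde, Z_norm, u, var = batch_norm_forward(Z, gamma, beta)
print("mean of Z_tilde per unit:", Z_tilde.mean(axis=1))
print("std of Z_tilde per unit: ", Z_tilde.std(axis=1))
```

After the scale and shift, each unit has mean close to `beta` and standard deviation close to `gamma`, whatever the statistics of the raw `Z` were.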

**Note:**

- Normalizing the hidden-layer values lets the deeper layers learn faster as well.
- At the same time, we do not want the mean and variance of every hidden layer to be forced to 0 and 1. The two learnable parameters $\gamma$ and $\beta$ let each hidden unit settle on its own mean and variance, so that different units can learn different features.

## 2.2 Fitting Batch Norm into a Neural Network

**$\beta^{[l]}, \gamma^{[l]}$ parameters are added to the model**
- **They are not hyperparameters to tune but parameters to learn**.
    - Each update step computes $dW^{[l]}, d\gamma^{[l]}, d\beta^{[l]}$
- Any optimization algorithm (gradient descent, momentum, RMSprop, Adam) can update these two learnable parameters.
- With mini-batches, the mean and variance are computed from the current mini-batch only, and $\beta, \gamma$ are updated once per mini-batch.
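As an illustration of treating $\gamma$ and $\beta$ as ordinary parameters, the sketch below computes $d\gamma$ and $d\beta$ for the scale-and-shift step and applies one plain gradient-descent update. The toy cost, the learning rate and the random stand-in for $Z_{norm}$ are assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, m = 3, 8
Z_norm = rng.normal(size=(n_units, m))   # stand-in for normalized values
gamma = np.ones((n_units, 1))
beta = np.zeros((n_units, 1))

def toy_cost(gamma, beta):
    """Hypothetical cost: half the mean squared value of Z_tilde."""
    Z_tilde = gamma * Z_norm + beta
    return 0.5 * np.sum(Z_tilde ** 2) / m

# Backward pass through Z_tilde = gamma * Z_norm + beta.
Z_tilde = gamma * Z_norm + beta
dZ_tilde = Z_tilde / m                                      # gradient of the toy cost
dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)   # same shape as gamma
dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)             # same shape as beta

# One gradient-descent step, exactly as for W.
learning_rate = 0.1
cost_before = toy_cost(gamma, beta)
gamma = gamma - learning_rate * dgamma
beta = beta - learning_rate * dbeta
cost_after = toy_cost(gamma, beta)
print(cost_before, cost_after)
```

In a full network, `dZ_tilde` would come from backpropagation through the later layers rather than from this toy cost.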

**Note:**
- In Batch Norm, the bias parameter $b$ can be removed, because $b$ is just a constant added to every $z$ and is cancelled when the mean is subtracted during normalization. Thus, the single parameter $\beta$ is enough to control the mean.

$$Z^{[l]} = W^{[l]} a^{[l-1]} $$

$$ Z^{[l]}_{norm} = \frac{Z^{[l]} - u}{\sqrt{\sigma^2 + \epsilon}} $$

$$ \tilde{Z}^{[l]} = \gamma^{[l]} Z^{[l]}_{norm} + \beta^{[l]} $$

where $\gamma^{[l]}, \beta^{[l]}$ are all $(n^{[l]}, 1)$
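A quick numerical check of the claim that $b$ drops out: normalizing $W a$ and normalizing $W a + b$ give the same result. The shapes and random values are assumptions for the example.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Subtract the per-unit mean and divide by the per-unit std."""
    u = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - u) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))         # 3 units, 5 inputs
A_prev = rng.normal(size=(5, 16))   # 16 examples in the mini-batch
b = rng.normal(size=(3, 1))         # one bias per unit

without_b = normalize(W @ A_prev)
with_b = normalize(W @ A_prev + b)

max_diff = float(np.max(np.abs(without_b - with_b)))
print("largest difference:", max_diff)
```

The difference is at the level of floating-point round-off, because adding `b` shifts the mean by the same amount and leaves the variance unchanged.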


## 2.3 Why does Batch Norm Work?

1. **Like normalizing the training inputs, Batch Norm can make the hidden units learn faster.** 
2. **It makes the deeper layers more robust to changes in the earlier layers**. It limits how much the earlier layers can shift the mean and variance of the values seen by the current layer. 
    - The mean and variance of a layer's inputs are influenced by all previous layers. 
    - Batch Norm keeps those inputs around a fixed mean and variance (0 and 1 before scaling), so the distribution stays "stable".
    - Thus, Batch Norm reduces the problem of the input values shifting, so **the later layers do not have to keep adapting to changes in the earlier layers**. The coupling between layers is weakened and each layer can learn more independently. 
3. **It has a slight regularization effect.** Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the values within that mini-batch, similar to what dropout does.


## 2.4 Batch Norm at Test Time

At test time we cannot compute a mini-batch mean/variance for a single test sample, so we **estimate $u, \sigma^{2}$ using an exponentially weighted average across mini-batches**.

- From the training set, we can obtain: 
$$ u^{\{1\}[l]}, u^{\{2\}[l]}, u^{\{3\}[l]}, \dots $$
$$ \sigma^{2 \; \{1\}[l]}, \sigma^{2 \; \{2\}[l]}, \sigma^{2 \; \{3\}[l]}, \dots $$
- An exponentially weighted average of these values gives the $u, \sigma^{2}$ used for that layer at test time.
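One way to sketch this in NumPy is to keep running estimates during training and reuse them for a single test example. The decay rate `momentum = 0.9`, the synthetic mini-batches and all names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, batch_size = 4, 32
momentum = 0.9   # assumed decay rate of the exponentially weighted average

running_u = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# "Training": each mini-batch contributes its own mean and variance.
for t in range(200):
    Z = rng.normal(loc=3.0, scale=2.0, size=(n_units, batch_size))
    u = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_u = momentum * running_u + (1 - momentum) * u
    running_var = momentum * running_var + (1 - momentum) * var

def batch_norm_test(z, gamma, beta, eps=1e-8):
    """Normalize with the running estimates instead of batch statistics."""
    z_norm = (z - running_u) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

# A single test example located at the true mean of the synthetic data.
z_test = np.full((n_units, 1), 3.0)
out = batch_norm_test(z_test,
                      gamma=np.ones((n_units, 1)),
                      beta=np.zeros((n_units, 1)))
print("running mean:", running_u.ravel())
print("normalized test example:", out.ravel())
```

Because the running mean settles near the true mean of the data, the test example is mapped close to zero even though no mini-batch is available at test time.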


# 3. Multi-class Classification

## 3.1 Softmax Regression

**Notation:**
- C = #classes = $n^{[L]}$ 
- Output Size: $\hat{Y} : (C, m)$

**Softmax Regression:**
$$ Z^{[L]} = W^{[L]} \; a^{[L-1]} + b^{[L]} $$

$$ t = e^{Z^{[L]}} $$
$$ a^{[L]} = \frac{t}{\sum_{j} \; t_{j}} $$
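A minimal NumPy sketch of the softmax activation follows. Subtracting the column maximum before exponentiating is an added numerical-stability step that is not part of the notes; it does not change the result because the same factor cancels in the ratio. The input values are arbitrary.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    Z_shifted = Z - Z.max(axis=0, keepdims=True)   # stability only
    t = np.exp(Z_shifted)
    return t / t.sum(axis=0, keepdims=True)

# C = 4 classes, m = 2 examples.
Z = np.array([[5.0, 1.0],
              [2.0, 1.0],
              [-1.0, 1.0],
              [3.0, 1.0]])
A = softmax(Z)
print(A)
print("column sums:", A.sum(axis=0))
```

Each column of `A` is a probability distribution over the classes: the first example puts most of its mass on class 0, while the second, with equal scores, is split evenly.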



## 3.2 Training a Softmax Regression

**Understanding Softmax:**
- Softmax regression generalizes logistic regression to C classes. 


**Loss Function:**

$$ \mathcal{L}(\hat{y}, y) = - \sum_{j=1}^{C} \; y_{j}\log{\hat{y}_{j}} $$


**Cost Function:**

$$ J(W^{[1]}, b^{[1]}, ...) = \frac{1}{m} \; \sum_{i=1}^{m} \; \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) $$


**Derivatives of Softmax Regression:**




$$ dZ^{[L]} = \frac{\partial{J}}{\partial{Z^{[L]}}} =  \hat{y} - y $$
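The result $dZ^{[L]} = \hat{y} - y$ can be checked numerically for a single example by comparing it with a finite-difference gradient of the loss. The scores `z`, the one-hot label `y` and the step size are assumptions for the example.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())
    return t / t.sum()

def loss(z, y):
    """Cross-entropy loss of softmax(z) against a one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5, 0.0])   # scores for C = 4 classes
y = np.array([0.0, 0.0, 1.0, 0.0])    # true class is index 2

# Analytic gradient from the formula above.
analytic = softmax(z) - y

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    numeric[j] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

max_error = float(np.max(np.abs(analytic - numeric)))
print("largest deviation:", max_error)
```

The gradient is negative only for the true class, so a descent step raises that class's score and lowers the others.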

# 4. Introduction to Programming Frameworks 

**Deep Learning Framework:**
- Caffe/Caffe2
- CNTK
- DL4J 
- Keras
- Lasagne
- MXNet
- PaddlePaddle
- TensorFlow
- Theano
- Torch

Criteria for choosing a framework: 
- Ease of programming
- Running speed
- Truly open, with good governance


**TensorFlow**

- An efficient tool: you only need to specify the forward propagation, and the backward propagation is computed automatically by the built-in functions.


- Example: how to minimize a cost function 

![](./imgs/tensorflow-eg.jpg)

**Note:**
- The `with` statement is better at cleaning up: the session is closed even if an error or exception occurs while executing the inner block. 
- `tf.placeholder` reserves a slot in the formula, while `feed_dict={x:coef}` feeds data into that placeholder. This makes it easy to switch the input data.


---

# Quiz

**Batch Norm:**

- There is no single optimal combination of $\gamma, \beta$ known in advance. Both are learned by minimizing the cost function.

---

# Assignments

# 1. Basic Operations in TensorFlow

## 1.1 Run an Operation

```python
import tensorflow as tf  # TensorFlow 1.x API

a = tf.constant(2)
b = tf.constant(10)
c = tf.multiply(a, b)
print(c)
```

```python
Tensor("Mul:0", shape=(), dtype=int32)
```


As expected, you will not see 20! You get a description of a tensor instead: the result has an empty shape (it is a scalar) and is of type "int32". 
- **All you did was add the operation to the 'computation graph'; you have not run this computation yet.** 
- In order to actually multiply the two numbers, you have to create a session and run it.

```python
sess = tf.Session()
print(sess.run(c))
```

```python
20
```

To summarize, remember to **initialize your variables, create a session and run the operations inside the session**.


## 1.2 Placeholder

```python
# Change the value of x in the feed_dict

x = tf.placeholder(tf.int64, name = 'x')
print(sess.run(2 * x, feed_dict = {x: 3}))
sess.close()
```

```python
6
```

When you first defined x you did not have to specify a value for it. **A placeholder is simply a variable that you will assign data to only later, when running the session.** We say that you feed data to these placeholders when running the session.

Here's what's happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.

# 2. Sign Recognition

**Problem:**

- To teach our computers to decipher sign language. It's now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn't understand sign language.


**Data:**
- Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number).
- Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number).

Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs.

Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.

![](./imgs/hw-data.png)


**Model:**
- Two hidden layers with ReLU and one output layer with softmax.
- Layer sizes are [25, 12, 6]: two hidden layers with 25 and 12 units, and an output layer with 6 units (one per class).
- Learning rate: 0.0001
- #epochs: 1500
- Mini-batch size: 32
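As a rough sanity check on this setup, the sketch below derives the parameter shapes and counts from the layer sizes, plus the number of mini-batch updates. It assumes the input is a flattened 64 x 64 RGB image (12288 values, matching the `W1` shape used in the assignment) and that the last, smaller mini-batch of each epoch is kept.

```python
import math

n_x = 64 * 64 * 3                  # flattened RGB image
layer_sizes = [n_x, 25, 12, 6]     # input, two hidden layers, output

shapes = {}
total_params = 0
for l in range(1, len(layer_sizes)):
    n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
    shapes[f"W{l}"] = (n_curr, n_prev)   # weights of layer l
    shapes[f"b{l}"] = (n_curr, 1)        # bias of layer l
    total_params += n_curr * n_prev + n_curr

m_train, batch_size, num_epochs = 1080, 32, 1500
batches_per_epoch = math.ceil(m_train / batch_size)   # keeps the partial batch
total_updates = batches_per_epoch * num_epochs

print(shapes)
print("trainable parameters:", total_params)
print("mini-batches per epoch:", batches_per_epoch)
print("parameter updates in total:", total_updates)
```

Almost all of the parameters sit in `W1`, because the first layer connects every input pixel value to each of the 25 hidden units.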


**Programming Note:**
1. Make sure the data type of the parameters is the same as that of the input data. During initialization and forward propagation, you can make them consistent explicitly: 

```python
# During parameter initialization
W1 = tf.get_variable("W1", [25, 12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1, dtype = tf.float32))
b1 = tf.get_variable("b1", [25, 1], initializer = tf.zeros_initializer(dtype = tf.float32))

# During forward propagation
Z1 = tf.add(tf.matmul(W1, tf.cast(X, tf.float32)), b1)
```

2. Some functions require their arguments to be passed by name (as keyword arguments), so don't omit the argument names. 


**Accuracy:**

![](./imgs/hw-acc.jpg)

<font color='blue'>
    
**What you should remember**:
- TensorFlow is a programming framework used in deep learning
- The two main object classes in TensorFlow are Tensors and Operators. 
- When you code in tensorflow you have to take the following steps:
    - Create a graph containing Tensors (Variables, Placeholders ...) and Operations (tf.matmul, tf.add, ...)
    - Create a session
    - Initialize the session
    - Run the session to execute the graph
- You can execute the graph multiple times as you've seen in model()
- The backpropagation and optimization are done automatically when running the session on the "optimizer" object.