# TOPICS
* Bias in data/incidents in DL
* Linear classifiers
* Softmax, loss
* Lab

### Canary
Before training the full model, run a small-scale trial first.

## DNN
* Score function (forward pass)
* Loss function
* Update function (backward pass)

Weights \* Inputs + Bias = **Scores**
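
As a minimal sketch of the score function, assuming a single linear layer with made-up shapes (4 examples, 3 features, 2 classes), the forward pass is one matrix product plus a broadcast bias. The names `X`, `W`, and `b` and the random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 3))         # inputs: 4 examples, 3 features each (toy data)
W = 0.01 * rng.normal(size=(3, 2))  # weights: 3 features -> 2 classes
b = np.zeros((1, 2))                # bias: one per class, broadcast over rows

scores = X @ W + b                  # forward pass: one row of class scores per example
print(scores.shape)                 # (4, 2)
```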

**Epoch**: An epoch is one complete pass over the training data, which means we've used each example once -> forward + backward.

**Batch**: a non-overlapping chunk of the data
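
The relationship between epochs and batches can be sketched as below. The helper `iterate_batches` is hypothetical, and shuffling the order once per epoch is an assumption (a common choice, not something these notes specify).

```python
import numpy as np

def iterate_batches(X, y, batch_size, rng):
    """Yield non-overlapping (X_batch, y_batch) chunks covering every example once."""
    order = rng.permutation(len(X))              # assumption: shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # the last batch may be smaller
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 toy examples
y = np.arange(10)

epochs = 2
batches_seen = 0
for epoch in range(epochs):
    seen = []
    for X_batch, y_batch in iterate_batches(X, y, batch_size=4, rng=rng):
        seen.extend(y_batch.tolist())   # a real loop would run forward + backward here
        batches_seen += 1
    assert sorted(seen) == list(range(10))  # each example used exactly once per epoch

print(batches_seen)  # 6: three batches (4, 4, 2 examples) in each of 2 epochs
```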

## Activation
### ReLU
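Use ReLU as the activation for hidden layers.

The drawback of sigmoid in gradient-based learning: as $|z|$ grows, the derivative of sigmoid approaches 0, which hampers learning, especially in a DNN (where we multiply a chain of small gradients together).<br>
ReLU is much cheaper to compute: its derivative is either 1 (z>0) or 0 (z<0).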
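
To make the vanishing-gradient point concrete, here is a small sketch comparing the two derivatives. It looks only at the activation derivatives along one path and ignores the weight factors that also enter the chain rule; the depth of 10 is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, reached at z = 0

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for z > 0, 0 otherwise

z = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print(sigmoid_grad(z))  # tiny at |z| = 10
print(relu_grad(z))     # [0. 0. 1. 1. 1.]

# Product of activation derivatives across 10 layers (weights ignored).
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth          # 0.25**10, about 9.5e-07 at best
relu_chain = relu_grad(np.array([2.0]))[0] ** depth  # 1.0 along an active path
print(sigmoid_chain, relu_chain)
```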

### Softmax
For the last layer of a classifier, softmax is the usual choice (multi-class).
Softmax converts scores to probabilities.
It normalizes each output to 0<x<1, and the outputs sum to 1.

$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.

exp is used because it makes every prediction > 0.

In code:<br>
softmax = np.exp(scores) / np.sum(np.exp(scores))

The problem is that higher scores become extremely large once exponentiated, and np.exp can overflow.
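
One common way around this, sketched here as an assumption rather than something these notes prescribe, is to subtract each row's maximum score before exponentiating. The shift cancels in the ratio, so the probabilities are unchanged, but the largest exponent becomes 0.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the max-subtraction trick for numerical stability."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)  # largest entry per row becomes 0
    exp_scores = np.exp(shifted)                              # no entry exceeds 1
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])  # a naive np.exp(1000.0) overflows to inf
probs = softmax(scores)
print(probs)              # both rows give the same probabilities
print(probs.sum(axis=1))  # each row sums to 1
```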

## Loss
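* A number between 0 (perfect) and positive infinity (terrible); smaller is better.
* Measures the difference between the predicted probability distribution and the target probability distribution.
* Cross entropy loss: used for classification.<br>
$L = -\sum_i \hat{y_i} \log (y_i)$, where $L$ is the cross entropy loss for one example, the sum runs over classes $i$, $\hat{y_i}$ is the true prob, and $y_i$ is the predicted prob. The loss for a batch of examples is the average over the examples.
* As the softmax output for the correct class approaches 1, the loss approaches 0, its minimum; as the softmax output for the correct class approaches 0, the loss grows without bound toward +inf.
* With small randomly initialized weights, the predictions are roughly uniform over the classes, so the average loss is about -ln(1/n_classes) = ln(n_classes), e.g. about 2.30 for 10 classes.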
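
The bullets above can be checked with a short sketch. It assumes integer class labels (equivalent to one-hot targets), and the helper name `cross_entropy` and the toy probabilities are illustrative.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Average cross-entropy for integer class labels.

    With one-hot targets, -sum_i y_hat_i * log(y_i) reduces to -log of the
    probability assigned to the correct class.
    """
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

labels = np.array([0, 2])

# Nearly perfect predictions: the loss is close to 0.
confident = np.array([[0.98, 0.01, 0.01],
                      [0.01, 0.01, 0.98]])
print(cross_entropy(confident, labels))

# Uniform predictions over 10 classes: the loss is ln(10), about 2.303.
n_classes = 10
uniform = np.full((2, n_classes), 1.0 / n_classes)
print(cross_entropy(uniform, labels))
```

Comparing the first loss of a real training run against ln(n_classes) is a cheap sanity check on the initialization.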

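**Why use cross entropy rather than prediction error as the loss:**
We want incorrect predictions to be penalized more!
Errors are not all equivalent: a confident wrong prediction should cost more than a nearly right one.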
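
A toy comparison, with made-up probabilities for a two-class problem, illustrates the difference: both sets of predictions below have the same error rate, yet cross entropy separates them.

```python
import numpy as np

labels = np.array([0, 0])   # the true class is 0 for both examples

# Both models misclassify both examples (argmax is class 1), so the error rate ties.
slightly_wrong = np.array([[0.45, 0.55], [0.45, 0.55]])
very_wrong = np.array([[0.01, 0.99], [0.01, 0.99]])

results = {}
for name, probs in [("slightly wrong", slightly_wrong), ("very wrong", very_wrong)]:
    error_rate = np.mean(np.argmax(probs, axis=1) != labels)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    results[name] = (error_rate, ce)
    print(name, error_rate, round(ce, 3))
# Error rate is 1.0 in both cases; cross entropy is about 0.799 vs about 4.605.
```

The error rate is also piecewise constant in the weights, so it gives gradient descent nothing to follow, whereas cross entropy changes smoothly with the predicted probabilities.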






## Implementing the above with MNIST

In [0]:
# Training data
import numpy as np
import tensorflow as tf
mnist = tf.keras.datasets.mnist
# Handwritten Digit Recognition dataset
(x_train, y_train),(x_test, y_test) = mnist.load_data()

In [0]:
input_dim = 28*28
x_train = x_train.reshape((60000, input_dim)) #[60000, 784]

In [0]:
# Initialize weights and bias
hidden_size = 100
output_classes = 10

# Weights from input layer -> hidden layer
W = 0.01 * np.random.randn(input_dim, hidden_size) #[784,100]
b = np.zeros((1, hidden_size)) #[1,100]

# Weights from hidden layer -> output layer
W2 = 0.01 * np.random.randn(hidden_size, output_classes) #[100,10]
b2 = np.zeros((1, output_classes)) #[1,10]

In [0]:
# Begin gradient descent loop
epochs = 5
num_examples = 60000
for i in range(epochs):
  #Forward propagate
  hidden_layer = np.maximum(0, np.dot(x_train, W)+b) #ReLU [60000,100]
  scores = np.dot(hidden_layer, W2) + b2 #[60000,10]
  
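  #Softmax (subtract each row's max before exp to avoid overflow; the result is unchanged)
  exp_scores = np.exp(scores - np.max(scores, axis=1, keepdims=True))
  probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) #[60000,10]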
  
  #Loss
  losses = -np.log(probs[range(num_examples),y_train]) #[60000,] negative log prob of the correct class
  loss = np.sum(losses)/num_examples #scalar: average loss over all examples