### Rashab and Francesco
- deep learning researcher at Stanford working with multiple labs
- background in deep learning at Harvard and Stanford
- bioinformatics

# Shapes

$$ X_t \cdot \mathbf{U} $$
(batch, 5) x (5,L) = (batch, L)

For recurrent neural nets, there is an additional term $h_{t-1} \cdot W$,
where W is (L, L) **because all nodes recurrently feed into all other nodes in the layer!**
and h is (batch, L)


If the input is (5,1) and the layer has 3 nodes and is recurrent, the output will be 3 numbers.
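
The shape bookkeeping above can be checked with a small NumPy sketch. The sizes (a batch of 4, L = 3) and the random values are illustrative assumptions, and the tanh nonlinearity is borrowed from the recurrent section later in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_features, L = 4, 5, 3                # illustrative sizes, not from the notes

X_t = rng.normal(size=(batch, n_features))    # input at time t: (batch, 5)
U = rng.normal(size=(n_features, L))          # input weights: (5, L)
W = rng.normal(size=(L, L))                   # recurrent weights: (L, L)
h_prev = np.zeros((batch, L))                 # previous hidden state: (batch, L)

# Both terms are (batch, L), so they can be added before the nonlinearity.
h_t = np.tanh(X_t @ U + h_prev @ W)
print(h_t.shape)
```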

## Bias trick

- no difference between bias and weight
- bias trick is to append a constant 1 to every input example, so the bias is learned as just another weight
- learns information that is input-independent
- for example, when object doesn't move in image, the bias will learn to detect the object (or lack thereof) in that location
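
A minimal sketch of the bias trick, assuming a single dense layer with made-up sizes: appending a column of 1's to the inputs and stacking the bias under the weight matrix gives the same output as adding the bias explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))     # 4 examples, 5 features (illustrative)
W = rng.normal(size=(5, 3))     # weights for a layer with 3 nodes
b = rng.normal(size=(3,))       # one bias per node

# Bias trick: add a constant-1 feature and treat b as one more row of weights.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # (4, 6)
W_aug = np.vstack([W, b])                          # (6, 3)

explicit = X @ W + b
folded = X_aug @ W_aug
print(np.allclose(explicit, folded))
```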

*The bias node/term is there only to ensure the predicted output will be unbiased. If your input has a dynamic (range) that goes from -1 to +1 and your output is simply a translation of the input by +3, a neural net with a bias term will simply have the bias neuron with a non-zero weight while the others will be zero. If you do not have a bias neuron in that situation, all the activation functions and weights will be optimized so as to mimic at best a simple addition, using sigmoids/tangents and multiplication.*

*If both your inputs and outputs have the same range, say from -1 to +1, then the bias term will probably not be useful.*

*You could have a look at the weight of the bias node in the experiment you mention. Either it is very low, and it probably means the inputs and outputs are centered already. Or it is significant, and I would bet that the variance of the other weights is reduced, leading to a more stable (and less prone to overfitting) neural net.*

## Log loss

$$ -\frac{1}{n} \sum_{batch} \sum_{labels} y_{label} \ln( \hat{y}_{label} )$$
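
As a rough illustration of the formula, here is a plain-Python sketch that assumes one-hot labels and predicted probabilities per class. The `eps` clamp is an added safeguard against `log(0)` and is not part of the formula; the example values are made up.

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot labels."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):       # sum over batch
        for y, y_hat in zip(true_row, pred_row):         # sum over labels
            total += y * math.log(max(y_hat, eps))       # eps avoids log(0)
    return -total / len(y_true)

y_true = [[1, 0, 0], [0, 1, 0]]                  # made-up one-hot labels
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]      # made-up predicted probabilities
loss = log_loss(y_true, y_pred)
print(round(loss, 4))
```

Only the probability assigned to the true class contributes, since every other term is multiplied by a label of 0.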

## Learning rate parameters

- increasing batch size has an effect similar to decreasing the learning rate (both make the parameter updates less noisy)

## CNNs
### What changes between a fully connected network (FCN) and a CNN?

#### receptive field

#### kernel/filter
- looks at relationships between pixels 
- instead of using one weight per pixel, each node has only as many weights as the filter has entries
- e.g. for a 9x9 pixel image, a fully connected node has 81 weights, while a CNN node has 9 weights if you use a 3x3 filter
- in practice, the weights of the individual nodes are tied (shared), and each filter outputs a feature map (the result of the convolution)
- if you wanted to draw it the way you draw a dense network, a 9x9 image with a 3x3 filter (stride 1, no padding) would have n x n nodes per filter in the next layer, where n = 9-3+1 = 7, i.e. 49 nodes
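
The 9x9 image and 3x3 filter example can be sketched in NumPy as below, assuming stride 1 and no padding. Like most deep learning libraries, it slides the kernel without flipping it (strictly a cross-correlation). The averaging kernel and the image values are arbitrary choices for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one square kernel over a square image, stride 1, no padding."""
    k = kernel.shape[0]
    n = image.shape[0] - k + 1                  # 9 - 3 + 1 = 7
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # every output node reuses the same k*k weights
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(81, dtype=float).reshape(9, 9)
kernel = np.ones((3, 3)) / 9.0                  # 3x3 averaging filter: 9 weights
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape, kernel.size)
```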

#### Shape of weight tensor
$$ (H_F, W_F, C_{input}, C_{output})$$
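
To make the weight tensor shape concrete, this small sketch counts trainable values for one layer. The helper names and the example sizes (3x3 filters, 3 input channels, 32 output channels) are hypothetical, and one bias per output channel is assumed.

```python
def conv_weight_shape(filter_h, filter_w, c_in, c_out):
    """Weight tensor shape in (H_F, W_F, C_input, C_output) order."""
    return (filter_h, filter_w, c_in, c_out)

def count_weights(shape, with_bias=True):
    """All filter weights, plus one bias per output channel if requested."""
    h, w, c_in, c_out = shape
    return h * w * c_in * c_out + (c_out if with_bias else 0)

shape = conv_weight_shape(3, 3, 3, 32)   # illustrative: 3x3 filters, RGB in, 32 maps out
print(shape, count_weights(shape))
```

Note that the count does not depend on the image size, only on the filter size and the channel counts.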

#### pooling
- down-sampling to reduce the resolution of the feature maps
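
One common form of pooling is 2x2 max pooling with stride 2, sketched below in NumPy. The notes do not say which pooling variant was meant, so treat this as one possible instance; the input values are made up and the height and width are assumed to be even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = feature_map.shape
    # Split into non-overlapping 2x2 blocks, then take the max of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 8, 1, 0],
               [7, 6, 3, 2]])
pooled = max_pool_2x2(fm)
print(pooled.shape)
```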

#### ImageNet
- a labeled dataset that has given rise to many pre-trained models, e.g. VGG or ResNet
- used to be a benchmark

#### Sequence problems
- language translation is a many-to-many problem
$$ y_t = f(v x_t + u y_{t-1})$$

- so there are also special weights associated with the previous output that figure into the next prediction

#### Attention layers
- take in a corpus (e.g. imagery or text) and a question to be answered
- could this be used to answer multiple questions from same image?

#### Architecture tips
- Conv, pooling, ReLU, Conv, pooling, ReLU, ..., Flatten, Dense, outputs
- don't put ReLU before max pooling; it gives the same output as Pool > ReLU but is more expensive
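
The claim about ordering can be checked numerically for max pooling, since ReLU never changes the order of values and so commutes with taking a maximum. This sketch assumes 2x2 max pooling on a random 8x8 input; the equivalence does not hold for average pooling.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))

relu_then_pool = max_pool_2x2(relu(x))   # ReLU evaluated on 64 values
pool_then_relu = relu(max_pool_2x2(x))   # ReLU evaluated on only 16 values
print(np.array_equal(relu_then_pool, pool_then_relu))
```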


## Features from text

- have a dictionary where each word is encoded by an index (a compact stand-in for a one-hot vector)
- but this doesn't capture relationships between words
- instead, use an embedding to go from a simple index to a dense vector for each word. You choose the dimensions, but typical choices are 50-100 values per word
- so words with similar meanings should end up with similar vectors
- and you can have relationships between words, indicated by **parallel vectors**
- can put these vectors as first layer in network and learn their weights
- so the size of the first layer is (number of words) x (vector length per word)
- word2vec learns vectors directly by predicting co-occurrences of words in context
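
A toy sketch of the embedding lookup described above, assuming a four-word vocabulary and random (untrained) vectors. It shows that looking up a row by index gives the same result as multiplying a one-hot vector by the embedding matrix, and that the layer holds (number of words) x (vector length) values. Similar or parallel vectors only appear after training, so they are not demonstrated here.

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}   # toy dictionary: word -> index
embedding_dim = 50                                       # chosen by you
rng = np.random.default_rng(0)

# One row per word; random here, learned during training in practice.
E = rng.normal(size=(len(vocab), embedding_dim))

idx = vocab["queen"]
vector = E[idx]                    # lookup by index

one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0
same_vector = one_hot @ E          # one-hot times matrix selects the same row

print(E.size, np.allclose(vector, same_vector))
```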

## Getting set up
- https://www.dropbox.com/sh/rptnshr1j0nqmh6/AACbihf2aYQNG6rzH9Wh29eLa?dl=0

- conda update conda
- conda env create
- source activate dataweekends
- jupyter notebook


### Keras versus Tensorflow
- Keras is a high-level API that is easy for beginners
- Keras runs on top of Theano or TensorFlow
- Keras is less flexible but also less verbose
- Keras works well for training production models, but you might not want to use it if you need:
    - serving/hosting on the cloud
    - a custom loss function
    - a custom architecture

# Transfer learning

### Motivation
- want to learn a new model, but don't have a lot of data
- want to make use of features learned for another model

### How to do it
- if you want to predict same number of classes, don't change anything
- imagining features from low-level (early hidden layers) to high-level, how far into the architecture do you think your new problem will diverge from the old problem?
- freeze the layers you don't want to retrain (their weights stay the same during training), but keep the last softmax layer trainable. The frozen layers give you bottleneck features, which then feed a dense (fully connected) network at the end
- better to start with a model pre-trained on **many** categories, not just dogs vs cats
- un-freeze end layers, or beginning layers, or both
- at a minimum, re-train all layers after convolutional/pooling
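
The freezing idea can be sketched without any deep learning framework. Below, a fixed random matrix stands in for the pre-trained layers (an assumption made purely for illustration, not a real pre-trained model), and only a small logistic head is updated by gradient descent. The data, labels, learning rate, and step count are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained layers: frozen weights that produce bottleneck features.
W_frozen = rng.normal(size=(10, 6))
W_frozen_before = W_frozen.copy()

# New trainable head for the new problem (binary classification here).
w_head = np.zeros(6)

X = rng.normal(size=(200, 10))            # toy inputs for the new task
y = (X[:, 0] > 0).astype(float)           # toy labels

def bottleneck(inputs):
    return np.maximum(inputs @ W_frozen, 0)       # frozen layer followed by ReLU

def loss_and_grad(w):
    feats = bottleneck(X)
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))        # sigmoid output
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = feats.T @ (p - y) / len(y)             # gradient w.r.t. the head only
    return loss, grad

initial_loss, _ = loss_and_grad(w_head)
for _ in range(300):
    _, grad = loss_and_grad(w_head)
    w_head -= 0.01 * grad                         # W_frozen is never touched

final_loss, _ = loss_and_grad(w_head)
print(final_loss < initial_loss, np.array_equal(W_frozen, W_frozen_before))
```

In a real framework the same effect usually comes from marking layers as non-trainable rather than from writing the update loop by hand.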

### Image augmentation
- instead of augmenting images by rotation, translation, etc., you can add noise to the dense layers to "augment"

# Saving models
- Used to be PMML
- Now ONNX (pronounced onyx)

# Recurrent neural networks

### Activation function
- Notice that weights w and u don't depend on time
- first order filter with nonlinear function on top
- kind of like a moving average
$$ h_t^1 = \tanh(w^1 h^1_{t-1} + u^1 x_t)$$

$$ h_t^2 = \tanh(w^2 h^2_{t-1} + u^2 h^1_t)$$
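
The two recurrences can be run directly on scalars. In this sketch the weight values, the zero initial states, and the impulse input are assumptions chosen for illustration; feeding a single 1 followed by zeros shows the fading memory that makes the layer behave somewhat like a moving average.

```python
import math

def stacked_rnn(xs, w1, u1, w2, u2):
    """Run a two-layer scalar RNN over a sequence; return the layer-2 states."""
    h1, h2 = 0.0, 0.0                             # initial hidden states (assumed zero)
    outputs = []
    for x_t in xs:
        h1 = math.tanh(w1 * h1 + u1 * x_t)        # layer 1 sees the input
        h2 = math.tanh(w2 * h2 + u2 * h1)         # layer 2 sees layer 1's new state
        outputs.append(h2)
    return outputs

# The same four weights are reused at every time step.
states = stacked_rnn([1.0, 0.0, 0.0, 0.0], w1=0.5, u1=1.0, w2=0.5, u2=1.0)
print([round(s, 3) for s in states])
```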

### How to train
- if you want to capture seasonality, need to train with sequence at least as long as the period

# Deployment

- TF Lite: for deploying to mobile phones
- TF.js: for deploying in the browser
- TF Serving: Google's native way of serving models
    - has continuous training pipeline
    - deploy multiple models
    - not easy, so here are some other options:
- AWS SageMaker
    - hosted jupyter notebook where you define models
    - training engine where you launch model for training
    - endpoint declaration to create a model API
- PipelineAI
    - simpler but more experimental (right now) than SageMaker
- FloydHub
    - deploy by running "floyd run --mode serve" in command line
    - deploys behind Flask app
    - not the most performant but very simple
- Google Cloud ML
    - Cloud Datalab (like Jupyter notebook)
    - very similar to SageMaker
    - train on GCP, deploy to API endpoint
- Determined AI
    - paid product that manages deployments for you
    
#### Example using flask and jquery to deploy a model
github.com/ghego/tensorflow-mnist