# Early lessons from fast.ai, Part I

Clayton Yochum - Methods Consultants

## Not the droids you're looking for

I'm not going to cover
- math of deep learning
- DL for structured problems (e.g. timeseries)
- NLP anything
- embeddings (this should probably be its own talk at some point)
- how to implement any of this (minimal code)

![obi-wan](img/droids.gif)

## What is fast.ai?

Umbrella group for courses taught by Jeremy Howard & Rachel Thomas
- super smart people, lots of great content on their blog and twitter feeds

I took their "Deep Learning for Coders, Part 1, v2" last fall
- easily the best class on this stuff
- only pre-reqs are 1 year of programming experience and high school algebra

Other courses:
- machine learning (possibly incomplete)
- computational linear algebra

Heavy emphasis on top-down approach to teaching
- do cool stuff _first_, unravel details as course progresses
- the opposite of how almost all traditional schooling operates
- increasingly popular paradigm in data science and technology
- I love it

## What is deep learning?

I'm not going to cover the mechanics, but at a high level, DL models
- are just big, fancy machine learning models with a ton of parameters
    - multi-layer neural networks
- have led to many breakthroughs in the last ~5 years, including
    - image recognition
    - language translation
    - game playing
    - tons of other areas
- come in many different forms
    - fully connected (DNN/ANN)
    - CNN (**C**onvolutional, common for images)
    - RNN (**R**ecurrent, common for language/text understanding)
    - many more

## Deep learning is (was?) hard

Neural networks are an old idea, but deep versions didn't catch on until recently because now we have
- lots of (labeled) data
- lots of compute (GPUs!)
- better software (CUDA/cuDNN, Caffe/Torch/TensorFlow/Keras/PyTorch/etc.)
- better algorithms (optimizers, efficient batching, network types)

Still, it's _hard_ to build a well-performing DL model
- many complex, interconnected hyperparameters to consider
    - architecture is hard; not just "tune these 10 knobs"
    - DL models are defined as layers or DAGs, with nearly endless possibilities

fast.ai tries to make this as easy as possible
- top-down approach
    - built a near-perfect image classifier during first lesson
- emphasis on tools
    - python, jupyter, numpy, tmux, remote linux servers w/ GPUs (AWS, Crestle, Paperspace)
- new `fastai` python library
    - like Keras, but for PyTorch (high-level abstraction)
    - built-in techniques not available anywhere else (yet)
    - 1 command in PyTorch tends to be 3-5 in Keras

## Trix are for Kids

![not rabbits](img/trix-rabbit.jpg)

...and also deep learning


## Image Classification is now easy

(for simple problems, at least; domains like medical imaging get a bit trickier)

We finally have enough techniques, tools, and tricks that combine to make it easy to get great results:
- pre-trained networks
- data augmentation
- learning rate annealing
- SGDR: stochastic gradient descent _with restarts_ (new-ish)
- learning rate finders (new)
- differential learning rates (new)
- easily repeatable workflow process (new)


## The power of pre-trained networks

Computer vision is the most obvious place where deep learning excels
- "is this a cat or a dog?"

This is largely due to using _pre-trained networks_, where someone else built a network for a different image task
- most common are networks trained on ImageNet, a collection of images where each belongs to one of 1000 classes like "cat", "dog", "tree", "car", etc.

This works because neural networks learn high-level feature representations of their inputs which also make sense for other inputs
- early layers learn to find lines, curves, dots
- then shapes (circles, squares), patterns (grid, honeycomb), and more (eyes, tires)
- we know this because of **de**convolutional neural nets: https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

So even though a network may have been trained to do 1000-class single-label classification, I can repurpose it to do things like
- distinguish cats from dogs
- determine breeds of dogs
- label satellite images with one or more of hazy, grass, stream, etc.

Re-purposing means replacing the final layer from the pre-trained model with one that corresponds to our problem
- e.g. 2-node softmax to distinguish cats from dogs
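To make "replace the final layer" concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration: `pretrained_features` is a hypothetical stand-in for the frozen pre-trained layers, and `Head` is just a random linear layer plus softmax. A real model would do this in PyTorch, but the shape of the idea is the same: keep the feature extractor, throw away the 1000-class head, bolt on a 2-class one.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pretrained_features(image):
    """Hypothetical stand-in for the frozen, pre-trained layers.

    A real backbone outputs hundreds of learned features; this toy
    version just summarizes a flat list of pixel values.
    """
    return [sum(image) / len(image), max(image), min(image)]

class Head:
    """Final layer: one (random, untrained) weight row per class, then softmax."""

    def __init__(self, n_features, n_classes, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                        for _ in range(n_classes)]

    def __call__(self, features):
        scores = [sum(w * f for w, f in zip(row, features))
                  for row in self.weights]
        return softmax(scores)

old_head = Head(n_features=3, n_classes=1000)  # what the network shipped with
new_head = Head(n_features=3, n_classes=2)     # cat vs. dog

features = pretrained_features([0.1, 0.5, 0.9])  # the backbone is reused as-is
probs = new_head(features)                       # two probabilities: [p_cat, p_dog]
print(probs)
```

Only `new_head` would need training at first; the features come for free.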

Pre-trained networks aren't _just_ for images, but it's a lot easier here
- two random images are much more similar than e.g. two random audio snippets
- pre-trained word embeddings are one non-image example

## Data Augmentation

Deep learning does better the more data it has, but data collection is hard/expensive. One way to get more data is to modify your existing data to create new examples!

For each input image, you can rotate, zoom, crop, flip, etc. to get a slightly different version
- chosen transforms will depend on the kinds of images you have
    - for side-on images of animals, we might do horizontal flips, but not rotations
    - for top-down images you might flip on both axes, rotate ±180°

Thus we can easily double, quadruple, etc. our training set, and make the trained model more robust to different input perturbations
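A minimal sketch of the idea, using nested lists as stand-in "images". The function names are invented for this example, and it produces a fixed set of copies; real libraries typically apply random transforms on the fly instead.

```python
def hflip(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top-to-bottom."""
    return img[::-1]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img, top_down=False):
    """Return the original image plus transformed copies.

    Side-on images only get a horizontal flip; top-down images
    (e.g. satellite photos) also get a vertical flip and rotations.
    """
    versions = [img, hflip(img)]
    if top_down:
        versions += [vflip(img), rot90(img), rot90(rot90(img))]
    return versions

image = [[1, 2],
         [3, 4]]

print(len(augment(image)))                 # side-on: 2 versions
print(len(augment(image, top_down=True)))  # top-down: 5 versions
```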

There's also Test-Time Augmentation (TTA): for each test image, transform it a few times, predict each, average the predictions
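TTA is simple enough to sketch in a few lines. `toy_model` below is a hypothetical classifier invented for this example (it just looks at how bright the left column is), so that flipping the image actually changes its answer; the averaging logic is the part that matters.

```python
def hflip(img):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in img]

def toy_model(img):
    """Hypothetical classifier returning [p_cat, p_dog]."""
    left = sum(row[0] for row in img)
    total = sum(sum(row) for row in img)
    p_cat = left / total
    return [p_cat, 1 - p_cat]

def predict_with_tta(model, img, transforms):
    """Predict on the original and each transformed copy, then average."""
    versions = [img] + [t(img) for t in transforms]
    preds = [model(v) for v in versions]
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]

image = [[1, 3],
         [1, 3]]

single = toy_model(image)
averaged = predict_with_tta(toy_model, image, [hflip])
print(single, averaged)
```

The single prediction is lopsided, but the flipped copy disagrees, so the averaged prediction is less confident.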

fast.ai encourages and enables both types of augmentation

## Learning Rates

If you're using a pre-trained model, you don't have to worry about most hyperparameters, since you're using a network topology chosen by someone else.

But there's still one big one: the learning rate.

Most deep nets learn by updating their parameters via _gradient descent_ and _backpropagation_ (chain rule). We run some images through a model, get some predictions, compare those to the correct answers, then calculate a gradient to point us in a direction that will give us _better_ parameters. The learning rate is _how big of a step we take_.

If your learning rate is _too low_, it'll take forever to get a well-trained model.
If your learning rate is _too big_, you'll jump all around your param space and never find a nice local minimum.
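You can see all three behaviors on the simplest possible "model": one parameter `w` and a loss of `w**2`, whose minimum is at 0. This is a toy sketch, not how a real network is trained, and the specific learning rates are picked just to show the effect.

```python
def descend(lr, w=5.0, steps=20):
    """Minimize loss(w) = w**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # step against the gradient, scaled by lr
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4g}")
```

With the tiny rate `w` has barely moved from 5, with the middle rate it's close to 0, and with the big rate every step overshoots further and `w` blows up.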

Setting appropriate learning rates is the kind of thing that can drive users insane.

## LR Annealing

One technique that's been in common use for some time is _learning rate annealing_: decrease the learning rate as training progresses.

Start with big steps, take smaller and smaller steps as you get closer and closer to a good answer.

_Annealing_ here is like _simulated annealing_, another optimization technique
- swap "temperature" (SA) for "learning rate" (SGD)
- comes from metallurgy!

The learning rate is usually updated after each _batch_ (next slide), and fast.ai uses a cosine function to do this (as opposed to e.g. a straight line); _cosine annealing_
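One common way to write cosine annealing is below. This is a sketch of the general formula, not necessarily the exact schedule inside the `fastai` library, and the function and parameter names are made up for this example.

```python
import math

def cosine_anneal(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, falling from lr_max to lr_min along half a cosine wave."""
    progress = step / total_steps  # 0.0 at the start, 1.0 at the end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Learning rate after each of 10 batches, starting from 0.1
schedule = [cosine_anneal(step, 10, lr_max=0.1) for step in range(11)]
print([round(lr, 4) for lr in schedule])
```

Compared to a straight line, the cosine curve lingers near the high rate early on and near the low rate at the end, with the fastest decay in the middle.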

## SGDR

_Stochastic Gradient Descent with Restarts_

We train on _batches_ of inputs (images) at a time, where we make predictions with the whole batch for a given set of parameters, then use all those predictions to update our weights via gradient descent.

So "stochastic" just means we're updating incrementally, rather than pushing all our training examples through before updating our parameters. _Incremental_ gradient descent.

(We still take multiple passes through our inputs, usually; one use of all inputs is called an _epoch_)

The "R" is a bit of a newer technique: we anneal our learning rate over the course of one or more epochs (make it smaller each batch), but then jump back up to our original rate, and repeat training and annealing for additional epochs.
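Here's a rough sketch of what that schedule looks like as code: cosine annealing within each cycle, then a jump back to the starting rate. The function and its parameter names are chosen for this example (with cycle lengths counted in batches), and the optional multiplier that makes each cycle longer than the last is an assumption about how you might want to stretch later cycles.

```python
import math

def sgdr_schedule(lr_max, cycle_len, n_cycles, cycle_mult=1):
    """Per-batch learning rates with a restart at the start of every cycle.

    cycle_len is the length of the first cycle in batches; each later
    cycle is cycle_mult times longer than the one before it.
    """
    lrs = []
    length = cycle_len
    for _ in range(n_cycles):
        for step in range(length):
            # cosine annealing from lr_max down toward 0 within this cycle
            lrs.append(0.5 * lr_max * (1 + math.cos(math.pi * step / length)))
        length *= cycle_mult
    return lrs

fixed = sgdr_schedule(0.1, cycle_len=4, n_cycles=3)
growing = sgdr_schedule(0.1, cycle_len=4, n_cycles=3, cycle_mult=2)
print(len(fixed), len(growing))
```

In `fixed` the rate snaps back to 0.1 every 4 batches; in `growing` the cycles last 4, 8, then 16 batches.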

## SGDR

Our learning rate ends up in patterns like this:

![fixed](img/sgdr_constant.png)

or this:

![cycle_mult](img/sgdr_cyclical.png)

There's another cool image in this paper on Snapshot ensembles: https://arxiv.org/pdf/1704.00109.pdf
- with a snapshot ensemble, you save the weights at the end of each annealing cycle, and average predictions from each
    - not something we've done in fast.ai, but can be useful

## Learning Rate Finder

So learning rates are important, and it's important to change them during training, but it's still important to start them in a good place. How are we supposed to do that?

Jeremy discovered a trick buried in a paper about something else, and used it to build a _learning rate finder_ into fast.ai.

Surprisingly simple: start the learning rate very low, and do up-to-an-epoch of training, increasing the learning rate exponentially between batches. Stop when the loss is getting way worse or the epoch is over.

Then we start with (roughly) as high a learning-rate as possible, where the loss is still clearly improving (getting lower)
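The procedure fits in a dozen lines. This is a toy sketch on a one-parameter problem (loss of `w**2`), not the `fastai` implementation; `lr_find`, its arguments, and the "4x the best loss" stopping rule are all assumptions made for this example.

```python
def lr_find(loss_fn, grad_fn, w0, lr_start=1e-5, lr_end=10.0,
            n_batches=100, blowup=4.0):
    """Raise the learning rate exponentially each batch, recording (lr, loss).

    Stops early once the loss exceeds `blowup` times the best loss seen.
    Works on a local copy of the weight, so nothing is persisted.
    """
    mult = (lr_end / lr_start) ** (1 / (n_batches - 1))
    w, lr = w0, lr_start
    best = float("inf")
    history = []
    for _ in range(n_batches):
        loss = loss_fn(w)
        history.append((lr, loss))
        if loss > blowup * best:
            break                # loss is getting way worse: stop
        best = min(best, loss)
        w -= lr * grad_fn(w)     # one ordinary gradient descent step
        lr *= mult               # exponential increase between batches
    return history

history = lr_find(loss_fn=lambda w: w ** 2, grad_fn=lambda w: 2 * w, w0=5.0)
print(f"stopped after {len(history)} batches at lr={history[-1][0]:.3g}")
```

You'd then plot loss against learning rate from `history` and eyeball the highest rate where the loss is still clearly dropping.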

## Learning Rate Finder

So the learning rate looks like this during the course of "training" (weight adjustments during learning rate finding are not persisted):

![lr vs time](img/sched-plot-lr.png)

Then we get a plot like

![loss vs lr](img/sched-plot-loss.png)

And in this case we would _start_ our learning rate around 1e-2
- why not 1e-1?

## Differential Learning Rates

Yet another trick not widely in use outside of fast.ai is the concept of _differential learning rates_.

To get world-class performance, we generally train a model several times in a handful of different ways. When we start, we're only tuning the final layer, the part specific to our problem, assuming the pre-trained layers are already pretty good.

Sometimes it makes sense to further tune the earlier layers, particularly if our images are substantially different from the ones the pre-trained model learned from.

However, these layers shouldn't need as much tuning as our last layer, so we want _smaller_ learning rates for earlier layers. Even among the pre-trained layers, we want to change the earlier layers less than the later ones. Thus we want differential learning rates, where different groups of layers learn at different rates.

Apparently this is possible in other DL frameworks, but it's a little easier in fast.ai.
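Stripped of any framework, differential learning rates just mean the update step uses a different rate per group of layers. A toy sketch, with made-up weights, gradients, and rates (each group 10x the previous one is only an illustration):

```python
def sgd_step(layer_groups, grads, lrs):
    """One SGD update where each group of layers gets its own learning rate."""
    return [
        [w - lr * g for w, g in zip(weights, group_grads)]
        for weights, group_grads, lr in zip(layer_groups, grads, lrs)
    ]

# Three groups: early pre-trained layers, later pre-trained layers, new final layer
layer_groups = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # pretend every gradient is 1
lrs = [1e-4, 1e-3, 1e-2]                      # smallest rate for the earliest layers

updated = sgd_step(layer_groups, grads, lrs)
for name, group in zip(("early", "middle", "final"), updated):
    print(name, group)
```

Same gradient everywhere, but the final layer moves 100x further than the earliest layers.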

## Unintuitive High Dimensional Behavior

I really latched on to a minor aside from Jeremy in one of the first lessons:
> most local minima are roughly equivalent to both each other and the global minimum

If this is true, it doesn't really matter which local minimum we're in, and it's a waste of time trying to find the global min!

The relevant paper is from January 2015: https://arxiv.org/pdf/1412.0233.pdf

It's dense (compares DNNs to the Hamiltonian of spherical spin-glass models...), but also kinda mind-blowing.

Several big assumptions:
- variable independence
- redundancy of network parameterization
- uniformity

The basic gist is there's both theoretical and practical support for the notion that in sufficiently large networks, there's a "band" full of pesky saddle-points, but beyond that a band where critical points are nearly all high-quality local minima (increasingly true as network sizes increase).

This is why training a model with an order of magnitude more parameters than data somehow works!