In [1]:
using Pkg; Pkg.activate("."); Pkg.instantiate();

  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Installed Adapt ──────────── v0.4.0
 Installed Compat ─────────── v1.3.0
 Installed SpecialFunctions ─ v0.7.1
 Installed StaticArrays ───── v0.9.1
 Installed ForwardDiff ────── v0.10.0
 Installed Flux ───────────── v0.6.8
  Building SpecialFunctions → `~/.julia/packages/SpecialFunctions/sXbz6/deps/build.log`


Deep Learning with Flux: A 60 Minute Blitz
=====================

This is a quick intro to [Flux](https://github.com/FluxML/Flux.jl) loosely
based on [PyTorch's
tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).
It introduces basic Julia programming and Flux's automatic differentiation
(AD), which we'll then use to build a very simple neural network.

Arrays
-------

The starting point for all of our models is the `Array` (sometimes referred to
as a `Tensor` in other frameworks). This is really just a list of numbers,
which might be arranged into a shape like a square. Let's write down an array
with three elements.

In [2]:
x = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

Here's a matrix – a square array with four elements.

In [3]:
x = [1 2; 3 4]

2×2 Array{Int64,2}:
 1  2
 3  4

We often work with arrays of thousands of elements, and don't usually write
them down by hand. Here's how we can create an array of 5×3 = 15 elements,
each a random number from zero to one.

In [4]:
x = rand(5, 3)

5×3 Array{Float64,2}:
 0.235191   0.199408  0.678128
 0.541358   0.321405  0.771729
 0.0140719  0.572822  0.212853
 0.276631   0.367418  0.77131 
 0.0242593  0.678996  0.740179

There are a few functions like this; try replacing `rand` with `ones`, `zeros`,
or `randn` to see what they do.
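
As a quick sketch of what those constructors return (the variable names `a`, `b` and `c` are only for illustration), each one takes the desired dimensions just like `rand`:

```julia
# Each constructor takes the desired dimensions, just like `rand`.
a = ones(2, 3)    # every element is 1.0
b = zeros(2, 3)   # every element is 0.0
c = randn(2, 3)   # samples from a normal distribution (mean 0, standard deviation 1)

println(a)
println(b)
println(size(c))
```

Unlike `rand`, the values from `randn` are not confined to the range zero to one and can be negative.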

By default, Julia stores numbers in a high-precision format called
`Float64`. In ML we often don't need all those digits, and can ask Julia to
work with `Float32` instead. We can even ask for more digits using `BigFloat`.

In [5]:
x = rand(BigFloat, 5, 3)

5×3 Array{BigFloat,2}:
 9.14732e-01  5.01332e-01  5.96429e-01
 2.09742e-01  4.91112e-01  9.07722e-01
 9.17872e-01  4.13105e-01  3.88356e-01
 3.84251e-01  7.75879e-01  7.67142e-02
 9.47119e-01  4.00892e-01  3.18108e-01

In [6]:
x = rand(Float32, 5, 3)

5×3 Array{Float32,2}:
 0.255267   0.3593    0.803596
 0.941981   0.537916  0.668473
 0.588489   0.873691  0.288548
 0.782991   0.67958   0.858299
 0.0474998  0.144007  0.464887
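
If an array already exists in one precision, it can be converted rather than regenerated. The sketch below (with the illustrative names `x64` and `x32`) broadcasts the `Float32` constructor over a `Float64` array and compares how much memory each version uses:

```julia
x64 = rand(5, 3)       # Float64 by default
x32 = Float32.(x64)    # broadcast the Float32 constructor over every element

println(eltype(x64), " uses ", sizeof(x64), " bytes")
println(eltype(x32), " uses ", sizeof(x32), " bytes")
```

Halving the element size halves the memory footprint, at the cost of roughly half the significant digits.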

We can ask the array how many elements it has.

In [7]:
length(x)

15

Or, more specifically, what size it has.

In [8]:
size(x)

(5, 3)

We sometimes want to see some elements of the array on their own.

In [9]:
x

5×3 Array{Float32,2}:
 0.255267   0.3593    0.803596
 0.941981   0.537916  0.668473
 0.588489   0.873691  0.288548
 0.782991   0.67958   0.858299
 0.0474998  0.144007  0.464887

In [10]:
x[2, 3]

0.66847277f0

This gets the element in the second row and third column. We can also get
every row of the third column.

In [11]:
x[:, 3]

5-element Array{Float32,1}:
 0.80359554
 0.66847277
 0.28854823
 0.8582989 
 0.46488714
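
A few more indexing patterns, sketched on a small hand-written matrix `A` (chosen only so the results are easy to check by eye):

```julia
A = [1 2 3; 4 5 6; 7 8 9]

r = A[2, :]        # second row, returned as a vector
s = A[1:2, 2:3]    # rows 1 to 2, columns 2 to 3
e = A[end, 1]      # `end` means the last index along that dimension

println(r)
println(s)
println(e)
```

Ranges such as `1:2` select a block, while a bare `:` selects everything along that dimension.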

We can add arrays, and subtract them, which adds or subtracts each element of
the array.

In [12]:
x + x

5×3 Array{Float32,2}:
 0.510534   0.718599  1.60719 
 1.88396    1.07583   1.33695 
 1.17698    1.74738   0.577096
 1.56598    1.35916   1.7166  
 0.0949996  0.288014  0.929774

In [13]:
x - x

5×3 Array{Float32,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

Julia supports a feature called *broadcasting*, using the `.` syntax. This
tiles small arrays (or single numbers) to fill bigger ones.

In [14]:
x .+ 1

5×3 Array{Float32,2}:
 1.25527  1.3593   1.8036 
 1.94198  1.53792  1.66847
 1.58849  1.87369  1.28855
 1.78299  1.67958  1.8583 
 1.0475   1.14401  1.46489

We can see Julia tile the column vector `1:5` so that it fills every column of
the larger array.

In [15]:
zeros(5,5) .+ (1:5)

5×5 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0
 2.0  2.0  2.0  2.0  2.0
 3.0  3.0  3.0  3.0  3.0
 4.0  4.0  4.0  4.0  4.0
 5.0  5.0  5.0  5.0  5.0

The `x'` syntax is used to transpose a column `1:5` into an equivalent row, and
Julia will tile that row so that it fills every row of the larger array.

In [16]:
zeros(5,5) .+ (1:5)'

5×5 Array{Float64,2}:
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0

We can use this to make a times table.

In [17]:
(1:5) .* (1:5)'

5×5 Array{Int64,2}:
 1   2   3   4   5
 2   4   6   8  10
 3   6   9  12  15
 4   8  12  16  20
 5  10  15  20  25
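
The same dot syntax also works for function calls, not just operators. In the sketch below, `square_plus_one` is a hypothetical helper written for single numbers; adding a `.` applies it to each element:

```julia
x = [1.0, 4.0, 9.0]

roots = sqrt.(x)                 # apply sqrt to each element

square_plus_one(v) = v^2 + 1     # hypothetical helper, defined on single numbers
shifted = square_plus_one.(x)    # applied element by element

table = x .* [10 20]             # 3-element column times 1×2 row gives a 3×2 matrix

println(roots)
println(shifted)
println(table)
```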

Finally, and importantly for machine learning, we can conveniently do things like
matrix multiplication.

In [18]:
W = randn(5, 10)
x = rand(10)
W * x

5-element Array{Float64,1}:
  0.4412423496417188
  0.7211509855721309
 -2.674496439299549 
  1.5088700322596038
  2.104635483288166 
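
To make explicit what `W * x` computes, here is a sketch that rebuilds the product with plain loops: entry `i` of the result is the dot product of row `i` of `W` with `x`. The accumulator `y` is introduced only for this illustration.

```julia
W = randn(5, 10)
x = rand(10)

# Entry i of W * x is the dot product of row i of W with x.
y = zeros(5)
for i in 1:5, j in 1:10
    y[i] += W[i, j] * x[j]
end

println(y ≈ W * x)   # the loop and the built-in product agree up to rounding
```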

Julia's arrays are very powerful, and you can learn more about what they can
do [here](https://docs.julialang.org/en/v1/manual/arrays/).

### CUDA Arrays

CUDA functionality is provided separately by the [CuArrays
package](https://github.com/JuliaGPU/CuArrays.jl). If you have a GPU and CUDA
available, you can run `] add CuArrays` in a REPL or IJulia to get it.

Once CuArrays is loaded you can move any array to the GPU with the `cu`
function, and it supports all of the above operations with the same syntax.

In [19]:
#using CuArrays
# x = cu(rand(5, 3))

Automatic Differentiation
-------------------------

You probably learned to take derivatives in school. We start with a simple
mathematical function like this one:

In [20]:
f(x) = 3x^2 + 2x + 1

f(5)

86

In simple cases it's pretty easy to work out the gradient by hand – here it's
`6x+2`. But it's much easier to make Flux do the work for us!

In [21]:
using Flux.Tracker: derivative

df(x) = derivative(f, x)

df(5)

32.0 (tracked)

You can try this with a few different inputs to make sure it's really the same
as `6x+2`. We can even do this multiple times (but the second derivative is a
fairly boring `6`).

In [22]:
ddf(x) = derivative(df, x)

ddf(5)

6.0 (tracked)
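
One way to cross-check these values without Flux is a central finite difference. This is a numerical approximation rather than automatic differentiation, and both the helper name `finite_diff` and the step size `h` are assumptions made for this sketch:

```julia
f(x) = 3x^2 + 2x + 1

# Central finite difference: a numerical approximation, not Flux's AD.
# `h` is a hypothetical step size chosen for this illustration.
finite_diff(g, x; h = 1e-5) = (g(x + h) - g(x - h)) / (2h)

for x in (0.0, 1.0, 5.0, -2.5)
    println(x, ": ", finite_diff(f, x), " vs ", 6x + 2)
end
```

The two columns should agree to several decimal places for each input.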

Flux's AD can handle any Julia code you throw at it, including loops,
recursion and custom layers, so long as the mathematical functions you call
are differentiable. For example, we can differentiate a Taylor approximation
to the `sin` function.

In [23]:
mysin(x) = sum((-1)^k*x^(1+2k)/factorial(1+2k) for k in 0:5)

x = 0.5

mysin(x), derivative(mysin, x)

(0.4794255386041834, 0.8775825618898637 (tracked))

In [24]:
sin(x), cos(x)

(0.479425538604203, 0.8775825618903728)

You can see that the derivative we calculated is very close to `cos(x)`, as we
expect.
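
To see why, we can differentiate the truncated series by hand, term by term. The sketch below defines `mysin_deriv` (a name made up for this illustration) and compares it with `cos`:

```julia
mysin(x) = sum((-1)^k * x^(1 + 2k) / factorial(1 + 2k) for k in 0:5)

# Differentiating each term by hand:
# d/dx [x^(1+2k) / (1+2k)!] = x^(2k) / (2k)!
mysin_deriv(x) = sum((-1)^k * x^(2k) / factorial(2k) for k in 0:5)

x = 0.5
println((mysin_deriv(x), cos(x)))
```

The hand-derived series is itself a truncated Taylor approximation to `cos`, which is why the result is close to `cos(x)` but not exactly equal.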

This gets more interesting when we consider functions that take *arrays* as
inputs, rather than just a single number. For example, here's a function that
takes a matrix and two vectors (the definition itself is arbitrary):

In [25]:
using Flux.Tracker: gradient

myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

gradient(myloss, W, b, x)

(Flux.Tracker.TrackedReal{Float64}[0.791151 (tracked) 0.0901694 (tracked) … 0.0116059 (tracked) 0.612178 (tracked); 0.791151 (tracked) 0.0901694 (tracked) … 0.0116059 (tracked) 0.612178 (tracked); 0.791151 (tracked) 0.0901694 (tracked) … 0.0116059 (tracked) 0.612178 (tracked)], Flux.Tracker.TrackedReal{Float64}[1.0 (tracked), 1.0 (tracked), 1.0 (tracked)], Flux.Tracker.TrackedReal{Float64}[3.58823 (tracked), 0.656771 (tracked), 0.988093 (tracked), -0.317437 (tracked), 3.08113 (tracked)])

Now we get gradients for each of the inputs `W`, `b` and `x`, which will come
in handy when we want to train models.
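
For this particular loss the gradients can also be worked out by hand, which explains the pattern in the output: every row of the gradient for `W` is the same, and the gradient for `b` is all ones. The sketch below writes those formulas down and checks one entry numerically with a central finite difference; the names `dW`, `db`, `dx` and the step size `h` are assumptions made for this illustration.

```julia
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

# Gradients worked out by hand for this particular loss:
dW = ones(3) * x'            # every row of the gradient for W equals x
db = ones(3)                 # each bias contributes to the sum exactly once
dx = vec(sum(W, dims = 1))   # column sums of W

# Numerical check of one entry of dW via a central finite difference.
h = 1e-6                     # hypothetical step size
Wp = copy(W); Wp[2, 3] += h
Wm = copy(W); Wm[2, 3] -= h
numeric = (myloss(Wp, b, x) - myloss(Wm, b, x)) / (2h)

println(isapprox(numeric, dW[2, 3]; atol = 1e-6))
```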

Because ML models can contain hundreds of parameters, Flux provides a slightly
different way of writing `gradient`. We instead mark arrays with `param` to
indicate that we want their derivatives.

In [26]:
using Flux.Tracker: param, back!, grad

W = param(randn(3, 5))
b = param(zeros(3))
x = rand(5)

y = sum(W * x .+ b)

0.704990287538036 (tracked)

Anything marked `param` becomes *tracked*, indicating that Flux is keeping an
eye on its gradient. We can now call `back!` on the result:

In [27]:
back!(y) # Run backpropagation

grad(W), grad(b)

([0.498501 0.793961 … 0.968552 0.968621; 0.498501 0.793961 … 0.968552 0.968621; 0.498501 0.793961 … 0.968552 0.968621], [1.0, 1.0, 1.0])

We can now grab the gradients of `W` and `b` directly from those parameters.

This comes in handy when working with *layers*. A layer is just a container
for some parameters. For example, `Dense` does a linear transform for you.

In [28]:
using Flux

m = Dense(10, 5)

x = rand(10)

m(x)

Tracked 5-element Array{Float64,1}:
  0.2116432868503058  
  0.5503366583782249  
  0.4415243106995927  
  0.40653200184721167 
 -0.016813293258632997

In [29]:
m(x) == m.W * x .+ m.b

true
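
To illustrate the idea that a layer is a container for parameters, here is a minimal sketch of a dense-style layer in plain Julia. `MyDense` is a hypothetical stand-in, not Flux's actual definition: it has no activation function and no gradient tracking.

```julia
# A hypothetical stand-in for a dense layer; not Flux's actual definition.
struct MyDense
    W::Matrix{Float64}
    b::Vector{Float64}
end

# Convenience constructor: random weights and zero biases,
# with the sizes given in the same order as `Dense(10, 5)`.
MyDense(nin::Integer, nout::Integer) = MyDense(randn(nout, nin), zeros(nout))

# Make the struct callable so that `layer(x)` applies the linear transform.
(layer::MyDense)(x) = layer.W * x .+ layer.b

layer = MyDense(10, 5)
x = rand(10)

println(layer(x) == layer.W * x .+ layer.b)
```

Making the struct callable is what lets a layer be used like an ordinary function while still carrying its parameters around with it.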

We can easily get the parameters of any layer or model with `params`.

In [30]:
params(m)

2-element Array{Any,1}:
 Flux.Tracker.TrackedReal{Float64}[0.370221 (tracked) 0.242663 (tracked) … 0.478084 (tracked) 0.19404 (tracked); 0.583133 (tracked) -0.0196383 (tracked) … -0.421909 (tracked) -0.590144 (tracked); … ; 0.267234 (tracked) -0.436362 (tracked) … -0.212825 (tracked) -0.207647 (tracked); 0.526019 (tracked) 0.387505 (tracked) … -0.35565 (tracked) -0.345277 (tracked)]
 Flux.Tracker.TrackedReal{Float64}[0.0 (tracked), 0.0 (tracked), 0.0 (tracked), 0.0 (tracked), 0.0 (tracked)]                                                                                                                                                                                                                                                            

This makes it very easy to do backpropagation and get the gradient for all
parameters in a network, even if it has many parameters.

In [31]:
m = Chain(Dense(10, 5, relu), Dense(5, 2), softmax)

l = sum(Flux.crossentropy(m(x), [0.5, 0.5]))
back!(l)

grad.(params(m))

4-element Array{Array{Float64,N} where N,1}:
 [0.000558462 0.000254322 … 0.000641503 0.000918226; 0.0 0.0 … 0.0 0.0; … ; -0.000359224 -0.00016359 … -0.000412639 -0.000590638; 6.01219e-5 2.73794e-5 … 6.90617e-5 9.88526e-5]
 [0.00547642, 0.0, 0.0, -0.00352264, 0.000589571]                                                                                                                               
 [-0.00134023 0.0 … -0.00230052 -0.00559793; 0.00134023 0.0 … 0.00230052 0.00559793]                                                                                            
 [-0.00859058, 0.00859058]                                                                                                                                                      
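
As a rough sketch of what the last two steps of that model compute, here are hand-written versions of softmax and cross-entropy using their standard textbook definitions. The names `my_softmax` and `my_crossentropy` and the input `z` are assumptions for this illustration; Flux's own implementations may differ in details such as numerical stabilisation or averaging over a batch.

```julia
# Hand-written versions for illustration; Flux ships its own implementations.
my_softmax(z) = exp.(z) ./ sum(exp.(z))
my_crossentropy(yhat, y) = -sum(y .* log.(yhat))

z = [1.0, 2.0]             # hypothetical raw outputs of the final Dense layer
yhat = my_softmax(z)       # positive entries that sum to 1

println(yhat)
println(my_crossentropy(yhat, [0.5, 0.5]))
```

Softmax turns arbitrary scores into a probability distribution, and cross-entropy measures how far that distribution is from the target `[0.5, 0.5]`.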

You don't have to use layers, but they can be convenient for many simple kinds
of models.