Notebook on Supervised Machine Learning
---
https://fluxml.ai/Flux.jl/stable/

**Flux is a 100% pure Julia library** for deep learning, and it is increasingly capable. Most other popular libraries are in Python, though their numerically intensive parts are written in C, C++, or CUDA.

Flux keeps it simple. The whole library is written in a single, readable language, with minimal confusion from boilerplate code that acts as an "interaction module" between two different languages. This makes Flux an extremely powerful platform to innovate on.

In [1]:
import Flux

Flux internally calls powerful automatic differentiation (AD) libraries such as Zygote and ChainRules, which can differentiate general Julia functions.

As an example, let's define a function:

In [2]:
f(x) = 3x^2 + 2x + 1

f (generic function with 1 method)

In [3]:
f(1)

6

We can now compute the derivative of the function using the Flux.gradient function:

In [4]:
df(x) = Flux.gradient(f,x)[1]

df (generic function with 1 method)

In [5]:
df(1)

8.0

We can also compute a 2nd derivative of the same function:

In [6]:
d2f(x) = Flux.gradient(df,x)[1]

d2f (generic function with 1 method)

In [7]:
d2f(1)

6.0
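
As a quick sanity check on the AD results above, the derivatives of this polynomial can be worked out by hand:

$$
f(x) = 3x^2 + 2x + 1, \qquad f'(x) = 6x + 2, \qquad f''(x) = 6
$$

so $f'(1) = 8$ and $f''(1) = 6$, which matches the values returned by df(1) and d2f(1).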

**Note**: Flux internally imports the *gradient* function from Zygote. So if we want more customized gradient behavior, we can write it directly in Zygote and use it in Flux code; the two work together seamlessly.

In [8]:
loss(x,W,b) = sum(W*x + b)

x = rand(5);
W = randn(3,5);
b = rand(3);

In [9]:
loss(x,W,b)

0.7787114970970999

Here we have defined a loss function and some "dummy" input x, with parameters W and b.
We can compute the gradients of the loss with respect to the input and the parameters!

In [10]:
W*x + b

3-element Vector{Float64}:
  0.6019391103022558
 -0.6210176395722369
  0.797790026367081

In [11]:
out = Flux.gradient(loss, x, W, b)

([-0.7508029092587667, 1.5712926425104774, -2.1883718522309934, -0.07934216680197315, 1.8424321979533858], [0.9635529035428061 0.4723334059313018 … 0.2709275027856839 0.4229057430533192; 0.9635529035428061 0.4723334059313018 … 0.2709275027856839 0.4229057430533192; 0.9635529035428061 0.4723334059313018 … 0.2709275027856839 0.4229057430533192], 3-element Fill{Float64}: entries equal to 1.0)

In [12]:
typeof(out)

Tuple{Vector{Float64}, Matrix{Float64}, FillArrays.Fill{Float64, 1, Tuple{Base.OneTo{Int64}}}}

In [13]:
out[1]

5-element Vector{Float64}:
 -0.7508029092587667
  1.5712926425104774
 -2.1883718522309934
 -0.07934216680197315
  1.8424321979533858

In [14]:
out[2]

3×5 Matrix{Float64}:
 0.963553  0.472333  0.600576  0.270928  0.422906
 0.963553  0.472333  0.600576  0.270928  0.422906
 0.963553  0.472333  0.600576  0.270928  0.422906

In [15]:
out[3]

3-element Fill{Float64}: entries equal to 1.0

Therefore, the gradient computation returns a Tuple of length 3, holding the derivative of the loss with respect to each argument (x, W, and b).
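
The printed gradients have a simple closed form. For a loss of the form sum(W*x + b), the derivatives are $\partial L/\partial x = W^\top \mathbf{1}$, $\partial L/\partial W = \mathbf{1}\,x^\top$ and $\partial L/\partial b = \mathbf{1}$, which is why every row of the W-gradient equals x and the b-gradient is all ones. The sketch below checks this numerically; the names linloss, x0, W0 and b0 are hypothetical and chosen only to avoid overwriting the notebook's variables.

```julia
import Flux

# Same form of loss as above, under a different (hypothetical) name
linloss(x, W, b) = sum(W * x + b)

x0 = rand(5)
W0 = randn(3, 5)
b0 = rand(3)

gx, gW, gb = Flux.gradient(linloss, x0, W0, b0)

# Closed-form gradients for comparison
gx_expected = W0' * ones(3)   # column sums of W0
gW_expected = ones(3) * x0'   # every row is x0
gb_expected = ones(3)

println(gx ≈ gx_expected, " ", gW ≈ gW_expected, " ", gb ≈ gb_expected)
```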

Since the arguments are passed explicitly, this becomes cumbersome for neural networks with a large number of parameters. Flux allows us to abstract this away with the *params* function.

Here, *params* collects the parameters of a model and passes them implicitly to the *gradient* call:

In [16]:
model(x) = sum(W*x .+ b)
grads = Flux.gradient(()->model(x), Flux.params([W, b]))

Grads(...)

In [17]:
grads[W]

3×5 Matrix{Float64}:
 0.963553  0.472333  0.600576  0.270928  0.422906
 0.963553  0.472333  0.600576  0.270928  0.422906
 0.963553  0.472333  0.600576  0.270928  0.422906

In [18]:
grads[b]

3-element Fill{Float64}: entries equal to 1.0
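
For comparison, the same gradients can be obtained without implicit parameters by bundling the trainable arrays into a single container and differentiating with respect to it. This is only a sketch of an alternative style, not the approach used in this notebook; the NamedTuple p and the names W1, b1 and x1 are assumptions made for illustration.

```julia
import Flux

x1 = rand(5)
W1 = randn(3, 5)
b1 = rand(3)

# Bundle the trainable arrays into one NamedTuple
p = (W = W1, b = b1)

# Differentiate with respect to the whole bundle; the result mirrors its structure
g = Flux.gradient(q -> sum(q.W * x1 .+ q.b), p)[1]

println(size(g.W))   # gradient for W, same shape as W1
println(length(g.b)) # gradient for b, same length as b1
```

The gradient is then indexed by field name (g.W, g.b) instead of by the array itself.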

Let's explore the rest of the training workflow. We generate some "dummy" data for the ground truth $\hat{y}$.

In [19]:
ŷ = rand(3)

3-element Vector{Float64}:
 0.88167292372511
 0.708295264149122
 0.6385495471693718

We first define an optimizer. Flux has several built-in optimizers; here, we use the classic gradient descent algorithm, with a learning rate $\alpha = 0.1$ passed as an argument.

More optimizers, such as ADAM and AdaDelta, are available. More details: https://fluxml.ai/Flux.jl/stable/training/optimisers/#Optimiser-Reference

In [32]:
opt = Flux.Descent(0.1)

Flux.Optimise.Descent(0.1)
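
Conceptually, plain gradient descent moves each parameter against its gradient, $\theta \leftarrow \theta - \alpha \, \nabla_\theta L$. The following sketch spells out one such step in plain Julia with made-up numbers; θ, g and α are hypothetical values for illustration and do not come from the model above.

```julia
α = 0.1              # learning rate, same value as passed to the optimizer above
θ = [1.0, -2.0]      # hypothetical parameter vector
g = [0.5, -1.0]      # hypothetical gradient of the loss at θ

# One gradient descent step, updating θ in place
θ .-= α .* g

println(θ)           # ≈ [0.95, -1.9]
```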

We now define a loss function manually to demonstrate the flexibility (scientific ML often uses custom loss functions). Flux also ships its own predefined set of commonly used losses, such as MSE. More details here: https://fluxml.ai/Flux.jl/stable/models/losses/

In [33]:
function loss(x, ŷ)
  y = model(x)
  sum((y .- ŷ).^2)
end

loss (generic function with 2 methods)
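
The hand-written loss is a sum of squared errors, which differs from a mean squared error only by a factor of the number of elements. A small sketch with made-up vectors illustrates the relation, assuming the Flux.Losses.mse function with its default mean aggregation; y_pred and y_true are hypothetical names.

```julia
import Flux

y_pred = [1.0, 2.0, 3.0]   # hypothetical predictions
y_true = [1.5, 2.0, 2.0]   # hypothetical targets

# Sum of squared errors, as in the custom loss: 0.25 + 0.0 + 1.0
sse = sum((y_pred .- y_true) .^ 2)

# Built-in mean squared error averages over the elements
mse = Flux.Losses.mse(y_pred, y_true)

println(sse, " ", mse)
```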

Finally, we create the supervised learning training dataset pair with $x$ and $\hat{y}$ and train with the *Flux.train!* function. Each call of this function makes a single pass over the data (1 epoch), taking one optimizer step per data point.

We can put this function in a for loop or use the *@epochs* macro, which we will see in a complete example in the MNIST notebook.

In [39]:
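# One training sample: the full input vector x paired with its target ŷ
data = [(x, ŷ)]

# model is a plain function, so the trainable arrays W and b are collected directly
Flux.train!(loss, Flux.params(W, b), data, opt)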

In [37]:
training_outputs
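
To make the training step less of a black box, the same kind of update can be written out by hand with explicit gradients. This is a minimal sketch under stated assumptions, not a reproduction of what Flux.train! does internally: the names predict, sqloss and train_steps! are hypothetical, fresh arrays are used instead of the notebook's W and b, and the model here returns the vector W*x .+ b rather than its sum.

```julia
import Flux

predict(W, b, x) = W * x .+ b
sqloss(W, b, x, y) = sum((predict(W, b, x) .- y) .^ 2)

# Run a fixed number of plain gradient descent steps, updating W and b in place
function train_steps!(W, b, x, y; α = 0.1, steps = 50)
    for _ in 1:steps
        gW, gb = Flux.gradient((W_, b_) -> sqloss(W_, b_, x, y), W, b)
        W .-= α .* gW
        b .-= α .* gb
    end
    return W, b
end

xs = rand(5)
ys = rand(3)
Wt = randn(3, 5)
bt = rand(3)

loss_before = sqloss(Wt, bt, xs, ys)
train_steps!(Wt, bt, xs, ys)
loss_after = sqloss(Wt, bt, xs, ys)

println(loss_before, " -> ", loss_after)
```

With a single sample and this step size the squared error shrinks at every step, so the printed loss should drop towards zero.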

## Flux Layers

Apart from AD, an integral part of a good ML framework is its set of predefined layers. We look at some of them now; other notebooks demonstrate them in action.

#### Dense layers

The classic, fully-connected neural network layer. Flux also provides ready access to various activation functions.

**Note**: Many of the "primitive" operations are provided by the NNlib.jl library, which is part of the FluxML family and is also 100% pure Julia. It has activation functions, primitive convolutions, and other helper functions. Flux uses these to build higher-level operations, such as layers.

In [24]:
layer1 = Flux.Dense(5, 5, tanh)

Dense(5, 5, tanh)   # 30 parameters

25 Weight params + 5 bias params = 30 total trainable parameters
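
Under the hood, a Dense layer applies an affine map followed by an elementwise activation, i.e. it computes σ.(W*x .+ b). The sketch below checks this by building a layer from explicitly supplied arrays; Wd, bd and xd are hypothetical names, and passing the weight matrix and bias directly to Flux.Dense is used here only so that the arrays are known in advance.

```julia
import Flux

Wd = randn(Float32, 5, 5)   # weight matrix: 5 outputs × 5 inputs
bd = randn(Float32, 5)      # bias vector: one entry per output
dense = Flux.Dense(Wd, bd, tanh)

xd = rand(Float32, 5)

# The layer output should equal the hand-written affine map plus activation
by_hand = tanh.(Wd * xd .+ bd)
println(dense(xd) ≈ by_hand)

# Parameter count: out × in weights plus out biases
nparams = length(Wd) + length(bd)
println(nparams)
```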

We can also define a vector of layers, with each element containing a Dense layer and/or activation function

In [25]:
layers = [Flux.Dense(5, 10, Flux.sigmoid), Flux.Dense(10, 2), Flux.softmax]

3-element Vector{Any}:
 Dense(5, 10, σ)     # 60 parameters
 Dense(10, 2)        # 22 parameters
 softmax (generic function with 2 methods)

We can pass an input through each element in turn to compute the layer outputs.

In [31]:
x

5-element Vector{Float64}:
 0.9635529035428061
 0.4723334059313018
 0.6005761070834574
 0.2709275027856839
 0.4229057430533192

In [26]:
layer1(x)

5-element Vector{Float64}:
  0.27753073347154505
  0.7792758399252536
 -0.26706347104574846
  0.6301648124914196
 -0.15858799843246946

In [27]:
out1 = layers[1](x)

10-element Vector{Float64}:
 0.5163512907362315
 0.35190944212016106
 0.3671830367892445
 0.4487145595513561
 0.3977774532545742
 0.7153587368407578
 0.50402188470735
 0.6417132227112197
 0.5717857808149603
 0.691745862998335

In [28]:
out2 = layers[2](out1)

2-element Vector{Float64}:
 -0.5806927900448918
 -1.1576732660403926

In [29]:
out3 = layers[3](out2)

2-element Vector{Float64}:
 0.6403723183519632
 0.3596276816480369

Note that this is not the only way to stack multiple layers. In practice, we use Flux's built-in *Chain* utility, which packages these layers so that the output of each layer feeds into the next as input.

In [30]:
m = Flux.Chain(Flux.Dense(5, 10, Flux.sigmoid), Flux.Dense(10, 2), Flux.softmax)

Chain(
  Dense(5, 10, σ),                      # 60 parameters
  Dense(10, 2),                         # 22 parameters
  NNlib.softmax,
)                   # Total: 4 arrays, 82 parameters, 584 bytes.

In [105]:
out_chain = m(x)

2-element Vector{Float64}:
 0.08627275302705718
 0.9137272469729427

#### Convolutional Layers

Flux provides convolutional layers, where we can set a) kernel size, b) channels, c) padding, and d) activation function.

In [108]:
convlayer = Flux.Conv((3,3), 1 => 1, Flux.relu; pad=1);

Kernel size = 3 x 3

Input channels = 1

Output channels = 1

Padding = 1 (applied on every side of the input)

Activation function = ReLU

In [106]:
xmatrix = Float32.(randn(10,10,1,1))

10×10×1×1 Array{Float32, 4}:
[:, :, 1, 1] =
 -0.947454     1.16128     0.344419   …  -0.844508  -1.70192    0.22581
  0.00169597   1.42804    -2.00583        1.26896   -1.06014    2.82868
 -1.44         1.65566     0.716816      -1.58284    0.617037   0.0237813
  0.993203    -0.0195919  -1.71427       -0.406362   2.06349    0.195197
 -0.929576    -1.15719    -0.116402      -0.459073   0.961306  -0.673349
  0.201978     0.524269   -0.842834   …  -0.536494   0.697359   0.313987
  0.646302    -1.63392    -0.0548294     -0.162385  -0.287026  -1.63558
  0.433773     0.804713    0.814234      -0.241869  -2.66339    0.836067
  0.447325     0.126156   -0.140827       0.47284   -1.68723    0.604225
 -0.269321     2.09095     0.528654      -0.777938  -0.173061  -0.229951

In [107]:
out_conv = convlayer(xmatrix)

10×10×1×1 Array{Float32, 4}:
[:, :, 1, 1] =
 0.0       0.554095  0.0        0.714402   …  0.303955   0.0       0.134419
 0.482107  0.0       0.0        1.69384       0.0        0.509701  0.0
 1.06477   0.0       0.979015   0.0           0.0        1.99877   0.0
 1.5495    0.165541  1.84549    0.0           0.0        0.0       0.0
 0.0       0.319931  0.289391   0.0           0.0942586  0.0       0.36402
 0.0       0.0       0.0220556  1.18722    …  1.54803    0.47      0.879977
 0.0       0.0       0.0        0.0           3.16579    0.557138  2.14885
 0.0       0.0       0.0        0.0           1.72129    0.0       1.30886
 0.0       0.0       0.2324     0.0           0.0        1.31942   0.0
 0.397504  0.0       0.0        0.0361162     0.0        0.860682  0.0
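
The output keeps the 10×10 spatial size because of the padding. For input size $n$, kernel size $k$, padding $p$ and stride $s$, the usual output-size formula is $\lfloor (n + 2p - k)/s \rfloor + 1$, which gives $(10 + 2 - 3)/1 + 1 = 10$ here. The sketch below checks this on a fresh layer; outsize, conv_demo and xdemo are hypothetical names introduced for illustration.

```julia
import Flux

# Standard output-size formula for one spatial dimension
outsize(n, k, p, s) = fld(n + 2p - k, s) + 1

conv_demo = Flux.Conv((3, 3), 1 => 1, Flux.relu; pad = 1)
xdemo = randn(Float32, 10, 10, 1, 1)   # width × height × channels × batch
ydemo = conv_demo(xdemo)

println(size(ydemo))
println(outsize(10, 3, 1, 1))
```

Because the activation is ReLU, every entry of the output is non-negative, which is why the printed out_conv contains so many exact zeros.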

Common issue: Julia arrays default to Float64, while Flux layers default to Float32. This mismatch can cause inefficiencies in training (and Flux will warn you). Make sure to cast the input type explicitly, as done for xmatrix above with Float32.(...).
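
A short sketch of the two common ways to get Float32 inputs: converting an existing array, or generating Float32 values directly. The names x64, x32, xdirect and small are hypothetical.

```julia
import Flux

x64 = rand(5)                 # Julia default: Float64
x32 = Float32.(x64)           # elementwise conversion of an existing array
xdirect = rand(Float32, 5)    # or sample Float32 values from the start

println(eltype(x64), " ", eltype(x32), " ", eltype(xdirect))

# With Float32 weights and Float32 input, the output stays Float32
small = Flux.Dense(randn(Float32, 3, 5), zeros(Float32, 3))
println(eltype(small(x32)))
```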