Notebook on Supervised Machine Learning
---

**Flux is a 100% pure Julia library** for deep learning that is becoming increasingly capable.

Most other popular libraries are in Python, though the numerically intensive parts are written in C, C++ or CUDA.

Flux keeps it simple.

The whole library is written in a single, readable language, so there is minimal confusion from boilerplate code that acts as an "interaction module" between two different languages.

This makes Flux an extremely powerful platform to innovate on.

In [None]:
import Flux

Flux internally calls powerful automatic differentiation (AD) libraries such as Zygote and ChainRules, which enable differentiation of largely arbitrary Julia functions.

As an example, let's define a function:

In [None]:
f(x) = 3x^2 + 2x + 1

In [None]:
f(1)

We can now compute the derivative of the function using the *Flux.gradient* function:

In [None]:
df(x) = Flux.gradient(f,x)[1]

In [None]:
df(1)

We can also compute the second derivative of the same function by differentiating *df*:

In [None]:
d2f(x) = Flux.gradient(df,x)[1]

In [None]:
d2f(1)
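
As a hand check on the values above, the derivatives of this polynomial can be worked out analytically:

$$
f(x) = 3x^2 + 2x + 1, \qquad f'(x) = 6x + 2, \qquad f''(x) = 6
$$

so $f(1) = 6$, $f'(1) = 8$ and $f''(1) = 6$, which is what the three calls above should return.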

**Note**: Flux internally imports the *gradient* function from Zygote. So if we need more customized gradient behavior, we can write it directly with Zygote and use it in Flux code; the two work together seamlessly.
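
As a minimal sketch of calling Zygote directly, the cell below uses *Zygote.pullback* and *Zygote.withgradient* on the same polynomial. It assumes Zygote is installed as a direct dependency; the names `f_demo`, `y_demo`, `back`, `dy_demo` and `res_demo` are hypothetical and chosen so that nothing defined elsewhere in the notebook is overwritten.

```julia
using Zygote  # assumption: Zygote is available in the active environment

f_demo(x) = 3x^2 + 2x + 1   # same polynomial as above, under a hypothetical name

# pullback returns the function value and a closure that maps an
# output sensitivity back to the gradients of the inputs.
y_demo, back = Zygote.pullback(f_demo, 1.0)
dy_demo = back(1.0)[1]       # derivative of f_demo at x = 1.0

# withgradient returns the value and the gradient in a single call.
res_demo = Zygote.withgradient(f_demo, 1.0)
```

The pullback form is handy when the forward value is needed alongside the gradient, or when the same forward pass should be reused with several output sensitivities.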

In [None]:
loss(x,W,b) = sum(W*x + b)

x = rand(5)
W = randn(3,5)
b = rand(3)

In [None]:
loss(x,W,b)

Here we have defined a loss function, some "dummy" input x, and parameters W and b.
We can compute the gradients of the loss with respect to the input and the parameters!

In [None]:
sum(W*x + b)

In [None]:
out = Flux.gradient(loss, x, W, b)

In [None]:
typeof(out)

In [None]:
out[1]

In [None]:
out[2]

In [None]:
out[3]

Therefore, the gradient computation returns a Tuple of length 3, holding the derivative of the loss with respect to each argument (x, W and b).
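
For this particular loss the gradients have a simple closed form, which gives a quick sanity check on the AD output. The sketch below rebuilds the same setup under hypothetical `_demo` names and compares each gradient with its analytic counterpart (a vector of ones is written as `ones3`).

```julia
import Flux

loss_demo(x, W, b) = sum(W*x + b)   # same form as the loss above

x_demo = rand(5)
W_demo = randn(3, 5)
b_demo = rand(3)

gx, gW, gb = Flux.gradient(loss_demo, x_demo, W_demo, b_demo)

# Analytic gradients of sum(W*x + b), with 1 denoting a vector of ones:
#   d/dx = W' * 1,   d/dW = 1 * x',   d/db = 1
ones3 = ones(3)
check_x = gx ≈ W_demo' * ones3
check_W = gW ≈ ones3 * x_demo'
check_b = gb ≈ ones3
```

Each gradient has the same shape as the argument it belongs to: `gx` has 5 entries, `gW` is 3 x 5 and `gb` has 3 entries.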

Passing every argument explicitly becomes cumbersome for neural networks with a large number of parameters. Flux lets us abstract this away with the *params* function.

Here, *params* collects the parameter arrays W and b, and they are passed implicitly to the *gradient* call. The result is indexed by the parameter arrays themselves:

In [None]:
model(x) = sum(W*x .+ b)
grads = Flux.gradient(()->model(x), Flux.params([W, b]))

In [None]:
grads[W]

In [None]:
grads[b]
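
An alternative that does not rely on *params* is to differentiate a closure that takes only the parameters as arguments. This is a sketch under hypothetical `_demo` names; recent Flux releases appear to be moving toward this explicit style, so it is worth checking the documentation for the installed version.

```julia
import Flux

x_demo = rand(5)
W_demo = randn(3, 5)
b_demo = rand(3)

# Differentiate only with respect to the parameters; x_demo is captured
# by the closure and therefore treated as a constant.
gW, gb = Flux.gradient((W, b) -> sum(W*x_demo .+ b), W_demo, b_demo)
```

Here `gW` and `gb` play the role of `grads[W]` and `grads[b]` in the implicit version.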

Let's explore the rest of the training workflow. We generate some "dummy" data for the ground truth $\hat{y}$.

In [None]:
ŷ = rand(3)

We first define an optimizer. Flux has several built-in optimizers; here we use the classic gradient descent algorithm, with a learning rate $\alpha = 0.1$ passed as an argument.

More optimizers, such as ADAM and AdaDelta, are available. More details: https://fluxml.ai/Flux.jl/stable/training/optimisers/#Optimiser-Reference

In [None]:
opt = Flux.Descent(0.1)
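
In symbols, plain gradient descent updates each parameter array $\theta$ (here W and b) as

$$
\theta \leftarrow \theta - \alpha \, \nabla_\theta L(\theta)
$$

where $L$ is the loss and $\alpha = 0.1$ is the learning rate passed to *Descent*.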

We now define a loss function manually to demonstrate the flexibility (since scientific ML often uses custom loss functions). However, Flux has its own predefined set of commonly used losses, such as MSE. More details here: https://fluxml.ai/Flux.jl/stable/models/losses/

In [None]:
function loss(x, ŷ)
  y = model(x)
  sum((y .- ŷ).^2)
end
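
For comparison, the same sum-of-squares quantity can be obtained from the predefined *Flux.Losses.mse* by changing its aggregation. The sketch below uses hypothetical vectors `y_pred` and `y_true`; note that *mse* averages by default, so it differs from the hand-written loss by a factor equal to the number of elements.

```julia
import Flux

y_pred = rand(Float32, 3)   # hypothetical prediction
y_true = rand(Float32, 3)   # hypothetical target

# Sum of squared errors, written by hand as in the loss above.
manual_sse = sum((y_pred .- y_true).^2)

# Same quantity from the built-in loss, aggregating with sum instead of mean.
builtin_sse = Flux.Losses.mse(y_pred, y_true; agg=sum)

# Default aggregation is the mean over all elements.
builtin_mse = Flux.Losses.mse(y_pred, y_true)
```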

Finally, we create the supervised-learning training dataset from the pair $x$ and $\hat{y}$ and train with the *Flux.train!* function. Each call of this function makes a single pass over the data (1 epoch), performing one gradient descent update per data point.

We can put this function in a for loop or use the *@epochs* macro, which we will see in a complete example in the MNIST notebook.

In [None]:
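# Wrap the single (input, target) pair in a vector; zip(x, ŷ) would instead pair up individual elements.
data = [(x, ŷ)]

# model is a plain function, so collect the trainable arrays W and b directly.
Flux.train!(loss, Flux.params(W, b), data, opt)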
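
To make the training step less of a black box, here is a hand-written sketch of a multi-epoch gradient descent loop. It is not Flux's actual implementation of *train!*; it uses explicit gradients, hypothetical `_demo` names, and a learning rate of 0.01 (an assumption, smaller than the 0.1 used above) so that the updates stay stable on this toy problem.

```julia
import Flux

x_demo = rand(5)
W_demo = randn(3, 5)
b_demo = rand(3)
y_demo = rand(3)        # hypothetical target
α = 0.01                # assumed learning rate for this sketch

# Same structure as the notebook's model and loss, with parameters passed explicitly.
loss_demo(W, b, x, y) = sum((sum(W*x .+ b) .- y).^2)

data_demo = [(x_demo, y_demo)]   # one (input, target) pair
loss_before = loss_demo(W_demo, b_demo, x_demo, y_demo)

for epoch in 1:5                 # outer loop over epochs
    for (x, y) in data_demo      # inner loop over the dataset
        gW, gb = Flux.gradient((W, b) -> loss_demo(W, b, x, y), W_demo, b_demo)
        W_demo .-= α .* gW       # in-place gradient descent update
        b_demo .-= α .* gb
    end
end

loss_after = loss_demo(W_demo, b_demo, x_demo, y_demo)
```

Comparing `loss_before` with `loss_after` shows whether the chosen learning rate is actually reducing the loss; too large a step can make it grow instead.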

In [None]:
loss(x, ŷ)  # inspect the loss after the update

## Flux Layers

Apart from AD, an integral part of a good ML framework is its set of predefined layers. We look at some of them now; other notebooks demonstrate them in action.

#### Dense layers

The classic, fully-connected neural network layer. Flux also provides ready access to various activation functions.

**Note**: Many of the "primitive" operations are provided by the NNlib.jl library, which is part of the FluxML family and is also 100% pure Julia. It has activation functions, primitive convolutions and other helper functions. Flux uses these to build higher-level operations, such as layers.

In [None]:
layer1 = Flux.Dense(5, 5, tanh)

This layer has 25 weight parameters + 5 bias parameters = 30 trainable parameters in total.
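
The parameter count and the computation performed by a Dense layer can be checked directly. This sketch assumes a Flux version in which the layer fields are named `weight` and `bias` (older releases used `W` and `b`) and in which the `5 => 5` pair form is accepted as the equivalent of `Dense(5, 5, tanh)`; `layer_demo` and `x32` are hypothetical names.

```julia
import Flux

layer_demo = Flux.Dense(5 => 5, tanh)
x32 = rand(Float32, 5)   # Float32 input to match the layer's default element type

# Count trainable parameters from the stored arrays.
n_params = length(layer_demo.weight) + length(layer_demo.bias)

# A Dense layer applies the activation elementwise to an affine map.
manual_out = tanh.(layer_demo.weight * x32 .+ layer_demo.bias)
layer_out = layer_demo(x32)
```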

We can also define a vector of layers, with each element containing a Dense layer and/or activation function

In [None]:
layers = [Flux.Dense(5, 10, Flux.sigmoid), Flux.Dense(10, 2), Flux.softmax]

We can pass the input through each element in turn to compute the layer outputs.

In [None]:
x

In [None]:
layer1(x)

In [None]:
out1 = layers[1](x)

In [None]:
out2 = layers[2](out1)

In [None]:
out3 = layers[3](out2)

Note that this is not the only way to stack multiple layers. In practice, we use Flux's built-in *Chain* utility, which packages the layers so that the output of each one feeds into the next as its input.

In [None]:
m = Flux.Chain(Flux.Dense(5, 10, Flux.sigmoid), Flux.Dense(10, 2), Flux.softmax)

In [None]:
out_chain = m(x)
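
To see that *Chain* is just function composition, the sketch below builds a Chain from an existing vector of layers (so both share the same weights) and compares it with applying the layers one by one. The names `layers_demo`, `chain_demo` and `x32` are hypothetical, and the `in => out` pair form of Dense is assumed to be available.

```julia
import Flux

layers_demo = [Flux.Dense(5 => 10, Flux.sigmoid), Flux.Dense(10 => 2), Flux.softmax]

# Splat the vector so the Chain reuses the very same layer objects.
chain_demo = Flux.Chain(layers_demo...)

x32 = rand(Float32, 5)
manual_out = layers_demo[3](layers_demo[2](layers_demo[1](x32)))
chain_out = chain_demo(x32)
```

Because the last element is a softmax, the two output entries are non-negative and sum to 1.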

#### Convolutional Layers

Flux provides Convolutional Layers where we can set:
* Kernel size 
* Channels 
* Padding 
* Activation function

In [None]:
convlayer = Flux.Conv((3,3), 1 => 1, Flux.relu, pad=1)

For this layer:
* Kernel size = 3 x 3
* Input channels = 1
* Output channels = 1
* Padding on each side of the domain = 1
* Activation function = ReLU

In [None]:
xmatrix = Float32.(randn(10,10,1,1))

In [None]:
out_conv = convlayer(xmatrix)
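
The effect of padding on the output size follows the usual rule $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$ and stride $s$ (1 by default). The sketch below, using hypothetical names, compares a padded and an unpadded 3 x 3 convolution on a 10 x 10 input.

```julia
import Flux

x_img = rand(Float32, 10, 10, 1, 1)   # width x height x channels x batch

conv_same = Flux.Conv((3, 3), 1 => 1, Flux.relu; pad=1)   # padded: spatial size preserved
conv_valid = Flux.Conv((3, 3), 1 => 1, Flux.relu)          # no padding: each side shrinks by k - 1

size_same = size(conv_same(x_img))
size_valid = size(conv_valid(x_img))
```

With `pad=1` the output keeps the 10 x 10 spatial size; without padding it shrinks to 8 x 8.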

Common issue: Julia arrays default to Float64, while Flux layers default to Float32. Mixing the two can cause inefficiencies in training (and Flux will warn you), so make sure to cast inputs explicitly, as done with Float32.(randn(...)) above.
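
Two simple ways to obtain Float32 inputs in plain Julia are sketched below; the variable names are hypothetical.

```julia
x64 = rand(5)                   # Float64 by default
x32_cast = Float32.(x64)        # elementwise conversion of an existing array
x32_direct = rand(Float32, 5)   # generated directly as Float32
```

Generating the data as Float32 from the start avoids allocating an intermediate Float64 array.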