In [1]:
using Pkg; Pkg.activate("/home/dhairyagandhi96/temp/model-zoo/script/.."); Pkg.status();

    Status `~/temp/model-zoo/Project.toml`
  [1520ce14]   AbstractTrees v0.2.1
  [fbb218c0] ↑ BSON v0.2.3 ⇒ v0.2.4
  [54eefc05]   Cascadia v0.4.0
  [8f4d0f93]   Conda v1.3.0
  [864edb3b] ↑ DataStructures v0.17.0 ⇒ v0.17.5
  [31c24e10] ↑ Distributions v0.21.3 ⇒ v0.21.5
  [587475ba]   Flux v0.9.0
  [708ec375]   Gumbo v0.5.1
  [b0807396]   Gym v1.1.3
  [cd3eb016] ↑ HTTP v0.8.6 ⇒ v0.8.7
  [6218d12a]   ImageMagick v0.7.5
  [916415d5]   Images v0.18.0
  [e5e0dc1b]   Juno v0.7.2
  [ca7b5df7]   MFCC v0.3.1
  [dbeba491] + Metalhead v0.4.0 #c4d1eba (https://github.com/FluxML/Metalhead.jl.git)
  [91a5bcdd] ↑ Plots v0.26.3 ⇒ v0.27.0
  [2913bbd2]   StatsBase v0.32.0
  [98b73d46]   Trebuchet v0.1.0
  [8149f6b0] ↑ WAV v1.0.2 ⇒ v1.0.3
  [10745b16]   Statistics 
  [4ec0a83e]   Unicode 


Deep Learning with Flux: A 60 Minute Blitz
=====================

This is a quick intro to [Flux](https://github.com/FluxML/Flux.jl) loosely
based on [PyTorch's
tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).
It introduces basic Julia programming and Flux's automatic differentiation
(AD), and then uses both to build and train a very simple neural network.

Arrays
-------

The starting point for all of our models is the `Array` (sometimes referred to
as a `Tensor` in other frameworks). This is really just a list of numbers,
which might be arranged into a shape like a square. Let's write down an array
with three elements.

In [2]:
x = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

Here's a matrix – a square array with four elements.

In [3]:
x = [1 2; 3 4]

2×2 Array{Int64,2}:
 1  2
 3  4

We often work with arrays of thousands of elements, and don't usually write
them down by hand. Here's how we can create an array of 5×3 = 15 elements,
each a random number from zero to one.

In [4]:
x = rand(5, 3)

5×3 Array{Float64,2}:
 0.314756   0.376296  0.547376
 0.997692   0.428444  0.802106
 0.873177   0.13091   0.937117
 0.234788   0.68169   0.239593
 0.0713249  0.530053  0.698805

There are a few functions like this; try replacing `rand` with `ones`, `zeros`,
or `randn` to see what they do.

By default, Julia stores numbers in a high-precision format called
`Float64`. In ML we often don't need all those digits, and can ask Julia to
work with `Float32` instead. We can even ask for more digits using `BigFloat`.

In [5]:
x = rand(BigFloat, 5, 3)

5×3 Array{BigFloat,2}:
 0.866437  0.905824  0.127161
 0.657515  0.475191  0.912335
 0.292584  0.36587   0.464819
 0.922609  0.681581  0.621058
 0.705246  0.880376  0.897961

In [6]:
x = rand(Float32, 5, 3)

5×3 Array{Float32,2}:
 0.255317  0.556587  0.966125
 0.563024  0.532851  0.217162
 0.850266  0.157827  0.52388 
 0.767827  0.816424  0.787822
 0.952749  0.543761  0.270594
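
To make the trade-off concrete, here is a small sketch that compares the storage and precision of the two formats using only Base Julia. The names `x64`, `x32`, `bytes64` and `bytes32` are illustrative and not part of the original notebook.

```julia
# Illustrative names; only Base Julia is used.
x64 = rand(5, 3)        # Float64 by default
x32 = Float32.(x64)     # convert each element to Float32

# Each Float64 takes 8 bytes and each Float32 takes 4.
bytes64 = sizeof(x64)   # 15 elements * 8 bytes
bytes32 = sizeof(x32)   # 15 elements * 4 bytes

# `eps` gives the gap between 1.0 and the next representable number,
# a rough measure of how many digits each format keeps.
eps(Float64), eps(Float32)
```

Halving the storage is one reason `Float32` is a common choice for ML workloads.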

We can ask the array how many elements it has.

In [7]:
length(x)

15

Or, more specifically, what size it has.

In [8]:
size(x)

(5, 3)

We sometimes want to see some elements of the array on their own.

In [9]:
x

5×3 Array{Float32,2}:
 0.255317  0.556587  0.966125
 0.563024  0.532851  0.217162
 0.850266  0.157827  0.52388 
 0.767827  0.816424  0.787822
 0.952749  0.543761  0.270594

In [10]:
x[2, 3]

0.21716177f0

This gets the element in the second row and the third column. We can also get
every row of the third column.

In [11]:
x[:, 3]

5-element Array{Float32,1}:
 0.96612453
 0.21716177
 0.5238805 
 0.78782177
 0.27059424
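
Because the array above is random, it can be easier to follow indexing on an array whose contents are known. The sketch below uses a hypothetical array `A` holding the numbers 1 to 15 and shows a few more index patterns.

```julia
# Columns are filled first: column 1 is 1:5, column 2 is 6:10, column 3 is 11:15.
A = reshape(1:15, 5, 3)

A[2, 3]        # a single element: row 2, column 3
A[:, 3]        # every row of column 3
A[2, :]        # every column of row 2
A[1:2, 2:3]    # a 2×2 block: rows 1 to 2, columns 2 to 3
```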

We can add arrays, and subtract them, which adds or subtracts each element of
the array.

In [12]:
x + x

5×3 Array{Float32,2}:
 0.510633  1.11317   1.93225 
 1.12605   1.0657    0.434324
 1.70053   0.315655  1.04776 
 1.53565   1.63285   1.57564 
 1.9055    1.08752   0.541188

In [13]:
x - x

5×3 Array{Float32,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

Julia supports a feature called *broadcasting*, using the `.` syntax. This
tiles small arrays (or single numbers) to fill bigger ones.

In [14]:
x .+ 1

5×3 Array{Float32,2}:
 1.25532  1.55659  1.96612
 1.56302  1.53285  1.21716
 1.85027  1.15783  1.52388
 1.76783  1.81642  1.78782
 1.95275  1.54376  1.27059

Here Julia copies the column vector `1:5` into every column of the larger
array.

In [15]:
zeros(5,5) .+ (1:5)

5×5 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0
 2.0  2.0  2.0  2.0  2.0
 3.0  3.0  3.0  3.0  3.0
 4.0  4.0  4.0  4.0  4.0
 5.0  5.0  5.0  5.0  5.0

The `x'` syntax is used to transpose the column `1:5` into an equivalent row,
and Julia copies that row into every row of the larger array.

In [16]:
zeros(5,5) .+ (1:5)'

5×5 Array{Float64,2}:
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0
 1.0  2.0  3.0  4.0  5.0

We can use this to make a times table.

In [17]:
(1:5) .* (1:5)'

5×5 Array{Int64,2}:
 1   2   3   4   5
 2   4   6   8  10
 3   6   9  12  15
 4   8  12  16  20
 5  10  15  20  25
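
One way to read the times table is as a shorthand for a loop over every (row, column) pair. The sketch below, using illustrative names, checks that the broadcast result matches an explicit comprehension, and shows that the dot also works on ordinary function calls.

```julia
rows = 1:5
cols = (1:5)'

table = rows .* cols

# The same table written as an explicit comprehension over (row, column) pairs.
explicit = [i * j for i in 1:5, j in 1:5]

table == explicit

# The dot also applies ordinary functions element by element.
roots = sqrt.([1.0, 4.0, 9.0])
```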

Finally, and importantly for machine learning, we can conveniently do things like
matrix multiply.

In [18]:
W = randn(5, 10)
x = rand(10)
W * x

5-element Array{Float64,1}:
  0.05421135695837431
  0.9946443858299661 
  0.762084141982013  
 -0.41157924083676123
  1.1820316570453544 

Julia's arrays are very powerful, and you can learn more about what they can
do [here](https://docs.julialang.org/en/v1/manual/arrays/).

### CUDA Arrays

CUDA functionality is provided separately by the [CuArrays
package](https://github.com/JuliaGPU/CuArrays.jl). If you have a GPU and CUDA
available, you can run `] add CuArrays` in a REPL or IJulia to get it.

Once CuArrays is loaded you can move any array to the GPU with the `cu`
function, and it supports all of the above operations with the same syntax.

In [19]:
# using CuArrays
# x = cu(rand(5, 3))

Automatic Differentiation
-------------------------

You probably learned to take derivatives in school. We start with a simple
mathematical function like this one:

In [20]:
f(x) = 3x^2 + 2x + 1

f(5)

86

In simple cases it's pretty easy to work out the gradient by hand – here it's
`6x+2`. But it's much easier to make Flux do the work for us!

In [21]:
using Flux.Tracker: gradient

df(x) = gradient(f, x; nest = true)[1]

df(5)

32.0 (tracked)

You can try this with a few different inputs to make sure it's really the same
as `6x+2`. We can even do this multiple times (but the second derivative is a
fairly boring `6`).

In [22]:
ddf(x) = gradient(df, x)[1]

ddf(5)

6.0 (tracked)
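
If you'd like to double-check a derivative without any AD at all, a finite difference is a handy sanity test. This is a sketch using only Base Julia; `numeric_derivative` is a hypothetical helper and not part of Flux.

```julia
f(x) = 3x^2 + 2x + 1

# Central finite difference: approximates the slope from two nearby evaluations.
# `numeric_derivative` is a hypothetical helper, not part of Flux.
numeric_derivative(g, x; h = 1e-5) = (g(x + h) - g(x - h)) / (2h)

approx_df = numeric_derivative(f, 5.0)
exact_df = 6 * 5.0 + 2          # the hand-derived 6x + 2 at x = 5

isapprox(approx_df, exact_df; atol = 1e-4)
```

Finite differences cost extra function evaluations for every input and lose accuracy to rounding, which is part of why AD is preferred for training.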

Flux's AD can handle any Julia code you throw at it, including loops,
recursion and custom layers, so long as the mathematical functions you call
are differentiable. For example, we can differentiate a Taylor approximation
to the `sin` function.

In [23]:
mysin(x) = sum((-1)^k*x^(1+2k)/factorial(1+2k) for k in 0:5)

x = 0.5

mysin(x), gradient(mysin, x)

(0.4794255386041834, (0.8775825618898637 (tracked),))

In [24]:
sin(x), cos(x)

(0.479425538604203, 0.8775825618903728)

You can see that the derivative we calculated is very close to `cos(x)`, as we
expect.

This gets more interesting when we consider functions that take *arrays* as
inputs, rather than just a single number. For example, here's a function that
takes a matrix and two vectors (the definition itself is arbitrary):

In [25]:
using Flux.Tracker: gradient

myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

gradient(myloss, W, b, x)

([0.769483 0.167485 … 0.783535 0.00521952; 0.769483 0.167485 … 0.783535 0.00521952; 0.769483 0.167485 … 0.783535 0.00521952] (tracked), [1.0, 1.0, 1.0] (tracked), [-5.07166, -0.830381, -1.40381, 0.858376, -1.54479] (tracked))

Now we get gradients for each of the inputs `W`, `b` and `x`, which will come
in handy when we want to train models.
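
For this particular loss the gradients can also be written down by hand, which explains the pattern in the output: every row of the `W` gradient is `x`, and the `b` gradient is all ones. The sketch below uses only Base Julia; `dW`, `db`, `dx` and `fd` are illustrative names, and the finite-difference spot check is an added sanity test rather than something Flux does.

```julia
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = rand(5)

# Gradients worked out by hand for this particular loss.
dW = ones(3) * x'     # 3×5: every row is x
db = ones(3)          # each bias entry adds directly to the sum
dx = W' * ones(3)     # the column sums of W

# Spot-check one entry of dW by nudging W[2, 4] and measuring the change in the loss.
h = 1e-6
W_nudged = copy(W)
W_nudged[2, 4] += h
fd = (myloss(W_nudged, b, x) - myloss(W, b, x)) / h

isapprox(fd, dW[2, 4]; atol = 1e-4)
```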

Because ML models can contain hundreds of parameters, Flux provides a slightly
different way of writing `gradient`. We instead mark arrays with `param` to
indicate that we want their derivatives. `W` and `b` represent the weight and
bias respectively.

In [26]:
using Flux.Tracker: param, back!, grad

W = param(randn(3, 5))
b = param(zeros(3))
x = rand(5)

y = sum(W * x .+ b)

-2.4185315794871083 (tracked)

Anything marked `param` becomes *tracked*, indicating that Flux is keeping an
eye on its gradient. We can now call `back!` to run backpropagation.

In [27]:
back!(y) # Run backpropagation

grad(W), grad(b)

([0.175657 0.244909 … 0.461918 0.687567; 0.175657 0.244909 … 0.461918 0.687567; 0.175657 0.244909 … 0.461918 0.687567], [1.0, 1.0, 1.0])

We can now grab the gradients of `W` and `b` directly from those parameters.

This comes in handy when working with *layers*. A layer is just a handy
container for some parameters. For example, `Dense` does a linear transform
for you.

In [28]:
using Flux

m = Dense(10, 5)

x = rand(Float32, 10)

m(x)

Tracked 5-element Array{Float32,1}:
 -1.866909f0   
  1.6068742f0  
 -0.47442514f0 
 -0.078281775f0
  0.7389283f0  

In [29]:
m(x) == m.W * x .+ m.b

true
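
To show that there's no magic in a layer, here is a rough sketch of a dense layer written from scratch in plain Julia. `MyDense` is a hypothetical stand-in and not Flux's implementation: it has no tracked parameters and uses plain `randn` for initialisation.

```julia
# `MyDense` is a hypothetical stand-in, not Flux's implementation.
struct MyDense{M,V,F}
    W::M   # weight matrix, size (out, in)
    b::V   # bias vector, length out
    σ::F   # activation function, applied element-wise
end

# Random weights, zero bias, and the identity activation by default.
MyDense(in::Integer, out::Integer, σ = identity) =
    MyDense(randn(Float32, out, in), zeros(Float32, out), σ)

# Calling the layer applies the affine transform followed by the activation.
(d::MyDense)(x) = d.σ.(d.W * x .+ d.b)

layer = MyDense(10, 5)
x = rand(Float32, 10)

layer(x) == layer.W * x .+ layer.b
```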

We can easily get the parameters of any layer or model with `params`.

In [30]:
params(m)

Params([Float32[0.105143 0.0920902 … 0.434813 -0.625653; 0.0682838 0.42479 … 0.114949 0.146191; … ; 0.428199 -0.577396 … -0.419586 -0.320503; -0.0234319 -0.287929 … -0.347612 0.0374402] (tracked), Float32[0.0, 0.0, 0.0, 0.0, 0.0] (tracked)])

This makes it very easy to do backpropagation and get the gradient for all
parameters in a network, even if it has many parameters.

In [31]:
m = Chain(Dense(10, 5, relu), Dense(5, 2), softmax)

l = sum(Flux.crossentropy(m(x), [0.5, 0.5]))
back!(l)

grad.(params(m))

4-element Array{Array{Float32,N} where N,1}:
 [-0.0287585 -0.0199122 … -0.00242918 -0.0425191; 0.00377813 0.00261594 … 0.000319132 0.00558591; … ; 0.0318442 0.0220487 … 0.00268982 0.0470812; 0.0 0.0 … 0.0 0.0]
 [-0.0483765, 0.00635542, 0.0, 0.0535671, 0.0]                                                                                                                      
 [-0.0347323 -0.013319 … -0.107115 0.0; 0.0347323 0.013319 … 0.107115 0.0]                                                                                          
 [-0.100709, 0.100709]                                                                                                                                              
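
The model above ends in `softmax` and is scored with `crossentropy`. As a rough guide to what those two compute for a single vector, here is a sketch in Base Julia. `my_softmax` and `my_crossentropy` are hypothetical helpers; Flux's own versions may differ in details such as weighting and averaging over a batch.

```julia
# Hypothetical helpers for illustration; Flux ships its own `softmax` and `crossentropy`.
# Subtracting the maximum first keeps `exp` from overflowing without changing the result.
function my_softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

my_crossentropy(predicted, target) = -sum(target .* log.(predicted))

scores = [0.0, 0.0]
probs = my_softmax(scores)                        # equal scores give equal probabilities
loss_value = my_crossentropy(probs, [0.5, 0.5])

loss_value ≈ log(2)
```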

You don't have to use layers, but they can be convenient for many simple kinds
of models and fast iteration.

The next step is to update our weights and perform optimisation. As you may know,
*Gradient Descent* is a simple algorithm that moves the weights against their gradients,
scaled by a learning rate: `weights = weights - learning_rate * gradient`.

In [32]:
using Flux.Tracker: update!

η = 0.1
for p in params(m)
  update!(p, -η * grad(p))
end
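
To watch the update rule in action without a network, we can apply it to the quadratic `f(x) = 3x^2 + 2x + 1` from earlier, using its hand-derived gradient `6x + 2`. This is a sketch in Base Julia; `descend` is a hypothetical helper name.

```julia
f(x) = 3x^2 + 2x + 1
df(x) = 6x + 2          # the derivative worked out by hand

# `descend` is a hypothetical helper: repeat the update rule a fixed number of times.
function descend(x, η, steps)
    for _ in 1:steps
        x -= η * df(x)  # weights = weights - learning_rate * gradient
    end
    return x
end

x_min = descend(5.0, 0.1, 50)

isapprox(x_min, -1/3; atol = 1e-8)   # f is smallest where 6x + 2 = 0
```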

While this is a valid way of updating our weights, it can get more complicated as the
algorithms we use get more involved.

Flux comes with a bunch of pre-defined optimisers and makes writing our own really simple.
We just give it the learning rate η.

In [33]:
opt = Descent(0.01)

Flux.Optimise.Descent(0.01)
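
To give a feel for what an optimiser has to keep track of, here is a sketch of a hand-rolled momentum rule for a single number, in Base Julia. `MyMomentum`, `step!` and `minimise` are hypothetical names and this is not Flux's optimiser interface; it only illustrates the idea of carrying a velocity between steps.

```julia
# `MyMomentum` and `step!` are hypothetical names, not Flux's optimiser API.
mutable struct MyMomentum
    η::Float64   # learning rate
    ρ::Float64   # fraction of the previous velocity to keep
    v::Float64   # running velocity, starts at zero
end

MyMomentum(η, ρ = 0.9) = MyMomentum(η, ρ, 0.0)

# Blend the new gradient into the velocity, then move along the velocity.
function step!(opt::MyMomentum, x, g)
    opt.v = opt.ρ * opt.v - opt.η * g
    return x + opt.v
end

df(x) = 6x + 2   # gradient of 3x^2 + 2x + 1

function minimise(opt, x, steps)
    for _ in 1:steps
        x = step!(opt, x, df(x))
    end
    return x
end

x_min = minimise(MyMomentum(0.01), 5.0, 500)

isapprox(x_min, -1/3; atol = 1e-6)
```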

Training a network reduces down to iterating over a dataset multiple times, performing these
steps in order. Just for a quick implementation, let's train a network that learns to predict
`0.5` for every input of 10 floats. Flux defines the `train!` function to do it for us.

In [34]:
data, labels = rand(10, 100), fill(0.5, 2, 100)
loss(x, y) = sum(Flux.crossentropy(m(x), y))
Flux.train!(loss, params(m), [(data,labels)], opt)

You don't have to use `train!`. In cases where arbitrary logic might be better suited,
you could open up this training loop like so:

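```julia
  for d in training_set # assuming d looks like (data, labels)
    # our super logic
    l = loss(d...)     # forward pass: compute the loss on this batch
    Tracker.back!(l)   # backward pass: accumulate gradients in the tracked parameters
    opt()              # apply the update; assumes `opt` is a closure over the parameters,
                       # as returned by the older form `Momentum(params(m), 0.01)`
  end
```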

Training a Classifier
---------------------

Getting a real classifier to work might help cement the workflow a bit more.
[CIFAR10](url) is a dataset of 50k tiny training images split into 10 classes.

We will do the following steps in order:

* Load CIFAR10 training and test datasets
* Define a Convolutional Neural Network
* Define a loss function
* Train the network on the training data
* Test the network on the test data

## Loading the Dataset

[Metalhead.jl](https://github.com/FluxML/Metalhead.jl) is an excellent package
that has a number of predefined and pretrained computer vision models.
It also has a number of dataloaders that come in handy to load datasets.

In [35]:
using Statistics

using CuArrays

In [36]:
using Flux, Flux.Tracker, Flux.Optimise
using Metalhead, Images
using Metalhead: trainimgs
using Images.ImageCore
using Flux: onehotbatch, onecold
using Base.Iterators: partition

The image will give us an idea of what we are dealing with.
![title](https://pytorch.org/tutorials/_images/cifar10.png)

In [37]:
Metalhead.download(CIFAR10)
X = trainimgs(CIFAR10)
labels = onehotbatch([X[i].ground_truth.class for i in 1:50000],1:10)

10×50000 Flux.OneHotMatrix{Array{Flux.OneHotVector,1}}:
 false  false  false  false  false  …  false  false  false  false  false
 false  false  false  false   true     false  false  false   true   true
 false  false  false  false  false      true  false  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false  false  false   true  false     false  false  false  false  false
 false  false  false  false  false  …  false  false  false  false  false
  true  false  false  false  false     false   true  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false   true   true  false  false     false  false   true  false  false
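
One-hot encoding turns each label into a column that is `true` in exactly one row. The sketch below mimics the idea in Base Julia on three labels; `my_onehot`, `my_onehotbatch` and `my_onecold` are hypothetical helpers, while the cell above uses Flux's own `onehotbatch`, which returns a more compact type.

```julia
# Hypothetical helpers for illustration; the cell above uses Flux's `onehotbatch`.
my_onehot(label, classes) = classes .== label
my_onehotbatch(labels, classes) = reduce(hcat, [my_onehot(l, classes) for l in labels])

# Going back from a one-hot column to its class: find the position of `true`.
my_onecold(y, classes) = [classes[argmax(col)] for col in eachcol(y)]

classes = 1:3
encoded = my_onehotbatch([3, 1, 2], classes)   # 3×3 matrix, one column per label
decoded = my_onecold(encoded, classes)

decoded == [3, 1, 2]
```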

Let's take a look at a few random images from the dataset.

In [38]:
image(x) = x.img # handy for use later
ground_truth(x) = x.ground_truth
image.(X[rand(1:end, 10)])

The images are simply 32×32 matrices of numbers in 3 channels (R, G, B). We can now
arrange them in batches of, say, 1000 and keep a validation set to track our progress.
This process is called minibatch learning, which is a popular method of training
large neural networks. Rather than sending the entire dataset at once, we break it
down into smaller chunks (called minibatches) that are typically chosen at random,
and train only on them. It has been shown to help with escaping
[saddle points](https://en.wikipedia.org/wiki/Saddle_point).

Defining a `getarray` function helps us convert each image into an array of `Float` values.

In [39]:
getarray(X) = float.(permutedims(channelview(X), (2, 3, 1)))
imgs = [getarray(X[i].img) for i in 1:50000]

50000-element Array{Array{Float32,3},1}:
 [0.231373 0.168627 … 0.596078 0.580392; 0.0627451 0.0 … 0.466667 0.478431; … ; 0.705882 0.678431 … 0.380392 0.32549; 0.694118 0.658824 … 0.592157 0.482353]

[0.243137 0.180392 … 0.490196 0.486275; 0.0784314 0.0 … 0.32549 0.341176; … ; 0.545098 0.482353 … 0.243137 0.207843; 0.564706 0.505882 … 0.462745 0.360784]

[0.247059 0.176471 … 0.4 0.403922; 0.0784314 0.0 … 0.196078 0.223529; … ; 0.376471 0.164706 … 0.133333 0.133333; 0.454902 0.368627 … 0.329412 0.282353]                    
 [0.603922 0.494118 … 0.341176 0.309804; 0.54902 0.568627 … 0.301961 0.278431; … ; 0.647059 0.611765 … 0.482353 0.513726; 0.639216 0.619608 … 0.560784 0.560784]

[0.694118 0.537255 … 0.352941 0.317647; 0.627451 0.6 … 0.313726 0.286275; … ; 0.603922 0.596078 … 0.447059 0.47451; 0.580392 0.580392 … 0.52549 0.521569]

[0.733333 0.533333 … 0.278431 0.27451; 0.662745 0.603922 … 0.243137 0.239216; … ; 0.501961 0.509804 … 0.470588 0.513726; 0.470588 0.478431 … 0.556863 0.564
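
The `permutedims` call in `getarray` only reorders the dimensions. The sketch below shows the effect on a random stand-in array, under the assumption that `channelview` hands back the colour channel as the first dimension (3×32×32); the variable names are illustrative.

```julia
# A stand-in for one image in channel-first layout: 3 channels of 32×32 values.
channels_first = rand(Float32, 3, 32, 32)

# Old dimension 2 becomes the first, old dimension 3 the second, old dimension 1 the third.
height_width_channels = permutedims(channels_first, (2, 3, 1))

size(height_width_channels)   # (32, 32, 3)

# The values are the same; only the order of the indices changes.
height_width_channels[4, 7, 2] == channels_first[2, 4, 7]
```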

The first 49k images (in batches of 1000) will be our training set, and the rest is
for validation. `partition` handily breaks down the set we give it into consecutive parts
(of 1000 each in this case). `cat` is a shorthand for concatenating multi-dimensional arrays
along any dimension.

In [40]:
train = gpu.([(cat(imgs[i]..., dims = 4), labels[:,i]) for i in partition(1:49000, 1000)])
valset = 49001:50000
valX = cat(imgs[valset]..., dims = 4) |> gpu
valY = labels[:, valset] |> gpu

10×1000 Flux.OneHotMatrix{CuArrays.CuArray{Flux.OneHotVector,1}}:
 false  false  false  false   true  …  false  false  false  false  false
 false  false  false  false  false     false  false  false   true   true
 false  false  false  false  false      true  false  false  false  false
 false  false  false  false  false     false  false  false  false  false
 false  false   true  false  false     false  false  false  false  false
 false  false  false  false  false  …  false  false  false  false  false
 false  false  false  false  false     false   true  false  false  false
 false  false  false  false  false     false  false  false  false  false
  true  false  false  false  false     false  false  false  false  false
 false   true  false   true  false     false  false   true  false  false
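
The batching above is easier to see at a small scale. This sketch uses ten tiny stand-in "images" and batches of four, with illustrative variable names, to show what `partition` and `cat` each contribute.

```julia
using Base.Iterators: partition

# Ten tiny stand-in "images", each 2×2 with 3 channels.
imgs = [rand(Float32, 2, 2, 3) for _ in 1:10]

# `partition` yields consecutive index ranges; the last one may be shorter.
index_batches = collect(partition(1:10, 4))   # 1:4, 5:8, 9:10

# Stack each group of images along a new fourth dimension.
batches = [cat(imgs[idx]..., dims = 4) for idx in index_batches]

size.(batches)   # [(2, 2, 3, 4), (2, 2, 3, 4), (2, 2, 3, 2)]
```

With 49000 images and batches of 1000 there is no short final batch, since 1000 divides 49000 evenly.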

## Defining the Classifier
--------------------------
Now we can define our Convolutional Neural Network (CNN).

A convolutional neural network is one which defines a kernel and slides it across a matrix
to create an intermediate representation to extract features from. It creates higher order
features as it goes into deeper layers, making it suitable for images, where the structure of
the subject is what will help us determine which class it belongs to.

In [41]:
m = Chain(
  Conv((5,5), 3=>16, relu),
  MaxPool((2,2)),
  Conv((5,5), 16=>8, relu),
  MaxPool((2,2)),
  x -> reshape(x, :, size(x, 4)),
  Dense(200, 120),
  Dense(120, 84),
  Dense(84, 10),
  softmax) |> gpu
# We will use a crossentropy loss and a Momentum optimiser here. Crossentropy is a
# good option when each input belongs to exactly one of multiple classes. Momentum
# accumulates a decaying sum of past gradients (a velocity) and steps along it, which smooths
# out noisy updates and helps prevent us from overshooting our desired destination.
using Flux: crossentropy, Momentum

loss(x, y) = sum(crossentropy(m(x), y))
opt = Momentum(params(m), 0.01)

│   caller = top-level scope at none:0
└ @ Core none:0


#24 (generic function with 1 method)
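
The `200` in `Dense(200, 120)` is the number of values left once the convolution and pooling layers have shrunk the image. The arithmetic can be sketched as follows, assuming stride-1 convolutions without padding and non-overlapping pooling windows; `conv_out` and `pool_out` are hypothetical helpers.

```julia
# Assumes stride-1 convolutions without padding and non-overlapping pooling windows.
conv_out(n, k) = n - k + 1
pool_out(n, p) = n ÷ p

side = 32                     # CIFAR10 images are 32×32
side = conv_out(side, 5)      # 28 after the first 5×5 convolution
side = pool_out(side, 2)      # 14 after 2×2 max pooling
side = conv_out(side, 5)      # 10 after the second 5×5 convolution
side = pool_out(side, 2)      # 5 after 2×2 max pooling

flattened = side * side * 8   # 8 output channels from the last Conv layer
flattened == 200
```

If you change a kernel size or the number of channels, the first `Dense` layer's input size has to change to match.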

We can start writing our train loop where we will keep track of some basic accuracy
numbers about our model. We can define an `accuracy` function for it like so.

In [42]:
accuracy(x, y) = mean(onecold(m(x), 1:10) .== onecold(y, 1:10))

accuracy (generic function with 1 method)
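
The same calculation on a toy example, with three classes and three samples, may make the definition easier to follow. This sketch uses `argmax` from Base in place of `onecold`, and all names are illustrative.

```julia
using Statistics: mean

# Columns are samples, rows are classes (3 classes, 3 samples here).
scores = [0.1 0.7 0.3;
          0.8 0.2 0.3;
          0.1 0.1 0.4]
truth = Bool[0 1 1;
             1 0 0;
             0 0 0]

predicted = [argmax(col) for col in eachcol(scores)]   # [2, 1, 3]
actual = [argmax(col) for col in eachcol(truth)]       # [2, 1, 1]

toy_accuracy = mean(predicted .== actual)              # 2 of 3 correct
```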

## Training
-----------

Training is where we do a bunch of the interesting operations we defined earlier,
and see what our net is capable of. We will loop over the dataset 10 times and
feed the inputs to the neural network and optimise.

In [43]:
epochs = 10

for epoch = 1:epochs
  for d in train
    l = loss(d...)
    back!(l)
    opt()
    @show accuracy(valX, valY)
  end
end

│   caller = ip:0x0
└ @ Core :-1
accuracy(valX, valY) = 0.124
accuracy(valX, valY) = 0.092
accuracy(valX, valY) = 0.136
accuracy(valX, valY) = 0.125
accuracy(valX, valY) = 0.137
accuracy(valX, valY) = 0.135
accuracy(valX, valY) = 0.134
accuracy(valX, valY) = 0.127
accuracy(valX, valY) = 0.134
accuracy(valX, valY) = 0.136
accuracy(valX, valY) = 0.126
accuracy(valX, valY) = 0.104
accuracy(valX, valY) = 0.102
accuracy(valX, valY) = 0.099
accuracy(valX, valY) = 0.1
accuracy(valX, valY) = 0.096
accuracy(valX, valY) = 0.097
accuracy(valX, valY) = 0.095
accuracy(valX, valY) = 0.097
accuracy(valX, valY) = 0.099
accuracy(valX, valY) = 0.101
accuracy(valX, valY) = 0.104
accuracy(valX, valY) = 0.103
accuracy(valX, valY) = 0.104
accuracy(valX, valY) = 0.106
accuracy(valX, valY) = 0.104
accuracy(valX, valY) = 0.102
accuracy(valX, valY) = 0.101
accuracy(valX, valY) = 0.104
accuracy(valX, valY) = 0.111
accuracy(valX, valY) = 0.122
accuracy(valX, valY) = 0.127
accuracy(valX, valY) = 0.134
accuracy(val

Seeing our training routine unfold gives us an idea of how the network learnt.
This is not bad for a small hand-written network, trained for a limited time.

Training on a GPU
-----------------

The `gpu` functions you see sprinkled through this bit of the code tell Flux to move
these entities to an available GPU, and subsequently train on it. No extra faffing
about required! The same bit of code would work on any hardware with some small
annotations like you saw here.

## Testing the Network
----------------------

We have trained the network for 10 passes over the training dataset. But we need to
check if the network has learnt anything at all.

We will check this by predicting the class label that the neural network outputs, and
checking it against the ground-truth. If the prediction is correct, we add the sample
to the list of correct predictions. This will be done on a yet unseen section of data.

Okay, first step. Let us perform the exact same preprocessing on this set, as we did
on our training set.

In [44]:
valset = valimgs(CIFAR10)
valimg = [getarray(valset[i].img) for i in 1:10000]
labels = onehotbatch([valset[i].ground_truth.class for i in 1:10000],1:10)
test = gpu.([(cat(valimg[i]..., dims = 4), labels[:,i]) for i in partition(1:10000, 1000)])

10-element Array{Tuple{CuArrays.CuArray{Float32,4},Flux.OneHotMatrix{CuArrays.CuArray{Flux.OneHotVector,1}}},1}:
 ([0.619608 0.623529 … 0.494118 0.454902; 0.596078 0.592157 … 0.490196 0.466667; … ; 0.239216 0.192157 … 0.113725 0.0784314; 0.211765 0.219608 … 0.133333 0.0823529]

[0.439216 0.435294 … 0.356863 0.333333; 0.439216 0.431373 … 0.356863 0.345098; … ; 0.454902 0.4 … 0.321569 0.25098; 0.419608 0.411765 … 0.329412 0.262745]

[0.192157 0.184314 … 0.141176 0.129412; 0.2 0.156863 … 0.12549 0.133333; … ; 0.658824 0.580392 … 0.494118 0.419608; 0.627451 0.584314 … 0.505882 0.431373]

[0.921569 0.905882 … 0.913726 0.909804; 0.933333 0.921569 … 0.92549 0.921569; … ; 0.321569 0.180392 … 0.72549 0.733333; 0.333333 0.243137 … 0.705882 0.729412]

[0.921569 0.905882 … 0.913726 0.909804; 0.933333 0.921569 … 0.92549 0.921569; … ; 0.376471 0.223529 … 0.784314 0.792157; 0.396078 0.294118 … 0.764706 0.784314]

[0.921569 0.905882 … 0.913726 0.909804; 0.933333 0.921569 … 0.92549 0.921569; … ; 0.3215

Next, display some of the images from the test set.

In [45]:
ids = rand(1:10000, 10)
image.(valset[ids])

The outputs are energies for the 10 classes. The higher the energy for a class, the more the
network thinks that the image is of that class. Every column corresponds to the
output of one image, with the 10 floats in the column being the energies.

Let's see how the model fared.

In [46]:
rand_test = getarray.(image.(valset[ids]))
rand_test = cat(rand_test..., dims = 4) |> gpu
rand_truth = ground_truth.(valset[ids])
m(rand_test)

Tracked 10×10 CuArrays.CuArray{Float32,2}:
 0.216391    0.00148046  0.010142    …  0.398581    0.379376    0.389975   
 0.0594824   0.00220866  0.00426526     0.0114588   0.0724831   0.0778077  
 0.0377306   0.0304957   0.11987        0.113931    0.0928419   0.0253539  
 0.0143691   0.143692    0.242474       0.0347816   0.0174846   0.00640255 
 0.0148713   0.069811    0.0716465      0.135167    0.0662434   0.00440407 
 0.00944025  0.20174     0.279126    …  0.0443433   0.012111    0.0096757  
 0.00504547  0.449163    0.188084       0.00959472  0.00825641  0.000422212
 0.00168904  0.0777448   0.0549062      0.0836554   0.0155652   0.00183072 
 0.609131    0.00122658  0.0123387      0.126563    0.260358    0.446175   
 0.0318502   0.0224376   0.0171479      0.0419248   0.0752801   0.0379529  

This looks similar to what we would expect. At this point, it's a good
idea to see how our net actually performs on the new data that we have prepared.

In [47]:
accuracy(test[1]...)

0.392

This is much better than random chance set at 10% (since we only have 10 classes), and
not bad at all for a small hand-written network like ours.

Let's take a look at how the net performed on each class individually.

In [48]:
class_correct = zeros(10)
class_total = zeros(10)
for i in 1:10
  preds = m(test[i][1])
  lab = test[i][2]
  for j = 1:1000
    pred_class = findmax(preds[:, j])[2]
    actual_class = findmax(lab[:, j])[2]
    if pred_class == actual_class
      class_correct[pred_class] += 1
    end
    class_total[actual_class] += 1
  end
end

class_correct ./ class_total

10-element Array{Float64,1}:
 0.423
 0.496
 0.16 
 0.095
 0.312
 0.446
 0.603
 0.419
 0.494
 0.457

The spread seems pretty good, with certain classes performing significantly better than the others.
Why should that be?