# Single neuron using Flux.jl

## Read in and process data

In [13]:
using CSV

apples = CSV.read("Apple_Golden_1.dat", delim='\t')
bananas = CSV.read("bananas.dat", delim='\t');

In [14]:
col1 = :red
col2 = :green

x_apples  = [ [apples[i, col1], apples[i, col2]] for i in 1:size(apples)[1] ]
x_bananas = [ [bananas[i, col1], bananas[i, col2]] for i in 1:size(bananas)[1] ]

xs = vcat(x_apples, x_bananas)

ys = vcat( zeros(size(x_apples)[1]), ones(size(x_bananas)[1]) );

The input data is in `xs` and the labels in `y`

## Using Flux.jl

In [6]:
using Flux

[1m[36mINFO: [39m[22m[36mRecompiling stale cache file /Users/dpsanders/.julia/lib/v0.6/Flux.ji for module Flux.
[39m

The function $\sigma$ that we have been using is predefined by Flux:

In [8]:
σ

σ (generic function with 3 methods)

In [9]:
methods(σ)

In [10]:
?σ

"[36mσ[39m" can be typed by [36m\sigma<tab>[39m

search: [1mσ[22m



```
σ(x) = 1 / (1 + exp(-x))
```

Classic [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) activation function.

```
1 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠔⠒⠉⠉⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⣀⠤⠚⠁⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⡤⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⢀⡔⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡔⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⡔⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡠⠜⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⠜⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠜⠁⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡠⠚⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  │⠀⠀⣀⣀⠤⠔⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
0 │⠋⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
  -3                      0                      3
```


We can make a neuron in a simple way:

In [14]:
model = Dense(2, 1, σ)

Dense(2, 1, NNlib.σ)

In [15]:
typeof(model)

Flux.Dense{NNlib.#σ,TrackedArray{…,Array{Float64,2}},TrackedArray{…,Array{Float64,1}}}

We have made an object of type `Dense`, defined by `Flux`. This represents a "dense neural network layer" (see later).
Inside the object live the parameters that we will modify during the learning process:

In [16]:
model.W

Tracked 1×2 Array{Float64,2}:
 1.12843  0.788773

In [17]:
model.b

Tracked 1-element Array{Float64,1}:
 0.0

The fact that `W` and `b` are of size $1 \times 2$ and $1$, respectively, comes from the `(2, 1)` pair in the call to the `Dense` constructor when we created `model`. A "tracked" array is a special type provided by `Flux.jl` that is able to calculate ("track") derivatives via reverse-mode automatic differentiation, usually called **backpropagation** in the context of neural networks. This is more efficient in this context than calculating the derivatives via forward-mode automatic differentiation, as we did previously using the `ForwardDiff.jl` package.

## 

In [53]:
W = rand(1, 2)
b = rand(1)

predict(x) = σ.(W*x + b)
loss(x, y) = sum(abs2, (predict(x) .- y) )

x, y = rand(2), rand(1) 
loss(x, y) 

0.08786885598447765

We will now see how `Flux.jl` facilitates the type of calculations that we have been doing.
To do so, we use the `param` function to define objects that will contain both the values of 
the parameters `W` and `b` *and* the derivatives. These derivatives will be the derivatives of the loss 
function with respect to `W` and `b` that we calculated previously using `ForwardDiff`.

Let's start, as usual, by setting up some random initial values for the parameters:

In [61]:
W_data = rand(1, 2)  
b_data = rand(1)

W_data, b_data

([0.718982 0.414688], [0.692976])

We now set up `Flux.jl` objects that will contain these values *and* their derivatives, and allow to propagate
this information around:

In [62]:
W = param( W_data )
b = param( b_data )

predict(x) = σ.(W*x + b)
loss(x, y) = sum( (predict(x) .- y).^2 )

x, y = rand(2), rand(1) 
l = loss(x, y) 

Tracked 0-dimensional Array{Float64,0}:
0.0443256

In [63]:
fieldnames(W)

4-element Array{Symbol,1}:
 :ref 
 :f   
 :data
 :grad

We see that the data is indeed inside the object:

In [64]:
W.data  # the random 

1×2 Array{Float64,2}:
 0.718982  0.414688

Initially, the derivatives are zero:

In [57]:
W.grad

1×2 Array{Float64,2}:
 0.0  0.0

Having set up the structure, we can now propagate the derivative information backwards 
from the `loss` function to all of the objects that are used to calculate it:

In [65]:
using Flux.Tracker

back!(l)   # backpropagate derivatives of the loss function

In [66]:
W.grad

1×2 Array{Float64,2}:
 -0.00819256  -0.0500179

In [67]:
b.grad

1-element Array{Float64,1}:
 -0.0821254

We can now use this structure to do stochastic gradient descent, just as we did in the previous notebook.

**Exercise:** Implement this.

In [76]:
function stochastic_gradient_descent(loss, xs, ys, W, b, N=1000)

    η = 0.01

    for i in 1:N
        
        which = rand(1:length(xs))  # choose a data point
        
        xx = xs[which]
        yy = ys[which]
        
        l = loss(xx, yy)
        back!(l)
        
        W.data -= η * W.grad
        b.data -= η * b.grad
    
    end
    
    return W, b
    
end
    

stochastic_gradient_descent (generic function with 2 methods)

In [77]:
b

Tracked 1-element Array{Float64,1}:
 0.692976

In [78]:
ys

982-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

In [82]:
W_final, b_final = stochastic_gradient_descent(loss, xs, ys, W, b)

(param([-70.0661 -72.9035]), param([87.9345]))

In [83]:
W_final

Tracked 1×2 Array{Float64,2}:
 -70.0661  -72.9035

In [84]:
b_final

Tracked 1-element Array{Float64,1}:
 87.9345

In [80]:
using Plots; gr()

Plots.GRBackend()

In [81]:
scatter(first.(x_apples), last.(x_apples), m=:cross)
scatter!(first.(x_bananas), last.(x_bananas))

Let's draw the function that the network has learned, together with the data:

In [89]:
heatmap(0:0.01:1, 0:0.01:1, (x,y)->predict([x, y]).data[1])

scatter!(first.(x_apples), last.(x_apples), m=:cross)
scatter!(first.(x_bananas), last.(x_bananas))

TODO: Animation of learning process

## Automation with Flux.jl

We will need to repeat the above process for a lot of different systems.
Fortunately, Flux.jl provides us with tools to automate this.

Firstly, we create the model:

In [15]:
using Flux

In [16]:
model = Dense(2, 1, σ)

Dense(2, 1, NNlib.σ)

In [17]:
model.W

Tracked 1×2 Array{Float64,2}:
 -0.466978  0.775266

In [18]:
model.b

Tracked 1-element Array{Float64,1}:
 0.0

We can use the `model` object just like a function to apply it to data:

In [19]:
model(rand(2))

Tracked 1-element Array{Float64,1}:
 0.555819

Flux has various loss functions built in:

In [20]:
loss(x, y) = Flux.mse(model(x), y)

loss (generic function with 1 method)

In [21]:
data = zip(xs, ys)

Base.Iterators.Zip2{Array{Array{Float64,1},1},Array{Float64,1}}(Array{Float64,1}[[0.708703, 0.641282], [0.648376, 0.553169], [0.647237, 0.553302], [0.647963, 0.55323], [0.647653, 0.554047], [0.648491, 0.553821], [0.647974, 0.554518], [0.649307, 0.554399], [0.648141, 0.554708], [0.64984, 0.555665]  …  [0.524028, 0.452379], [0.523906, 0.452571], [0.523823, 0.4514], [0.522489, 0.449973], [0.517573, 0.444391], [0.515956, 0.441912], [0.517585, 0.444827], [0.510357, 0.436022], [0.508873, 0.43433], [0.528205, 0.440139]], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

In [22]:
collect(data)

982-element Array{Tuple{Array{Float64,1},Float64},1}:
 ([0.708703, 0.641282], 0.0)
 ([0.648376, 0.553169], 0.0)
 ([0.647237, 0.553302], 0.0)
 ([0.647963, 0.55323], 0.0) 
 ([0.647653, 0.554047], 0.0)
 ([0.648491, 0.553821], 0.0)
 ([0.647974, 0.554518], 0.0)
 ([0.649307, 0.554399], 0.0)
 ([0.648141, 0.554708], 0.0)
 ([0.64984, 0.555665], 0.0) 
 ([0.648446, 0.555576], 0.0)
 ([0.709808, 0.632473], 0.0)
 ([0.650164, 0.555766], 0.0)
 ⋮                          
 ([0.52913, 0.44031], 1.0)  
 ([0.528731, 0.456548], 1.0)
 ([0.524028, 0.452379], 1.0)
 ([0.523906, 0.452571], 1.0)
 ([0.523823, 0.4514], 1.0)  
 ([0.522489, 0.449973], 1.0)
 ([0.517573, 0.444391], 1.0)
 ([0.515956, 0.441912], 1.0)
 ([0.517585, 0.444827], 1.0)
 ([0.510357, 0.436022], 1.0)
 ([0.508873, 0.43433], 1.0) 
 ([0.528205, 0.440139], 1.0)

In [27]:
opt = SGD([model.W, model.b], 0.01)
# give a list of the parameters that will be modified

(::#71) (generic function with 1 method)

In [28]:
for i in 1:100
    Flux.train!(loss, data, opt)
end

In [29]:
model.W

Tracked 1×2 Array{Float64,2}:
 -5.98412  -5.17245

In [30]:
model.b

Tracked 1-element Array{Float64,1}:
 7.00206

In [31]:
opt = SGD(params(model), 0.01)

(::#71) (generic function with 1 method)

In [32]:
params(model)

2-element Array{Any,1}:
 param([-5.98412 -5.17245])
 param([7.00206])          