# Julia Things

### Environment

First things first. Let us set up the environment with the required packages for this notebook:

In [3]:
for p in ("Knet", "Plots", "Plotly.jl")
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using Knet, Plots
gr()

Knet.gpu(0); # set the desired GPU to use
atype = Array{Float32}; # atype = KnetArray{Float32} for gpu usage, Array{Float32} for cpu. 

println("OS: ", Sys.KERNEL)
println("Julia: ", VERSION)
println("Knet: ", Pkg.installed("Knet"))
println("GPU: ", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))

OS: Linux
Julia: 0.6.0
Knet: 0.8.5+
GPU: NVS 310
TITAN X (Pascal)



### New Stuff

In this notebook we introduce the following Julia/Knet packages and functions:

* Julia's plotting macro [@layout](https://github.com/JuliaPlots/PlotDocs.jl/blob/master/docs/src/layouts.md): The layout macro can be used to create an animation with subplots. Use the `layout` keyword, and optionally the convenient `@layout` macro, to generate arbitrarily complex subplot layouts.
* Knet's function [minibatch](http://denizyuret.github.io/Knet.jl/latest/reference.html#Knet.minibatch): A convenient function used to rearrange the data into chunks, sort the data, and create a new data type `MB` (minibatched data), which has the fields `x`, `y`, `batchsize`, `length`, `partial`, `indices`, `xsize`, `ysize`, `xtype`, and `ytype`. As we will see, the data type `MB` can be conveniently used with other Knet functions.

# Linear regression

Powerful ML libraries can eliminate repetitive work, but if you rely too much on abstractions, you might never learn how neural networks really work under the hood. So for this first example, let's get our hands dirty and build everything from scratch, relying only on Knet's `grad` and `KnetArray`. First, we'll import the same dependencies as in the [autograd chapter](../chapter01_crashcourse/autograd.ipynb).

## Linear regression

To get our feet wet, we'll start off by looking at the problem of regression.
This is the task of predicting a *real valued target* $y\in\mathbb{R}$ given a data point $x\in\mathbb{R}^d$.
In linear regression, the simplest and still perhaps the most useful approach,
we assume that prediction can be expressed as a *linear* combination of the input features 
(thus giving the name *linear* regression)

$$\hat{y} = w_1 \cdot x_1 + ... + w_d \cdot x_d + b$$

More generally, given a collection of data points $X$, where $X$ is an $n\times d$ matrix, and corresponding target values $\boldsymbol{y}\in\mathbb{R}^n$, 
we'll try to find the *weight* vector $\boldsymbol{w}\in\mathbb{R}^d$ and bias term $b\in\mathbb{R}$ 
(also called an *offset* or *intercept*)
that approximately associate data points $x_i\in\mathbb{R}^d$ with their corresponding labels $y_i\in\mathbb{R}$. 
Using slightly more advanced math notation, we can express the predictions $\boldsymbol{\hat{y}}$
corresponding to a collection of datapoints $X$ via the matrix-vector product:

$$\boldsymbol{\hat{y}} = X \boldsymbol{w} + b$$
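
As a minimal sketch of the formula above (plain CPU arrays, with toy values chosen by hand purely for illustration), the matrix-vector form and the per-example scalar form give the same predictions:

```julia
# Toy inputs: n = 3 examples, d = 2 features (hand-picked, illustrative values).
X = [1.0 2.0;
     0.0 1.0;
     3.0 0.5]
w = [2.0, -3.4]   # weight vector, one entry per feature
b = 4.2           # scalar bias, broadcast across all examples

# Matrix-vector product plus broadcast bias.
yhat = X * w .+ b

# The same predictions computed one example at a time, as in the scalar formula.
yhat_loop = [sum(w[j] * X[i, j] for j in 1:size(X, 2)) + b for i in 1:size(X, 1)]
```

Both `yhat` and `yhat_loop` hold one prediction per row of `X`.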


Before we can get going, we will need two more things:

* Some way to measure the quality of the current model  
* Some way to manipulate the model to improve its quality

### Square loss

In order to say whether we've done a good job, 
we need some way to measure the quality of a model. 
Generally, we will define a *loss function*
that says *how far* our predictions are from the correct answers.
For the classical case of linear regression, 
we usually focus on the squared error.
Specifically, our loss will be the sum, over all examples, of the squared error $(\hat{y}_i-y_i)^2$ on each:

$$\ell(y, \hat{y}) = \sum_{i=1}^n (\hat{y}_i-y_i)^2.$$


For one-dimensional data, we can easily visualize the relationship between our single feature and the target variable. It's also easy to visualize a linear predictor and its error on each example. 
Note that squared loss *heavily penalizes outliers*. For the visualized predictor below, the lone outlier would contribute most of the loss.

![](../img/linear-regression.png)
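
The effect of an outlier on the squared loss can also be checked numerically. The following is only a sketch: the targets and predictions are made-up numbers, and `squared_loss` is a hypothetical helper name, not part of Knet.

```julia
# Sum of squared errors, matching the loss formula above.
squared_loss(y, yhat) = sum(abs2, yhat .- y)

y    = [1.0, 2.0, 3.0, 4.0]   # hypothetical targets
yhat = [1.1, 1.9, 3.2, 9.0]   # hypothetical predictions; the last one is far off

per_example   = abs2.(yhat .- y)            # squared error of each example
total         = squared_loss(y, yhat)       # loss over all four examples
outlier_share = per_example[end] / total    # fraction of the loss due to the outlier
```

Here a single badly predicted example accounts for nearly all of the loss, because its error of 5 is squared while the other errors are at most 0.2.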

### Manipulating the model

For us to minimize the error,
we need some mechanism to alter the model.
We do this by choosing values of the *parameters*
$\boldsymbol{w}$ and $b$.
This is the only job of the learning algorithm:
it takes training data ($X$, $\boldsymbol{y}$) and the functional form of the model $\boldsymbol{\hat{y}} = X\boldsymbol{w} + b$.
Learning then consists of choosing the best possible $\boldsymbol{w}$ and $b$ based on the available evidence.

### Historical note

You might reasonably point out that linear regression is a classical statistical model.
[According to Wikipedia](https://en.wikipedia.org/wiki/Regression_analysis#History), 
Legendre first developed the method of least squares regression in 1805,
which was shortly thereafter rediscovered by Gauss in 1809. 
Presumably, Legendre, who had Tweeted about the paper several times,
was peeved that Gauss failed to cite his arXiv preprint. 

![Legendre](../img/legendre.jpeg)

Matters of provenance aside, you might wonder - if Legendre and Gauss 
worked on linear regression, does that mean they were the original deep learning researchers?
And if linear regression doesn't wholly belong to deep learning, 
then why are we presenting a linear model 
as the first example in a tutorial series on neural networks? 
Well it turns out that we can express linear regression 
as the simplest possible (useful) neural network. 
A neural network is just a collection of nodes (aka neurons) connected by directed edges. 
In most networks, we arrange the nodes into layers with each feeding its output into the layer above. 
To calculate the value of any node, we first perform a weighted sum of the inputs (according to weights ``w``) 
and then apply an *activation function*. 
For linear regression, we only have two layers, one corresponding to the input (depicted in orange) 
and a one-node layer (depicted in green) corresponding to the output.
For the output node the activation function is just the identity function.

![](../img/onelayer.png)
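
To make the "weighted sum, then activation" description concrete, here is a small sketch of a single output node. The names `node` and `identity_activation` and the input values are illustrative assumptions, not Knet functions.

```julia
# The output node of linear regression applies no nonlinearity.
identity_activation(z) = z

# One output node: weighted sum of the inputs plus a bias, then the activation.
node(x, w, b; activation=identity_activation) = activation(sum(w .* x) + b)

x = [0.5, -1.0]   # one data point with two input features (hypothetical values)
w = [2.0, -3.4]   # edge weights from the two input nodes to the output node
b = 4.2           # bias of the output node

out = node(x, w, b)   # equals w[1]*x[1] + w[2]*x[2] + b
```

With the identity activation, the node's output is exactly the linear regression prediction for `x`; swapping in a different `activation` would turn the same node into a nonlinear unit.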

While you certainly don't have to view linear regression through the lens of deep learning, 
you can (and we will!).
To ground the concepts that we just discussed in code, 
let's actually code up a neural network for linear regression from scratch.

To get going, we will generate a simple synthetic dataset by sampling random data points ``X[i, :]`` and corresponding labels ``y[i]`` in the following manner. Each input is sampled from a normal distribution with mean $0$ and variance $1$. Our features will be independent. Another way of saying this is that they will have diagonal covariance. The labels will be generated according to the *true* labeling function `y[i] = 2 * X[i, 1] - 3.4 * X[i, 2] + 4.2 + noise`, where the noise is drawn from a Gaussian with mean ``0`` and variance ``.01`` (standard deviation ``0.1``). We could express the labeling function in mathematical notation as:
$$y = X \cdot w + b + \eta, \quad \text{for } \eta \sim \mathcal{N}(0,\sigma^2)$$ 

In [2]:
# set-up
Knet.gpu(0); # set GPU 
atype = KnetArray; # can be changed to Array for CPU computations
srand(1);

In [3]:
num_inputs   = 2
num_outputs  = 1
num_examples = 10000

real_fn(X) = 2X[:, 1] - 3.4X[:, 2] + 4.2

X, noise = map(atype, [randn(num_examples, num_inputs), 0.1*randn(num_examples, 1)]);
y = real_fn(X) .+ noise;

display(X[1, :]), display(y[1]);

2-element Knet.KnetArray{Float64,1}:
 0.297288 
 0.0721514

4.537006989437179

Notice that each row in ``X`` consists of a 2-dimensional data point and that each row in ``y`` consists of a 1-dimensional target value. Also notice that because our synthetic features `X` are of type `atype` (on GPU 0) and because our noise is also of type `atype`, the labels `y`, produced by combining `X` and `noise` in `real_fn`, are also of type `atype`. We can confirm that for any randomly chosen point, a linear combination with the (known) optimal parameters produces a prediction that is indeed close to the target value:

In [4]:
display(real_fn(X[1:1,:]))

1-element Knet.KnetArray{Float64,1}:
 4.54926

We can visualize the correspondence between our first feature (``X[:, 1]``) and the target values ``y`` by generating a scatter plot:

In [5]:
scatter(Array(X[:, 1]), Array(y))

Note that we have to move the data from the GPU to the CPU to plot it, and we do this by converting from `atype=KnetArray` to `Array`.

## Data iterators

Once we start working with neural networks, we're going to need to iterate through our data points quickly. We'll also want to be able to grab batches of ``k`` data points at a time and also shuffle our data. Let us create a data iterator from scratch. The steps to follow are: 

* Randomly shuffle the data and split it into train and test sets as defined by the `test` split ratio.
* Divide the train and test sets into minibatches

In [6]:
function splitdata(X, y, test=0.1, shuffle=true)

    n = round(Int, (1 - test) * size(X, 1));
    if shuffle; r = randperm(size(X, 1)); else; r = 1:size(X, 1); end

    xtrn = X[r[1:n], :];
    ytrn = y[r[1:n]];
    xtst = X[r[n+1:end], :];
    ytst = y[r[n+1:end]];
    
    return xtrn, ytrn, xtst, ytst
end

splitdata (generic function with 3 methods)

In [7]:
function minibatch(X, y; batch_size=100, atype=KnetArray)
    
    nbatch = div(size(X, 1), batch_size);
    data   = [map(atype, [zeros(batch_size, 2), zeros(batch_size, 1)]) for i=1:nbatch]
    k = 1
    for n = 1:nbatch
        data[n][1][:,:] += X[k:k + batch_size - 1, :]
        data[n][2][:] += y[k:k + batch_size - 1]  
        k += batch_size
    end
    
    return data
end

minibatch (generic function with 1 method)

In [8]:
xtrn, ytrn, xtst, ytst = splitdata(X, y);
train_data = minibatch(xtrn, ytrn, batch_size=4);
test_data  = minibatch(xtst, ytst, batch_size=4);

We can easily fetch batches by iterating over `train_data`. First, let's just grab one batch and break out of the loop:

In [9]:
for (xbatch, ybatch) in train_data
    display(xbatch)
    display(ybatch)
    break
end

4×2 Knet.KnetArray{Float64,2}:
 -0.562271   1.14743 
 -0.962688  -0.764249
 -0.93235    0.559942
  0.356588   1.38153 

4×1 Knet.KnetArray{Float64,2}:
 -0.838251
  4.87675 
  0.408143
  0.304118

Notice that the data lives on the GPU, since it is of type `KnetArray`.

In [10]:
length(train_data)

2250

Notice that the length of our training data is 2250. This is because 10% of the total samples (10000) were set aside for testing. Thus, 9000/4 = 2250. 
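
The batch count can be reproduced with a few lines of arithmetic. This is only a sketch mirroring the expressions used in `splitdata` and `minibatch` above; the variable names `test_ratio`, `ntrain`, and `ntest` are illustrative.

```julia
num_examples = 10000
test_ratio   = 0.1
batch_size   = 4

ntrain = round(Int, (1 - test_ratio) * num_examples)  # examples kept for training
ntest  = num_examples - ntrain                        # examples set aside for testing

# Integer division: a trailing partial batch, if there were one, would be dropped.
ntrain_batches = div(ntrain, batch_size)
ntest_batches  = div(ntest, batch_size)
```

Because `div` truncates, a training set whose size is not a multiple of `batch_size` would silently lose its last few examples; here 9000 divides evenly by 4, so nothing is lost.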

## Model parameters

Now let's allocate some memory for our parameters and set their initial values. We draw the initial weight matrix from a normal distribution and set the biases to zero:

In [11]:
w = map(atype, [randn(num_inputs, num_outputs), zeros(num_outputs, 1)])

2-element Array{Knet.KnetArray{Float64,2},1}:
 Knet.KnetArray{Float64,2}(Knet.KnetPtr(Ptr{Void} @0x000001020da00000, 16, 0, nothing), (2, 1))
 Knet.KnetArray{Float64,2}(Knet.KnetPtr(Ptr{Void} @0x000001020da00200, 8, 0, nothing), (1, 1)) 

In the succeeding cells, we're going to update these parameters to better fit our data. This will involve taking the gradient (a multi-dimensional derivative) of some *loss function* with respect to the parameters. We'll update each parameter in the direction that reduces the loss.

## Neural networks

Next we'll want to define our model. In this case, we'll be working with linear models, the simplest possible *useful* neural network. To calculate the output of the linear model, we simply multiply a given input with the model's weights (``w``), and add the offset ``b``.

In [12]:
predict(w, x) = x * w[1] .+ w[2]

predict (generic function with 1 method)

Ok, that was SUPER easy. 

## Loss function

Training a model means making it better and better over the course of training. But in order for this goal to make any sense at all, we first need to define what *better* means. In this case, we'll use the squared distance between our prediction and the true value. 

In [13]:
loss(w,x,y) = sum(abs2, y - predict(w, x)) / size(x,1)

loss (generic function with 1 method)

The variable w is a list of parameters (it could be a Tuple, Array, or Dict), x is the input and y is the desired output. To train this model, we want to adjust its parameters to reduce the loss on given training examples. The direction in the parameter space in which the loss reduction is maximum is given by the negative gradient of the loss. Knet uses the higher-order function [grad](http://denizyuret.github.io/Knet.jl/latest/reference.html#AutoGrad.grad) from [AutoGrad.jl](https://github.com/denizyuret/AutoGrad.jl) to compute the gradient direction:

In [14]:
lossgradient = grad(loss)

(::gradfun) (generic function with 1 method)

Note that [grad](http://denizyuret.github.io/Knet.jl/latest/reference.html#AutoGrad.grad) is a higher-order function that takes and returns other functions. The `lossgradient` function takes the same arguments as `loss`, e.g. `dw = lossgradient(w,x,y)`. Instead of returning a loss value, `lossgradient` returns `dw`, the gradient of the loss with respect to its first argument `w`. The type and size of `dw` are identical to those of `w`; each entry in `dw` gives the derivative of the loss with respect to the corresponding entry in `w`.
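
For this particular loss the gradient can also be derived by hand, which gives a way to sanity-check what an automatic gradient should return. The sketch below uses plain CPU arrays and no Knet or AutoGrad calls; `analytic_gradient` and `numeric_partial` are hypothetical helper names, and the data values are made up. It restates `predict` and `loss` with the same formulas as above and compares the hand-derived gradient against central finite differences.

```julia
# CPU-only restatement of the notebook's model and loss with plain Arrays.
predict(w, x) = x * w[1] .+ w[2]
loss(w, x, y) = sum(abs2, y .- predict(w, x)) / size(x, 1)

# Hand-derived gradient of the mean squared error with respect to w = [W, b].
function analytic_gradient(w, x, y)
    r  = y .- predict(w, x)              # residuals, n×1
    n  = size(x, 1)
    dW = -2 .* (x' * r) ./ n             # same shape as w[1]
    db = fill(-2 * sum(r) / n, 1, 1)     # same shape as w[2]
    return [dW, db]
end

# Central finite difference for a single entry of one parameter array.
function numeric_partial(w, x, y, k, idx; h=1e-6)
    wp = deepcopy(w); wp[k][idx] += h
    wm = deepcopy(w); wm[k][idx] -= h
    return (loss(wp, x, y) - loss(wm, x, y)) / (2h)
end

x = [1.0 2.0; 0.0 1.0; 3.0 0.5; -1.0 0.0]        # hypothetical inputs
y = reshape([0.5, 1.0, 8.0, 2.0], 4, 1)          # hypothetical targets, one column
w = [reshape([0.3, -0.2], 2, 1), zeros(1, 1)]    # weights and bias, as in the notebook

dw = analytic_gradient(w, x, y)
```

As described above for `lossgradient`, `dw` has the same structure as `w`: `dw[1]` matches the shape of the weight matrix and `dw[2]` the shape of the bias.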

## Optimizer

It turns out that linear regression actually has a closed-form solution. However, most interesting models that we'll care about cannot be solved analytically. So we'll solve this problem by stochastic gradient descent (SGD). At each step, we'll estimate the gradient of the loss with respect to our weights, using one batch randomly drawn from our dataset. Then, we'll update our parameters a small amount in the direction that reduces the loss. The size of the step is determined by the *learning rate* ``lr``. 
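
For reference, the closed-form route can be sketched in a few lines. This is an illustration on a tiny noise-free dataset built from the same true parameters as above, not part of the training pipeline; the names `Xa`, `theta`, `w_hat`, and `b_hat` are illustrative. The `using LinearAlgebra` line is the Julia 1.x spelling; on Julia 0.6, as used in this notebook, `\` lives in Base and that line should be dropped.

```julia
using LinearAlgebra

# Small noise-free dataset generated from the notebook's true parameters.
X = [ 1.0  2.0;
      0.0  1.0;
      3.0  0.5;
     -1.0  0.0;
      2.0 -1.0]
y = 2 .* X[:, 1] .- 3.4 .* X[:, 2] .+ 4.2

# Append a column of ones so the bias is estimated as one more weight.
Xa = hcat(X, ones(size(X, 1)))

# For a tall matrix, `\` returns the least-squares solution.
theta = Xa \ y
w_hat, b_hat = theta[1:2], theta[3]
```

Without noise, `w_hat` and `b_hat` recover the true weights and bias up to floating-point error. With noisy labels they would only approximate them, which is also what SGD converges toward.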

We can perform gradient descent with the function `update!(weights, gradients, params)`, which updates the weights using their gradients and the optimization algorithm parameters specified by `params`. The 2-arg version defaults to the `Sgd` algorithm with learning rate `lr` and gradient clip `gclip`. `gclip==0` indicates no clipping. The weights and possibly gradients and params are modified in-place.
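
To illustrate what a plain SGD step with optional gradient clipping amounts to, here is a minimal sketch in plain Julia. It is not Knet's implementation: `sgd_step!` is a hypothetical name, and the clipping rule (rescale the gradient when its norm exceeds `gclip`) is an assumption about the usual meaning of norm clipping.

```julia
# One plain-SGD step: optionally rescale the gradient, then move against it.
function sgd_step!(w, g; lr=0.001, gclip=0)
    if gclip > 0
        gnorm = sqrt(sum(abs2, g))
        if gnorm > gclip
            g = g .* (gclip / gnorm)   # rescaled copy; the caller's g is untouched
        end
    end
    w .-= lr .* g                      # in-place update of the weights
    return w
end

g = [3.0, 4.0]                         # example gradient with norm 5

w = [1.0, 2.0]
sgd_step!(w, g; lr=0.1)                # gclip == 0: no clipping

w_clipped = [1.0, 2.0]
sgd_step!(w_clipped, g; lr=0.1, gclip=1.0)   # gradient rescaled to norm 1 first
```

The clipped step moves the weights in the same direction as the unclipped one, only by a smaller amount, which is why clipping is a common guard against occasional very large gradients.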

## Execute training loop

Now that we have all the pieces, we just need to wire them together by writing a training loop. 
First we'll define ``epochs``, the number of passes to make over the dataset. Then for each pass, we'll iterate through ``train_data``, grabbing batches of examples and their corresponding labels. 

For each batch, we'll go through the following ritual:
     
* Generate predictions (``yhat``) and the loss (``loss``) by executing a forward pass through the network.
* Calculate gradients by making a backwards pass through the network (``lossgradient(w, xbatch, ybatch)``). 
* Update the model parameters by invoking our SGD optimizer.     



In [15]:
function train(w, train_data; lr=1e-4, epochs=10)
    train_loss = zeros(epochs)  # one average loss value per epoch
    for epoch = 1:epochs
        for (xbatch, ybatch) in train_data
            train_loss[epoch] += loss(w, xbatch, ybatch);
            g = lossgradient(w, xbatch, ybatch)
            for i = 1:length(w)
                w[i] -= lr * g[i]
            end
        end
        
        train_loss[epoch] /= length(train_data)
        print(@sprintf "train loss: %f \n" train_loss[epoch])

    end
    
    return train_loss
end

train (generic function with 1 method)

In [16]:
w0 = copy(w) # create copy of initial weights
train_loss = train(w, train_data);

train loss: 31.824946 
train loss: 12.843956 
train loss: 5.188396 
train loss: 2.100008 
train loss: 0.853800 
train loss: 0.350806 
train loss: 0.147727 
train loss: 0.065707 
train loss: 0.032566 
train loss: 0.019167 


## Visualizing our training progress

In the succeeding chapters, we'll introduce more realistic data, fancier models, more complicated loss functions, and more. But the core ideas are the same and the training loop will look remarkably familiar. Because these tutorials are self-contained, you'll get to know this ritual quite well. In addition to updating our model, we'll often want to do some bookkeeping. Let us plot the training loss, our initial predictions, and the final predictions after training:

In [17]:
ypred_init = Array(predict(w0, xtst));
ypred      = Array(predict(w, xtst));

In [18]:
xtst = Array(xtst);
ytst = Array(ytst);

In [19]:
l = @layout([[a; b] c])
1:10, train_loss
hms1 = [[xtst[1:100, 2], xtst[1:100, 2], 1:10], [ypred_init[1:100], ypred[1:100], train_loss]];
hms2 = [[xtst[1:100, 2], xtst[1:100, 2]], [ytst[1:100], ytst[1:100]]];

plot(hms1...,layout=l, t=[:scatter :scatter :line], color=:green, title=[:Initialized :Trained :Loss],legend=false)
plot!(hms2...,layout=l, t=[:scatter :scatter], color=:red, marker=([:hex :hex], 3), labels=:Real)

As you can see, Julia `Plots` is a powerful convenience for Julia visualizations and data analysis. Using `@layout` we can design arbitrary subplot layouts. Further, we can easily plot and manipulate 3D images:

In [20]:
scatter(xtst[1:100, 1], xtst[1:100, 2], ypred_init[1:100], legend=false)
scatter!(xtst[1:100, 1], xtst[1:100, 2], ytst[1:100])

## Conclusion 

You've seen that by using Knet with its `grad` function, we can build statistical models from scratch. In the following tutorials, we'll build on this foundation, introducing the basic ideas behind modern neural networks and demonstrating the powerful abstractions in Knet.

## Next
[Logistic regression](section3-logistic-regression.ipynb)

For whinges or inquiries, [open an issue on  GitHub.](https://github.com/moralesq/Knet-the-Julia-dope)