## Overview
 * AutoGrad
 * Logistic Regression revisited
 * Multiclass logistic regression
 * Multilayer neural networks and deep learning
 
These notes closely follow [Knet-the-Julia-dope](https://github.com/moralesq/Knet-the-Julia-dope), which is in turn based on [Deep Learning - The Straight Dope](http://gluon.mxnet.io/). The latter uses MXNet, which can also be accessed from Julia.



### Introduction
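Gradients in neural networks are typically computed with back-propagation. Rather than deriving and coding the backward pass by hand, one can use automatic differentiation, which records the operations of the forward computation and derives the gradient from them. This is the approach taken by the popular [autograd](https://github.com/HIPS/autograd) Python package and its Julia port AutoGrad.jl, which is used by Knet.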




In [None]:
using AutoGrad
using Knet, Plots, DataFrames
gr()
using Suppressor

As a toy example, consider $f(x) = 2x^2$.

In [None]:
f(x) = 2x^2;   # scalar toy function; its derivative is 4x

In Knet, `g = grad(f)` generates a gradient function `g`, which takes the same inputs as the function `f` but returns the gradient with respect to the first argument. The gradient function `g` triggers recording by boxing the parameters in a special data type and then calls `f`. The elementary operations in `f` are overloaded to record their actions and to output boxed answers when their inputs are boxed. The sequence of recorded operations is then used to compute gradients.

In [None]:
g = grad(f)
g(5.0)
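
Since $f(x)=2x^2$ has derivative $f'(x)=4x$, the call `g(5.0)` should return `20.0`. A quick way to sanity-check any gradient function is a central finite difference. The sketch below is plain Julia with no AutoGrad dependency; the names `f_toy`, `df_exact` and `finite_diff` are hypothetical and introduced only for this illustration.

```julia
# Toy function from the text and its hand-derived derivative
f_toy(x) = 2x^2
df_exact(x) = 4x

# Central finite difference; an illustrative helper, not part of AutoGrad
finite_diff(f, x; h=1e-5) = (f(x + h) - f(x - h)) / (2h)

approx = finite_diff(f_toy, 5.0)
println("finite difference: ", approx, "   exact: ", df_exact(5.0))
```

The same check can be applied to the function returned by `grad(f)` whenever a gradient looks suspicious.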

# Gradient Descent

In order to minimize $f$ we consider the recursion below, where $\epsilon$ is the step size (`lr` in the code):
$$ x_{n+1}=x_{n}-\epsilon \nabla f(x_n)$$

In [None]:
using Plots
xs=Float64[]
lr=0.05
x=4.0
for i=1:100
    x-=lr *g(x)
    push!(xs,x)
end
plot([1:100;],xs); ylabel!("x");xlabel!("steps")
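
For the toy function the recursion can be solved in closed form, which explains the shape of the plot. With $f(x)=2x^2$ we have $\nabla f(x)=4x$, so

$$x_{n+1} = x_n - 4\epsilon\, x_n = (1-4\epsilon)\,x_n \quad\Longrightarrow\quad x_n = (1-4\epsilon)^n x_0 .$$

The iterates converge to the minimiser $x=0$ exactly when $|1-4\epsilon|<1$, i.e. $0<\epsilon<0.5$. For the values used in the cell ($\epsilon=0.05$, $x_0=4$) this gives $x_n = 4\cdot 0.8^n$, a geometric decay.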



# (Stochastic) Gradient Descent
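In stochastic gradient descent the exact gradient is replaced by a noisy estimate $\widehat{\nabla f}_n$:
$$ x_{n+1}=x_{n}-\epsilon \widehat{\nabla f}_n(x_n)$$
where the estimate is unbiased, $\mathbb{E}\widehat{\nabla f}_n(x)=\nabla f(x)$. In the cell below the noise is simulated by adding a standard normal random number to the exact gradient.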

In [None]:
using Plots
xs=Float64[]
lr=0.05
x=4.0
for i=1:100
    x-=lr *(g(x)+randn())
    push!(xs,x)
end
plot([1:100;],xs)
ylabel!("x");xlabel!("steps")
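
The noisy recursion no longer converges to the minimiser exactly; it keeps fluctuating around it. The following is a minimal sketch in plain Julia 1.x (standard library only, with the hand-coded gradient $4x$ standing in for the AutoGrad-generated `g`; the names `grad_f` and `noisy_descent` are hypothetical) that runs the recursion with and without noise:

```julia
using Random

# Hand-coded gradient of f(x) = 2x^2 (stands in for the AutoGrad-generated g)
grad_f(x) = 4x

# Run the recursion x <- x - lr * (grad_f(x) + sigma * noise).
# sigma = 0 gives plain gradient descent, sigma = 1 adds standard normal noise.
function noisy_descent(x0; lr=0.05, steps=100, sigma=1.0, rng=MersenneTwister(0))
    x = x0
    xs = Float64[]
    for _ in 1:steps
        x -= lr * (grad_f(x) + sigma * randn(rng))
        push!(xs, x)
    end
    return xs
end

exact = noisy_descent(4.0; sigma=0.0)
noisy = noisy_descent(4.0; sigma=1.0)
println("final iterate without noise: ", exact[end])
println("final iterate with noise:    ", noisy[end])
```

Without noise the iterate shrinks by the factor $1-4\epsilon$ at every step. With noise the recursion is a linear system driven by independent noise, so the iterates settle into a band around zero whose standard deviation is $\epsilon/\sqrt{1-(1-4\epsilon)^2}\approx 0.08$ for $\epsilon=0.05$.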

# Binary classification with logistic regression


The simplest kind of classification problem is *binary classification*, when there are only two categories, so let's start there. Let's call our two categories the positive class $y_i=1$ and the negative class $y_i = 0$ (another common way of defining the labels is $y_i=\pm1$). Even with just two categories, and even confining ourselves to linear models, there are many ways we might approach the problem. For example, we might try to draw a line that best separates the points:

![](../img/linear-separator.png)

Recall that in linear regression, we made predictions of the form

$$ \hat{y} = \boldsymbol{w}^T \boldsymbol{x} + b, $$

where $\hat{y},b\in\mathbb{R}$ and $\boldsymbol{w},\boldsymbol{x}\in\mathbb{R}^d$. We are interested in asking the question *"what is the probability that example $\boldsymbol{x}$ belongs to the positive class?"* A regular linear model is a poor choice here because it can output values greater than $1$ or less than $0$. To coerce reasonable answers from our model,  we're going to modify it slightly, by running the linear function through a sigmoid activation function $\sigma$:

$$ \hat{y} =\sigma(\boldsymbol{w}^T \boldsymbol{x} + b). $$

The sigmoid function $\sigma$, sometimes called a squashing function or a *logistic* function, maps a real-valued input to the range 0 to 1. If we pick the labels $y\in\{0,1\}$, we may assign

\begin{equation}
\begin{aligned}
\mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\
\mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\
\end{aligned}
\end{equation}

Compact form: $\mathbb{P}(y|z)  =\sigma(z)^y(1-\sigma(z))^{1-y}$. Let us define and visualize this function:

In [None]:
# Logistic function
sigmoid(z) = 1 ./ (1 .+ exp.(-z))   # fully broadcast, so it accepts scalars, ranges and arrays
plot(-5:0.1:5, sigmoid(-5:0.1:5), xlabel=:z, ylabel="sigmoid(z)", title="Logistic Function", legend=false, size=(400,200))

## Fitting the model

We fit the parameters $\theta=(\boldsymbol{w},b)$ with the maximum likelihood estimator:
$$\max_{\theta} \mathbb{P}_{\theta}\big( y_1,\dots,y_m \big|\,\boldsymbol{x}_1,\dots,\boldsymbol{x}_m \big)=\max_\theta \prod_{i=1}^m\mathbb{P}_\theta(y_i| \boldsymbol{x}_i).$$

The joint probability factorises because each example is independent of the others. Since the logarithm is increasing, we can equivalently maximise the log-likelihood, which turns the product into a sum:

$$\max_\theta \log\Big(\prod_{i=1}^m\mathbb{P}_\theta(y_i|\boldsymbol{x}_i)\Big)= \max_\theta\sum_{i=1}^m\log\big(\mathbb{P}_\theta(y_i|\boldsymbol{x}_i)\big)=\max_\theta\Big[\log\big(\mathbb{P}_\theta(y_1|\boldsymbol{x}_1)\big)+\cdots+\log\big(\mathbb{P}_\theta(y_m|\boldsymbol{x}_m)\big)\Big]$$

Because we typically express our objective as a *loss*, we can just flip the sign, giving us the *negative log probability*:

$$  \min_\theta \Big(- \sum_{i=1}^m\log\big(\mathbb{P}_\theta(y_i|\boldsymbol{x}_i)\big)\Big)$$

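With $z_i=\boldsymbol{w}^T\boldsymbol{x}_i+b$, we can write $\mathbb{P}_\theta(y_i|z_i)$ compactly as

$$\mathbb{P}_\theta(y_i|z_i) =\sigma(z_i)^{y_i}(1-\sigma(z_i))^{1-y_i}.$$

Taking logarithms and using $1-\sigma(z)=\sigma(-z)$ and $\log\sigma(z)-\log\sigma(-z)=z$ gives

$$
\log\big(\mathbb{P}_\theta(y|z)\big)=y\log\sigma(z)+(1-y)\log\sigma(-z)=yz + \log\sigma(-z).
$$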
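
Evaluating $\log\sigma(-z)$ literally breaks down for large $z$, because $e^{z}$ overflows. A common remedy, sketched here in plain Julia (the names `naive_logloss` and `stable_logloss` are hypothetical), rewrites the per-example loss $-\log\mathbb{P}_\theta(y|z) = \log(1+e^{z}) - yz$ so that `exp` is only ever called on a non-positive argument:

```julia
# Per-example negative log-likelihood for a label y in {0, 1} and a score z.
# Direct transcription of -(y*z + log(sigma(-z))).
naive_logloss(z, y) = -(y * z + log(1 / (1 + exp(z))))

# Same quantity, using log(1 + e^z) = max(z, 0) + log1p(exp(-|z|)),
# so exp never receives a large positive argument.
stable_logloss(z, y) = max(z, 0) + log1p(exp(-abs(z))) - y * z

for (z, y) in [(0.5, 1), (-2.0, 0), (800.0, 1), (800.0, 0)]
    println("z = ", z, ", y = ", y,
            "  naive = ", naive_logloss(z, y),
            "  stable = ", stable_logloss(z, y))
end
```

For moderate scores both versions agree; for $z=800$ the naive version returns `Inf` while the rewritten one stays finite.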



Note that this loss function is commonly called *log loss* and also commonly referred to as *binary cross entropy*. It is a special case of negative log [likelihood](https://en.wikipedia.org/wiki/Likelihood_function). And it is a special case of [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy), which can apply to the multi-class ($>2$) setting. 

**If instead we were to use the labels $y_i=\pm1$, the loss function has to be modified to $\log(1+e^{-y_i z_i})$. This usually leads to a lot of confusion as to why there exist two versions of logistic regression. See [here](https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression/279698#279698) for more information on the topic.**
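
The two conventions describe the same model. With $\tilde{y}=2y-1\in\{-1,+1\}$, the loss $\log(1+e^{-\tilde{y}z})$ equals the $\{0,1\}$ loss $\log(1+e^{z})-yz$ for every $z$. A small numerical check in plain Julia (the function names `loss01` and `loss_pm1` are hypothetical):

```julia
# Loss for labels y in {0, 1}
loss01(z, y) = log(1 + exp(z)) - y * z

# Loss for labels ysigned in {-1, +1}
loss_pm1(z, ysigned) = log(1 + exp(-ysigned * z))

for z in (-3.0, -0.5, 0.0, 1.2, 4.0), y in (0, 1)
    ysigned = 2y - 1                      # map {0, 1} to {-1, +1}
    @assert isapprox(loss01(z, y), loss_pm1(z, ysigned); atol=1e-12)
end
println("both label conventions give the same loss")
```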

## The Adult Dataset

We'll use the Adult dataset taken from the [UCI repository](http://archive.ics.uci.edu/ml/datasets.html). 

 * The original dataset contained $14$ features, including age, education, occupation, sex and native-country, among others. In this version, hosted by [National Taiwan University](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), the data have been re-processed into 123 binary features.
 * The label is a binary indicator of whether the person corresponding to each row made more ($y_i = 1$) or less ($y_i = 0$) than \$50,000 of income in 1994.
 * The dataset we're working with contains 30,956 training examples and 1,605 examples set aside for testing. We can download and read the datasets into main memory like so:

In [None]:
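# a1a.t (30,956 rows) is used for training and a1a (1,605 rows) for testing, matching the sizes quoted above
url_train = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a.t"
url_test  = "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a"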

if !isfile("adult.train")
    rawdata_train = readtable(download(url_train, "adult.train"), header=false);
else
    rawdata_train = readtable("adult.train", header=false);
end

if !isfile("adult.test")
    rawdata_test  = readtable(download(url_test, "adult.test"), header=false);
else
    rawdata_test  = readtable("adult.test", header=false);    
end

@sprintf "Training size = %d    Testing size = %d" size(rawdata_train, 1) size(rawdata_test, 1)

The data consists of lines like the following:

-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1

The first entry in each row is the value of the label. The following tokens are the indices of the non-zero features. The number $1$ here is redundant. But we don't always have control over where our data comes from, so we might as well get used to mucking around with weird file formats. Let's write a simple script to process our dataset.

In [None]:
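# Convert one raw line, e.g. "-1 4:1 6:1 ... 83:1", into a 1×124 row:
# columns 1..123 hold the binary features, column 124 holds the label.
function nonzeroindex(sample)

    s      = split(sample, ":1 ")             # pieces ending in a feature index
    val    = parse(Int64, split(sample)[1])   # the label is the first token (-1 or 1)
    s[1]   = split(s[1])[2]                   # drop the label from the first piece
    s[end] = split(s[end], ":")[1]            # drop the trailing ":1" from the last piece
    
    output = zeros(Float32, 1, 124)
    output[parse.(Int64, s)] = 1              # switch on the listed features
    output[end] = val
    return output
end

function processdata(rawdata; atype=Array{Float32})
    # stack all rows and transpose, giving a 124×N matrix (one column per example)
    data = map(atype, [vcat(nonzeroindex.(rawdata[1])...)'])[1];
    # change label from {-1,1} to {0,1}
    x, y = map(atype, [data[1:end-1, :], (data[end:end, :] + 1) / 2])
    return x, y
end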
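
For comparison, the same parsing step can be written token by token, without relying on the `:1 ` separator. This is a sketch in Julia 1.x syntax using only the standard library; `parse_libsvm_line` and its `nfeatures` keyword are hypothetical names, and 123 is the feature count implied by the 124-column buffer used in `nonzeroindex`.

```julia
# Parse one LIBSVM-style line "label idx:val idx:val ..." into
# a dense feature vector and a label in {0, 1}.
function parse_libsvm_line(line::AbstractString; nfeatures::Int=123)
    tokens = split(line)
    label = parse(Int, tokens[1])            # -1 or +1 in the raw file
    x = zeros(Float32, nfeatures)
    for tok in tokens[2:end]
        idx, val = split(tok, ':')
        x[parse(Int, idx)] = parse(Float32, val)
    end
    return x, (label + 1) ÷ 2                # map {-1, 1} to {0, 1}
end

sample = "-1 4:1 6:1 15:1 21:1 35:1 40:1 57:1 63:1 67:1 73:1 74:1 77:1 80:1 83:1"
x, y = parse_libsvm_line(sample)
println("label = ", y, ", non-zero features = ", findall(!iszero, x))
```

Reading the value after the colon instead of assuming it is always `1` keeps the sketch usable for LIBSVM files with real-valued features.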

In [None]:
atype=Array{Float32}; 
xtrn, ytrn  = processdata(rawdata_train, atype=atype);
xtst, ytst  = processdata(rawdata_test, atype=atype);

We can also check the fraction of positive examples in our training and test sets. This gives us one nice (necessary but insufficient) sanity check that our training and test data really are drawn from the same distribution.

In [None]:
sum(ytrn) / length(ytrn), sum(ytst) / length(ytst)

Knet's minibatch function sets up the stochastic gradient
$$\nabla \sum_{i=1}^m\log\big(\mathbb{P}(y_i|\boldsymbol{x}_i)\big) \approx \frac{m}{s} \sum_{i=1}^s \nabla \log\big(\mathbb{P}(y_{\tau_i}|\boldsymbol{x}_{\tau_i})\big)$$

Here $\{\tau_1,\dots,\tau_s\}$ is a random subset of $\{1,\dots,m\}$ and $s$ is called the batch size. Stochastic gradients are implemented through Knet's *minibatch* function.

In [None]:
dtrn = Knet.minibatch(xtrn, ytrn, 64, shuffle=true);
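
What such a minibatch iterator does can be sketched in a few lines of plain Julia 1.x. This is not Knet's implementation; `simple_minibatches` is a hypothetical helper that shuffles the column indices once and cuts them into groups of size $s$, dropping an incomplete final group. The data are synthetic stand-ins with the same layout as `xtrn` and `ytrn`.

```julia
using Random

# Split the columns of x (features by examples) and y (1 by examples)
# into shuffled batches of `batchsize` examples each.
function simple_minibatches(x, y, batchsize; rng=MersenneTwister(1))
    m = size(x, 2)
    perm = randperm(rng, m)                 # random order tau_1, ..., tau_m
    batches = []
    for b in 1:div(m, batchsize)            # an incomplete last batch is dropped
        first_idx = (b - 1) * batchsize + 1
        last_idx = b * batchsize
        idx = perm[first_idx:last_idx]
        push!(batches, (x[:, idx], y[:, idx]))
    end
    return batches
end

xdemo = rand(Float32, 123, 1000)            # synthetic stand-in for xtrn
ydemo = Float32.(rand(1, 1000) .> 0.75)     # synthetic stand-in for ytrn
batches = simple_minibatches(xdemo, ydemo, 64)
println(length(batches), " batches, first input batch has size ", size(batches[1][1]))
```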

## Minimizing the loss
*w* holds the weights, which are chosen to minimise the cross-entropy loss. This is achieved by writing down the loss and taking its gradient with respect to the first argument.

In [None]:
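# Linear score z = w*x + b, with w[1] the 1×d weight row and w[2] the bias
pred(w, x) = w[1] * x .+ w[2];

# Summed binary cross-entropy (log loss) over the minibatch
function loss(w, x, y)
    yhat = sigm.(pred(w, x))
    return -sum(y .* log.(yhat) .+ (1 .- y) .* log.(1 .- yhat))
end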
lossgradient  = grad(loss)
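
For reference, the gradient that `grad(loss)` computes can also be derived by hand: with $\hat{y}=\sigma(\boldsymbol{w}^T\boldsymbol{x}+b)$, the derivative of the summed log loss is $(\hat{y}-y)\,\boldsymbol{x}^T$ with respect to $\boldsymbol{w}$ and $\sum_i(\hat{y}_i-y_i)$ with respect to $b$. The sketch below (plain Julia 1.x, Float64, synthetic data; all names are hypothetical and independent of the Knet cell) checks that formula against finite differences.

```julia
using Random

sigma(z) = 1 / (1 + exp(-z))

# Summed log loss for a 1×d weight row w, scalar bias b,
# d×m inputs x and 1×m labels y in {0, 1}.
function logloss(w, b, x, y)
    yhat = sigma.(w * x .+ b)
    return -sum(y .* log.(yhat) .+ (1 .- y) .* log.(1 .- yhat))
end

# Hand-derived gradients with respect to w and b
function logloss_grad(w, b, x, y)
    residual = sigma.(w * x .+ b) .- y        # 1×m row of yhat - y
    return residual * x', sum(residual)
end

rng = MersenneTwister(42)
d, m = 5, 20
w = 0.1 .* randn(rng, 1, d)
b = 0.3
x = randn(rng, d, m)
y = Float64.(rand(rng, 1, m) .> 0.5)

gw, gb = logloss_grad(w, b, x, y)

# Central finite differences in the first weight and in the bias
h = 1e-6
e1 = zeros(1, d); e1[1] = h
fd_w1 = (logloss(w .+ e1, b, x, y) - logloss(w .- e1, b, x, y)) / (2h)
fd_b  = (logloss(w, b + h, x, y) - logloss(w, b - h, x, y)) / (2h)
println("dL/dw[1]: analytic ", gw[1], "  finite difference ", fd_w1)
println("dL/db:    analytic ", gb, "  finite difference ", fd_b)
```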


The training loop consists of
  * an outer loop counting the epochs (full passes through the data set), and
  * an inner loop that does gradient descent based on the stochastic gradient of the loss function.

In [None]:
function train(w, dtrn; lr=1e-6, epochs=5)
    tloss = []
    for epoch = 1:epochs
        eloss = 0
        for (x,y) in dtrn
            eloss += loss(w, x, y)
            g = lossgradient(w, x, y)
            for i = 1:length(w)
                w[i] -= lr * g[i]
            end
        end
        push!(tloss, eloss/length(dtrn))
    end
    
    return w, tloss
end

## Training output
Before training

In [None]:
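# Assumed helper (not defined in this section): share of examples whose 0.5-thresholded prediction matches the label
Accuracy(w, x, y) = sum((sigm.(pred(w, x)) .> 0.5) .== (y .> 0.5)) / length(y)
w = map(atype, Any[randn(1, size(xtrn, 1)), zeros(Float32,1,1) ]);
Accuracy(w, xtst, ytst)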

In [None]:
w, Loss = train(w, dtrn; epochs=30, lr=1e-2);

and after:

In [None]:
@show Accuracy(w, xtrn, ytrn)
@show Accuracy(w, xtst, ytst)

## Interpretation of the output

 * A naive classifier that predicts that nobody had an income greater than \$50k (the majority class) would achieve an accuracy of roughly 75\%.
 * By contrast, our classifier gets an accuracy of about 0.84 (results may vary a small amount on each run owing to random initializations and random sampling of the batches).


# Multiclass Logistic Regression 


* Binary classification is quite useful (spam vs. not spam, or cancer vs. not cancer).
* But what can we do when there are $K$ classes (e.g. different handwritten digits)?
 1. Use binary classifiers in a clever way:
   * train $K$ different binary classifiers $f_k(\boldsymbol{x})$ and use the one with the highest probability;
   * this amounts to repeated one-vs-the-rest classification.
 2. Generate an output that is a discrete probability distribution over the $K$ classes.
 
The *softmax* function achieves the latter by exponentiating each component of the score vector and normalising:

$$\mbox{softmax}(\boldsymbol{z}) = \frac{e^{\boldsymbol{z}} }{\sum_{k=1}^K e^{z_k}},$$

where the exponential in the numerator is applied componentwise.

Because we now have $K$ outputs, we can represent the model graphically:
![](../img/simple-softmax-net.png)

The mapping from inputs to outputs is a matrix-vector product $W \boldsymbol{x} + \boldsymbol{b}$ followed by the softmax:

$$\hat{y} = \mbox{softmax}(W\boldsymbol{x} + \boldsymbol{b})$$

This model is sometimes called *multiclass logistic regression*, *softmax regression* or *multinomial regression*.
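
Exponentiating large scores overflows, so practical implementations usually subtract the column-wise maximum first. This leaves the result unchanged, because softmax is invariant to adding the same constant to every component. A minimal sketch in plain Julia 1.x (`stable_softmax` is a hypothetical name; scores are stored as a $K\times m$ matrix with one column per example):

```julia
# Column-wise softmax of a K×m score matrix, shifted for numerical stability
function stable_softmax(z)
    shifted = z .- maximum(z, dims=1)      # the largest entry of each column becomes 0
    e = exp.(shifted)
    return e ./ sum(e, dims=1)
end

# The second column is the first one shifted by 999, so both columns
# must give the same probabilities.
z = [1.0 1000.0;
     2.0 1001.0;
     3.0 1002.0]
p = stable_softmax(z)
println(p)
println("column sums: ", sum(p, dims=1))
```

A direct `exp.(z)` would return `Inf` for the second column and produce `NaN` after normalisation.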


## The MNIST dataset

In this example we build a simple classification model for the [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digit recognition dataset. MNIST has 60000 training and 10000 test examples. Each input x consists of 784 pixels representing a 28x28 image. The corresponding output indicates the identity of the digit 0..9.

![png](https://jamesmccaffrey.files.wordpress.com/2014/06/firsteightimages.jpg) 

To start, we'll use Knet's utilities for grabbing a copy of this dataset:

In [None]:

for p in ("GZip",)
    Pkg.installed(p) == nothing && Pkg.add(p)
end

using GZip

"Where to download mnist from"
mnisturl = "http://yann.lecun.com/exdb/mnist"

"Where to download mnist to"
mnistdir = "./"

"""
This utility loads the [MNIST](http://yann.lecun.com/exdb/mnist)
hand-written digits dataset.  There are 60000 training and 10000 test
examples. Each input x consists of 784 pixels representing a 28x28
grayscale image.  The pixel values are converted to Float32 and
normalized to [0,1].  Each output y is a UInt8 indicating the correct
class.  10 is used to represent the digit 0.
```
# Usage:
include(Pkg.dir("Knet/data/mnist.jl"))
xtrn, ytrn, xtst, ytst = mnist()
# xtrn: 28×28×1×60000 Array{Float32,4}
# ytrn: 60000-element Array{UInt8,1}
# xtst: 28×28×1×10000 Array{Float32,4}
# ytst: 10000-element Array{UInt8,1}
```
"""
function mnist()
    global _mnist_xtrn,_mnist_ytrn,_mnist_xtst,_mnist_ytst
    if !isdefined(:_mnist_xtrn)
        info("Loading MNIST...")
        _mnist_xtrn = _mnist_xdata("train-images-idx3-ubyte.gz")
        _mnist_xtst = _mnist_xdata("t10k-images-idx3-ubyte.gz")
        _mnist_ytrn = _mnist_ydata("train-labels-idx1-ubyte.gz")
        _mnist_ytst = _mnist_ydata("t10k-labels-idx1-ubyte.gz")
    end
    return _mnist_xtrn,_mnist_ytrn,_mnist_xtst,_mnist_ytst
end

"Utility to view a MNIST image, requires the Images package"
mnistview(x,i)=colorview(Gray,permutedims(x[:,:,1,i],(2,1)))

function _mnist_xdata(file)
    a = _mnist_gzload(file)[17:end]
    reshape(a ./ 255f0, (28,28,1,div(length(a),784)))
end

function _mnist_ydata(file)
    a = _mnist_gzload(file)[9:end]
    a[a.==0] = 10
    # full(sparse(a,1:length(a),1f0,10,length(a)))
    return a
end

function _mnist_gzload(file)
    if !isdir(mnistdir)
        mkpath(mnistdir)
    end
    path = joinpath(mnistdir,file)
    if !isfile(path)
        url = "$mnisturl/$file"
        download(url, path)
    end
    f = gzopen(path)
    a = read(f)
    close(f)
    return(a)
end
xtrn, ytrn, xtst, ytst = mnist()


There are two parts of the dataset for training and testing. Each part has N items and each item is a tuple of an image and a label:

In [None]:
xtrn, ytrn, xtst, ytst = mnist()

size(xtrn), size(ytrn)

Note that each image has been formatted as a 3-tuple (height, width, channel). For color images, the channel dimension would have size 3 (red, green and blue). In this case we have a grayscale image with a single channel. Note that `ytrn` is an `Array{UInt8,1}`. Let us take a look at these labels:

In [None]:
ytrn[1:10], Int.(ytrn[1:10])   # raw UInt8 labels and the same labels converted to Int for readability

Ok, let us now use Knet's minibatch function to prepare the data for training:

In [None]:
dtrn = minibatch(xtrn, ytrn, 100; xtype=atype);
dtst = minibatch(xtst, ytst, 100; xtype=atype);

Note that the input images have been reshaped to 784x1

In [None]:
dtrn.xtype, dtrn.ytype

## Model 

### Multiclass logistic regression


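Given $W$ and $b$, what is the loss?

 1. The misclassification loss counts the number of instances that are classified incorrectly. It is piecewise constant in $W$ and $b$, so its gradient carries no information.
 2. As in logistic regression, we would like to take gradients to find $W_{opt}$ and $b_{opt}$, so we need a differentiable loss instead.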


The relevant loss function here is called [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy). The cross entropy between two probability distributions $p$ and $q$ is defined as

$$H(p,q)=\operatorname {E}_{p}[-\log q]=H(p)+D_{{{\mathrm  {KL}}}}(p\|q)=-\sum_{i=1}^n p_i \log q_i$$

We typically observe fixed labels, so $p$ puts all its mass on the true class $k$: $p_{i}=\begin{cases}
1 & i=k\\
0 & \text{otherwise}
\end{cases}$.

Thus the sum collapses to a single term, $H(p,q)=-\log q_k$, which is the log loss from before.


### Implementation

Assume that $y$ is a $K\times m$ matrix, where $m$ is the number of samples and $K$ is the number of labels. A naive implementation of this theory is as follows:

In [None]:
softmax(z) = exp.(z) ./ sum(exp.(z), 1)
cross_entropy(yhat, y) = - sum(y .* log.(yhat), 1)

In [None]:
m = 1000;
K = 10;
z = rand(K, m);
yhat = softmax(z);

Let's confirm that all columns of `yhat` (one per sample) indeed sum to 1:

In [None]:
mapslices(sum, yhat, 1)

We could then implement the total loss as follows (Knet ships a more numerically stable version as `nll`):

In [None]:
cross_entropy(yhat, y) = - sum(y .* log.(yhat))
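
When the labels are stored as class indices, as `ytrn` is for MNIST, the one-hot matrix never has to be built: the cross-entropy of example $j$ is just $-\log\hat{y}_{k_j,j}$, where $k_j$ is its true class. A sketch in plain Julia 1.x with synthetic scores (the helper names `softmax_cols` and `onehot` are hypothetical; Knet's `nll` is the library routine the text refers to and is not reproduced here):

```julia
using Random

softmax_cols(z) = exp.(z) ./ sum(exp.(z), dims=1)

# Build a K×m one-hot matrix from integer labels in 1:K
function onehot(labels, K)
    y = zeros(K, length(labels))
    for (j, k) in enumerate(labels)
        y[k, j] = 1
    end
    return y
end

rng = MersenneTwister(0)
K, m = 10, 1000
yhat = softmax_cols(rand(rng, K, m))
labels = rand(rng, 1:K, m)

# Total cross-entropy via the one-hot matrix ...
ce_onehot = -sum(onehot(labels, K) .* log.(yhat))
# ... and via direct indexing of the true-class probability
ce_index = -sum(log(yhat[labels[j], j]) for j in 1:m)
println(ce_onehot, "  ", ce_index)
```

Both numbers agree up to rounding; the indexed version touches $m$ entries instead of $K\cdot m$.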

## Gradient descent for multiclass logistic regression

In [None]:
predict(w,x) = w[1]*mat(x) .+ w[2] # mat reshapes a (28,28,1,N) array x into a (784,N) matrix
loss(w,x,ygold) = nll(predict(w,x), ygold)
lossgradient = grad(loss)

Now let's train a model on the MNIST data:

In [None]:
function train(w, data; lr=.1)
    for (x,y) in data
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end   
    end

end

In [None]:
w = map(atype, Any[ 0.1f0*randn(Float32, 10, 784), zeros(Float32, 10, 1) ]);
println((:epoch, 0, :trn, accuracy(w,dtrn,predict), :tst, accuracy(w,dtst,predict)))
for epoch=1:10
    train(w, dtrn; lr=0.5)
    println((:epoch, epoch, :trn, accuracy(w,dtrn,predict), :tst, accuracy(w,dtst,predict)))
end

Ok, let us check the accuracy on the testing and training set:

In [None]:
accuracy(w, dtrn, predict), accuracy(w, dtst, predict)

You might reasonably conclude that this problem is too easy to be taken seriously by experts.
But until recently, many papers (Google Scholar says 13,800) were published using results obtained on this data.

# Multilayer perceptrons from scratch

We now move from multiclass logistic regression, based on $$\hat{y} = \mbox{softmax}(W \boldsymbol{x} + b),$$ to **deep neural networks** with multiple layers.


Graphically, we could depict the model like this:
![](https://github.com/zackchase/mxnet-the-straight-dope/blob/master/img/simple-softmax-net.png?raw=true)


 * *But linearity is a strong assumption*: for each pixel, increasing its value either always increases the probability that the image depicts, say, a dog, or always decreases it.

We can model a more general class of functions by incorporating one or more *hidden layers*.
Each layer feeds in to the layer above it, until we generate an output.
This architecture is commonly called a **"multilayer perceptron"**.

$$ \boldsymbol{h}_1 = \phi(W_1\boldsymbol{x} + b_1) $$
$$ \boldsymbol{h}_2 = \phi(W_2\boldsymbol{h}_1 + b_2) $$
$$...$$
$$ \boldsymbol{h}_n = \phi(W_n\boldsymbol{h}_{n-1} + b_n) $$

 * Each layer has its own parameters $W_i$ and $b_i$.
 * Its output is fed through an activation function, denoted $\phi$ for the hidden layers.
 * On top of the topmost hidden layer, for classification, we'll stick with the softmax activation in the output layer:

$$ \hat{y} = \mbox{softmax}(W_y \boldsymbol{h}_n + b_y)$$

Graphically, a multilayer perceptron could be depicted like this:

![](https://github.com/zackchase/mxnet-the-straight-dope/blob/master/img/multilayer-perceptron.png?raw=true)

In [None]:
## Activation functions
function leaky_relu(x, alpha=0.2)
    pos = max(0,x)          # positive part passes through unchanged
    neg = min(0,x) * alpha  # negative part is scaled by alpha
    return pos + neg
end
xs=[-5.0:0.1:5;]
plot(xs,relu.(xs))
plot!(xs,leaky_relu.(xs))


## Remark
 * It's easy to design a hidden node that does arbitrary computation, say logical operations.
 * Any continuous function on a compact domain can be approximated arbitrarily well by a single-hidden-layer neural network with enough hidden units.
 * Actually learning the weights that realise such a function is the hard part (it might be easier for deeper networks).
 * The choice of architecture is tricky and something of a craft:
  * a fully connected layer has many parameters, which motivates special-purpose layers such as convolutional layers;
  * many architectures are inspired by our brain, e.g. the visual cortex.
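
To illustrate the first point, a hidden layer with two ReLU units and hand-picked weights computes XOR, a function that no purely linear model of the two inputs can represent. The weights below are chosen by hand for this sketch (plain Julia, hypothetical names), not learned:

```julia
relu_act(v) = max(v, 0.0)

# Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = [1.0 1.0;
      1.0 1.0]
b1 = [0.0, -1.0]
# Output layer: out = h1 - 2*h2
w2 = [1.0, -2.0]

xor_net(x) = sum(w2 .* relu_act.(W1 * x .+ b1))

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0])
    println(x, " -> ", xor_net(x))
end
```

The second hidden unit only switches on when both inputs are active, and the output layer uses it to cancel the contribution of the first unit.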
  


## Data 

In [None]:
# Load data
xtrn, ytrn, xtst, ytst = mnist()
dtrn = minibatch(xtrn, ytrn, 100, xtype=atype);
dtst = minibatch(xtst, ytst, 100, xtype=atype);

## Model

In [None]:
function initweights(d, scale=0.01; hidden=[2], atype=Array{Float32})
    model = Vector{Any}(2 * length(hidden))
    X = d
    for k = 1:length(hidden)
        H = hidden[k]
        model[2k - 1] = scale * randn(H, X) 
        model[2k]     = scale * randn(H, 1)
        X = H
    end
    return map(atype, model)
end

We can define the function `initmodel` with the desired parameters. The variable `hidden` contains the output sizes for each of the layers, and `num_inputs` is the size of the input variable `x` (in this case $x\in\mathbb{R}^{784}$). 

In [None]:
function initmodel(atype;num_inputs=784,num_hidden=256,num_outputs=10)
    return initweights(num_inputs,hidden=[num_hidden,num_hidden,num_outputs]; atype=atype);
end
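
With the default sizes this gives a 784-256-256-10 network. Counting one weight matrix and one bias vector per layer, as `initweights` allocates them, yields the number of trainable parameters. A sketch in plain Julia (`count_params` is a hypothetical helper):

```julia
# Number of parameters of a fully connected network with input size d and
# layer output sizes `hidden` (one H×X weight matrix and one H×1 bias per layer)
function count_params(d, hidden)
    total = 0
    X = d
    for H in hidden
        total += H * X + H
        X = H
    end
    return total
end

println(count_params(784, [256, 256, 10]))   # 269322
```

For comparison, the single-layer softmax model from the previous section has $784\cdot 10+10=7850$ parameters.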

In [None]:
function predict(w, x)
    x = mat(x)
    for i=1:2:length(w) - 2
        x = relu.(w[i] * x .+ w[i+1]) # weights (odd entries of w) and biases (even entries) alternate
    end
    return w[end - 1]*x .+ w[end]
end

Let's test the predict function to make sure everything works fine:

In [None]:
for (x, y) in dtrn
    display(predict(initmodel(atype), x))
    break 
end

## Training procedure

In [None]:
loss(w, x, ygold, predict) = nll(predict(w, x), ygold);
lossgradient = grad(loss); # AutoGrad derives the backward pass, so we don't code backpropagation by hand

function train(w, dtrn, optim, predict; epochs=10)
    for epoch = 1:epochs
        for (x, y) in dtrn
            g = lossgradient(w, x, y, predict)
            update!(w, g, optim) ## generic training loop: the update rule can be swapped by passing a different optimizer
        end
    end
end

## Optimizer (SGD) and reporting

In [None]:
optim(w; lr=0.01) = optimizers(w, Sgd;  lr=lr);

In [None]:
function report(epoch, w, dtrn, dtst, predict)
    println((:epoch, epoch, :trn, accuracy(w, dtrn, predict), :tst, accuracy(w, dtst, predict)))
end

## Train the Model

In [None]:
w   = initmodel(atype);
opt = optim(w, lr=1e-1);
fast=true
nepochs=2
if fast
    train(w, dtrn, opt, predict; epochs=nepochs)
else
    for epoch = 1:nepochs
        train(w, dtrn, opt, predict, epochs=1)
        report(epoch, w, dtrn, dtst, predict)
    end
end

# Beyond basic Neural Nets and tricks of the trade

 * "Better" optimisers, see a nice [blog post](http://sebastianruder.com/optimizing-gradient-descent/):
  * Adam reduces noise by taking an exponentially weighted moving average of the gradients and scales each component by an estimate of its root mean square.
  * "Training in parallel", e.g. elastic averaging stochastic gradient descent.
 * Initialisation of neural nets, e.g. Xavier initialisation (variance inversely proportional to the number of connections of a neuron).
 * Regularisation against overfitting: perturbing the input with noise (a slight change to an image should not change its label) or randomly disabling neurons (dropout).
 * Transfer learning: reuse a neural network trained on one data set for another task:
  * use it as part of the architecture and retrain the new components online;
  * a neural network trained for classifying cats and dogs can help to detect skin cancer, see this [Nature publication](https://www.nature.com/articles/nature21056).
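
As an illustration of the optimiser bullet, here is a minimal scalar sketch of the Adam update applied to the toy function $f(x)=2x^2$ from the beginning of these notes. It is written in plain Julia with a hand-coded gradient; the function name `adam_minimize` is hypothetical, the decay rates are the commonly quoted defaults, the learning rate 0.1 is picked for this toy problem, and the code is not Knet's `Adam` implementation.

```julia
# Adam on a scalar parameter: exponentially decaying averages of the gradient (m)
# and of the squared gradient (v), each corrected for its zero initialisation.
function adam_minimize(gradfn, x0; lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500)
    x, m, v = x0, 0.0, 0.0
    for t in 1:steps
        g = gradfn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g^2
        mhat = m / (1 - beta1^t)          # bias-corrected first moment
        vhat = v / (1 - beta2^t)          # bias-corrected second moment
        x -= lr * mhat / (sqrt(vhat) + eps)
    end
    return x
end

xmin = adam_minimize(x -> 4x, 4.0)        # 4x is the gradient of f(x) = 2x^2
println("x after 500 Adam steps: ", xmin)
```

Because the step is divided by the running root mean square of the gradient, the first steps have a size close to `lr` regardless of how large the gradient is.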