# Neurons

We have now reached one of the main subjects of the course: **neurons**.

A neuron (by which we mean an "artificial neuron") is a caricature of a real biological neuron: it has many inputs $x_1, \ldots, x_n$, and a single output, $y$.

A neuron can thus be represented as a function $f$ with inputs $x_1, \ldots, x_n$ and single output $y$. Usually we take the function to be

$$f_{w,b}(x_1, \ldots, x_n) = \sigma(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b).$$

Notice that this is just a more general version of the functions that we have been studying until now. The $w_1, \ldots, w_n$ are called **weights**, and $b$ is known as the **bias**.

Since this is a lot to write, and we don't want to write code with lots of different parameters, we collect all of the $x_1, \ldots, x_n$ into a **vector**, and the $w$s into another vector:

$$
\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix};
\qquad
\mathbf{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}
$$

We thus have

$$f_{\mathbf{w}, b}(\mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b),$$

where the definition of the **dot product** (or scalar product, or inner product) $\mathbf{w} \cdot \mathbf{x}$ is

$$ \mathbf{w} \cdot \mathbf{x} := \sum_i w_i x_i. $$

In Julia, this becomes

In [1]:
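using LinearAlgebra   # provides the dot product ⋅ (needed on Julia 0.7 and later)

σ(x) = 1 / (1 + exp(-x))   # sigmoid (logistic) function

f(x, w, b) = σ(w ⋅ x + b)  # a single neuron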

f (generic function with 1 method)

Note that Julia's syntax lets the code mirror the mathematics very closely. (We could even use bold face if we felt like it.) Here we have implicitly assumed that the user will pass in vectors `x` and `w` of the same length to the function `f`.

In fact, the function `f` also works with numbers, in which case it is the same as `σ(w*x + b)`:

In [2]:
f(3, 4, 5)

0.9999999586006244

In [3]:
σ(3*4 + 5)

0.9999999586006244
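
As a quick sanity check of the vector form, the following sketch evaluates the neuron on a small made-up input. The names `w_demo`, `x_demo` and `b_demo` are illustrative assumptions, with values chosen so that $\mathbf{w} \cdot \mathbf{x} + b = 0$. It assumes Julia 1.x, where `⋅` comes from the `LinearAlgebra` standard library.

```julia
using LinearAlgebra   # provides ⋅ (dot) on Julia 0.7 and later

σ(x) = 1 / (1 + exp(-x))
f(x, w, b) = σ(w ⋅ x + b)

# Made-up values, for illustration only
w_demo = [1.0, -2.0]
x_demo = [3.0, 1.0]
b_demo = -1.0

z = w_demo ⋅ x_demo + b_demo                  # 1*3 + (-2)*1 - 1 = 0
@assert z == sum(w_demo .* x_demo) + b_demo   # the dot product as an explicit sum

out = f(x_demo, w_demo, b_demo)               # σ(0) = 0.5
```

An output of $0.5$ means that the neuron is exactly undecided for this input.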

## Which weights?

Once we have chosen particular values for the weights $w_i$ and the bias $b$, the function `f` maps vectors of length $n$ (in $\mathbb{R}^n$) to numbers which, due to the definition of $\sigma$, must lie between $0$ and $1$. How do we choose these values?

We will do so as before: we must define a cost function $C(\mathbf{w}, b)$ and minimize this cost function.

The cost function is obtained as follows. We wish to design our function $f$ to model some particular relationship between the input data and the output.
As a running example, we will use our images of fruit. We could try to design a function $f$ to do the following: take in a picture of a fruit, and output a $0$ if it is an apple, and a $1$ if it is a banana.

In [6]:
using DataFrames

DataFrame(Dict(:A=>[1], :B=>[3]))

| Row | A | B |
|---|---|---|
| 1 | 1 | 3 |


We will work directly with the images later. To start with, we will use summary statistics derived from them.

In [8]:
# Pkg.add("TextParse")
using TextParse

cols, colnames = TextParse.csvread("apples.dat",'\t')
apples = DataFrame(Dict(name=>col for (name, col) in zip(colnames, cols)))
cols, colnames = TextParse.csvread("bananas.dat",'\t')
bananas = DataFrame(Dict(name=>col for (name, col) in zip(colnames, cols)))


| Row | blue | green | height | red | width |
|---|---|---|---|---|---|
| 1 | 0.20750413943355192 | 0.500662309368192 | 98 | 0.5835067538126356 | 99 |
| 2 | 0.18687105966305895 | 0.514868541076373 | 50 | 0.6096658836445241 | 99 |
| 3 | 0.18750355493040088 | 0.5157592947018083 | 52 | 0.61001222127424 | 99 |
| 4 | 0.23931596284375803 | 0.4929391278094967 | 99 | 0.5680333392636676 | 69 |
| 5 | 0.1833291091603121 | 0.5148498882514222 | 51 | 0.6097662880271572 | 99 |
| 6 | 0.1833736532775401 | 0.5146088965412025 | 53 | 0.6080886893475433 | 99 |
| 7 | 0.18620604259534496 | 0.5154866998805462 | 53 | 0.6086763977507786 | 99 |
| 8 | 0.1885818309426686 | 0.5164860873506747 | 53 | 0.6092775365179625 | 99 |
| 9 | 0.24194052691261872 | 0.4945007761667278 | 99 | 0.5690986557452848 | 67 |
| 10 | 0.18911599753288377 | 0.5161825666623635 | 54 | 0.610284435646972 | 99 |


These give us `DataFrame`s with the data from different images.
We will just use two features for each image, taken from two of the columns (selected by `col1` and `col2` in the code below), so that each data point $\mathbf{x}^{(i)}$ is a 2-dimensional vector. We also have the label of each point as an apple or a banana, which we call $y^{(i)}$.

Our neuron will take a point in the two-dimensional plane as argument and try to **classify** it as an apple ($0$) or a banana ($1$). To do so, it must "**learn**" the correct values of the parameters $\mathbf{w}$ and $b$. Note that *in general we cannot expect that this is actually possible*. If it struggles, we may need a more complicated function; see later.

Let's start by putting all the data in a single Julia vector `x` (of which each entry is itself a vector), and the labels in a single vector `y`. [Columns with large values, such as a height in pixels, may need to be normalized so that they do not dominate; this is usually handled by rescaling each feature to a comparable range before training.]

In [9]:
col1 = 3
col2 = 4

# one two-component feature vector per image
x_apples  = [ [apples[i, col1], apples[i, col2]] for i in 1:size(apples, 1) ]
x_bananas = [ [bananas[i, col1], bananas[i, col2]] for i in 1:size(bananas, 1) ]

x = vcat(x_apples, x_bananas)

# labels: 0 for apples, 1 for bananas
y = vcat( zeros(length(x_apples)), ones(length(x_bananas)) );
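
A feature such as a height in pixels can be far larger than a colour fraction, and one common remedy is to rescale each feature before training. The sketch below standardizes each component to zero mean and unit standard deviation. The helper name `standardize` and the toy data are hypothetical, and it assumes Julia 1.x for the `Statistics` standard library.

```julia
using Statistics   # mean and std (standard library on Julia 0.7 and later)

# Hypothetical helper: rescale every component of a vector of feature vectors
# to zero mean and unit standard deviation.
# Assumes no component is constant (otherwise its std is zero).
function standardize(points)
    n = length(first(points))
    μ = [mean([p[j] for p in points]) for j in 1:n]
    s = [std([p[j] for p in points]) for j in 1:n]
    return [(p .- μ) ./ s for p in points]
end

# Toy data: (height in pixels, colour fraction); not taken from the data files
toy = [[98.0, 0.58], [50.0, 0.61], [52.0, 0.61], [99.0, 0.57]]
toy_scaled = standardize(toy)
```

After rescaling, both components vary on a comparable scale, so neither dominates $\mathbf{w} \cdot \mathbf{x}$ purely because of its units.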

The cost function, however, is defined for one data point at a time: it accepts a single two-component vector $\mathbf{x}$ and the corresponding label $y$.

In [10]:
# Squared error for a single data point; params = [w1, w2, b]
C(params, x, y) = ( w = params[1:2]; b = params[3]; (y - f(x, w, b))^2 )

C (generic function with 1 method)
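
For reference, the gradient of this cost can also be written out by hand. Writing $z = \mathbf{w} \cdot \mathbf{x} + b$ and using $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, the chain rule gives

$$
\frac{\partial C}{\partial w_j} = -2\,\bigl(y - \sigma(z)\bigr)\,\sigma(z)\bigl(1 - \sigma(z)\bigr)\,x_j,
\qquad
\frac{\partial C}{\partial b} = -2\,\bigl(y - \sigma(z)\bigr)\,\sigma(z)\bigl(1 - \sigma(z)\bigr).
$$

Both expressions contain the factor $\sigma(z)\,(1 - \sigma(z))$, which is tiny when $|z|$ is large, so the gradient can be close to zero even when the prediction is wrong.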

We will examine data points one at a time, each time nudging the parameters in a direction that reduces the cost for that point.
We start with *random* parameter values.

In [11]:
using ForwardDiff

In [12]:
w = rand(2)
b = rand()

params = [w; b]

3-element Array{Float64,1}:
 0.366305
 0.221067
 0.599116

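Note that this is *stochastic* gradient descent: at each step we pick a single data point at random and follow the gradient of the cost for that point alone.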

In [13]:
function gradient_descent(C, x, y, params, N=1000)

    η = 0.01  # learning rate (step size)

    for i in 1:N

        which = rand(1:length(x))  # choose a random data point

        xx = x[which]
        yy = y[which]

        # gradient of the cost for this one point with respect to the parameters
        grad = ForwardDiff.gradient(ws -> C(ws, xx, yy), params)
        params -= η * grad  # step downhill
    end

    return params

end
    

gradient_descent (generic function with 2 methods)
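
To see the update rule in isolation, here is a minimal sketch that swaps automatic differentiation for a hand-coded gradient and runs on made-up, well-separated points rather than the fruit data. The names `grad_C` and `sgd`, the step size and the number of steps are all assumptions for illustration; Julia 1.x is assumed.

```julia
using LinearAlgebra, Random

σ(z) = 1 / (1 + exp(-z))

# Gradient of (y - σ(w⋅x + b))^2 with respect to [w; b], written out by hand
function grad_C(params, x, y)
    w, b = params[1:2], params[3]
    p = σ(w ⋅ x + b)
    common = -2 * (y - p) * p * (1 - p)
    return [common .* x; common]
end

# Stochastic gradient descent: one randomly chosen data point per step
function sgd(xs, ys, params; η=0.5, N=5000)
    for _ in 1:N
        k = rand(1:length(xs))
        params = params - η * grad_C(params, xs[k], ys[k])
    end
    return params
end

Random.seed!(1)  # fixed seed so that the run is repeatable

# Made-up toy points (not the fruit data): label 0 near the origin, 1 near (1, 1)
xs = [[0.2, 0.2], [0.1, 0.3], [0.3, 0.1], [0.8, 0.8], [0.9, 0.7], [0.7, 0.9]]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

trained = sgd(xs, ys, zeros(3))
preds = [σ(trained[1:2] ⋅ x + trained[3]) for x in xs]
```

Starting from zero parameters keeps every output at $0.5$ initially, so the sigmoid is far from saturated and the early gradients are not vanishingly small.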

In [14]:
w = rand(2)
b = rand()

params = [w; b] ;

@show params
@time params = gradient_descent(C, x, y, params, 1000000)
@show params

params = [0.739475, 0.423584, 0.464474]
 10.648472 seconds (11.47 M allocations: 865.158 MiB, 1.42% gc time)
params = [0.739475, 0.423584, 0.464474]


3-element Array{Float64,1}:
 0.739475
 0.423584
 0.464474

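We can check whether the gradient of the cost (here for the first data point) is close to zero, as it should be at a minimum. Note that a vanishing gradient is necessary but not sufficient: it also occurs wherever $\sigma$ saturates, i.e. when $|\mathbf{w} \cdot \mathbf{x} + b|$ is large.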

In [15]:
ForwardDiff.gradient(ws -> C(ws, x[1], y[1]), params)

3-element Array{Float64,1}:
 1.14236e-27
 6.36503e-30
 1.26928e-29

We can evaluate the trained neuron on individual data points and compare the output with the corresponding label (recall that `x[1]` is an apple, with label $0$):

In [16]:
f(x[900], params[1:2], params[3])

1.0

In [17]:
f(x[1], params[1:2], params[3])

1.0

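We see that *with sufficient training*, the single neuron is approximately able to learn the function for most of the data, but not all. The following computes the largest value of the signed difference $y - f(\mathbf{x})$ over the data set. It only detects outputs that are too small; applying `abs.` to the differences before taking the maximum would give the worst-case error in both directions: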

In [18]:
maximum(y .- f.(x, [params[1:2]], params[3]))

9.595447658661271e-9
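
A complementary check, sketched below with made-up parameters and points, reports the fraction of points classified correctly when the output is thresholded at $0.5$, alongside the worst-case absolute error. The helper name `accuracy` and all values are hypothetical; Julia 1.x is assumed.

```julia
using LinearAlgebra

σ(z) = 1 / (1 + exp(-z))
f(x, w, b) = σ(w ⋅ x + b)

# Hypothetical helper: fraction of points whose thresholded output matches the label
accuracy(xs, ys, w, b) =
    count((f(x, w, b) > 0.5) == (y == 1) for (x, y) in zip(xs, ys)) / length(xs)

# Made-up parameters and points, for illustration only
w_demo, b_demo = [4.0, 4.0], -4.0
xs_demo = [[0.2, 0.2], [0.8, 0.8], [0.9, 0.3]]
ys_demo = [0.0, 1.0, 0.0]

acc = accuracy(xs_demo, ys_demo, w_demo, b_demo)   # two of the three points are correct

# Worst-case error in either direction
worst = maximum(abs.(ys_demo .- f.(xs_demo, Ref(w_demo), b_demo)))
```

Here the third point has label $0$ but lies on the "banana" side of the boundary, so it is counted as misclassified and dominates the worst-case error.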

In [19]:
using Plots; gr()

Plots.GRBackend()

In [20]:
scatter(first.(x_apples), last.(x_apples), m=:cross)
scatter!(first.(x_bananas), last.(x_bananas))

Let's draw the function that the network has learned, together with the data:

In [21]:
# x1, x2 are grid coordinates (named to avoid confusion with the data vectors x and y)
heatmap(0:0.01:1, 0:0.01:1, (x1, x2) -> f([x1, x2], params[1:2], params[3]))

scatter!(first.(x_apples), last.(x_apples), m=:cross)
scatter!(first.(x_bananas), last.(x_bananas))

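We see that the single neuron has *learnt* to separate the data with a boundary that is a straight line, i.e. a hyperplane in two dimensions. This restriction is built into the form of $f$: the output depends on $\mathbf{x}$ only through the linear combination $\mathbf{w} \cdot \mathbf{x} + b$.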
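
The boundary can be written down explicitly. Since $\sigma(z) = 1/2$ exactly when $z = 0$, the set of points at which the neuron is undecided between the two classes is

$$
\{\, \mathbf{x} : f_{\mathbf{w}, b}(\mathbf{x}) = \tfrac{1}{2} \,\}
= \{\, \mathbf{x} : \mathbf{w} \cdot \mathbf{x} + b = 0 \,\},
$$

which in two dimensions is the straight line $x_2 = -(w_1 x_1 + b) / w_2$ (assuming $w_2 \neq 0$).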