# Introduction

In this section, we consider one of the simplest problems encountered in machine learning. We have an unknown function $x\rightarrow f(x) \in \mathbb{R}$ and a set of points $\mathcal{S}=\{\left(x_i, f\left(x_i\right)\right), x_i \in \mathbb{R}\}$ sampled from our function $f$. The goal is to find an approximation of $f$ that is good enough for the real-world problem at hand. We will see how neural networks, with the help of deep learning, can handle such problems. A learning process in which the correct answer $f(x_i)$ is known for each input $x_i$ is called supervised learning, as it resembles a teacher who knows the answer and guides the student to it.

## 1) Single neuron

### Theory

We begin our exploration of neural networks with their fundamental unit: a simple neuron with one input and one output. The neuron has an associated weight $W$ and is fed an input $x$ taken from the samples $\mathcal{S}$. From these, the neuron outputs

\begin{equation*}
    y = Wx.
\end{equation*}

In supervised learning, the right answer $\xi$ is known. Hence, we can quantify the difference between $\xi$ and the prediction $y$ of the neuron

\begin{equation*}
    \delta = y-\xi = Wx-\xi
\end{equation*}

One way to quantify the error of the neuron is the squared deviation

\begin{equation*}
    \epsilon = \left(y-\xi \right)^2 = \left(W x-\xi \right)^2
\end{equation*}

The goal in the learning process is to minimize $\epsilon$ with respect to the weight $W$. For this, gradient descent is used, where we take the slope of the function $W\rightarrow\epsilon(W)$ to push the weight $W$ in the direction of the minimum of the function. Note that in the case of the squared deviation, the slope is given by

\begin{align}
    \frac{\partial \epsilon}{\partial W} &= \frac{\partial}{\partial W} \left( Wx - \xi\right)^2 \\
                                        &= 2\left( Wx - \xi\right)x \\
                                        & \propto \delta x
\end{align}
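As a quick sanity check of this derivative, the analytic slope can be compared with a central finite difference. This is only a minimal sketch: the function names `sqerr` and `slope`, the step `h`, and the sample values of `W`, `x` and `ξ` are assumptions chosen for illustration, not part of the original notebook.

```julia
# Squared deviation and its analytic slope with respect to W
sqerr(W, x, ξ) = (W * x - ξ)^2
slope(W, x, ξ) = 2 * (W * x - ξ) * x

# Illustrative values (assumed, not taken from the notebook)
W, x, ξ = 0.3, 5.0, 2.0
h = 1e-6  # finite-difference step

# Central finite difference approximating ∂ϵ/∂W
numeric = (sqerr(W + h, x, ξ) - sqerr(W - h, x, ξ)) / (2h)
analytic = slope(W, x, ξ)

@show numeric analytic
```

The two values should agree up to rounding, since the error is quadratic in $W$.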

If the slope is too steep, we might push the weight too far and overshoot the minimum. To avoid this, we scale the step by a learning rate $\alpha$, a parameter of the learning process that is fixed before training (the constant factor 2 in the slope is absorbed into $\alpha$). Finally, we update the weight with the following formula

\begin{equation}
    W' = W-\alpha\delta x
\end{equation}
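As a short worked step, substituting the updated weight into the definition of $\delta$ shows how fast the error shrinks when the neuron is trained on a single pair $(x, \xi)$. The primed symbols $\delta'$ and $\epsilon'$, denoting the values after one update, are notation introduced here for illustration.

\begin{align*}
    \delta' &= W'x - \xi = \left(W - \alpha\delta x\right)x - \xi = \left(1-\alpha x^2\right)\delta \\
    \epsilon' &= \left(\delta'\right)^2 = \left(1-\alpha x^2\right)^2 \epsilon
\end{align*}

Each update therefore multiplies the error by the constant factor $\left(1-\alpha x^2\right)^2$, so the error decreases only if $0 < \alpha < 2/x^2$. For instance, with $x=5$ and $\alpha=0.01$ the factor is $\left(1-0.25\right)^2 = 0.5625$.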

Now, let's look at how to code it in Julia.

### Julia code

We define a single neuron as a new structure that encodes a weight, $W$. This weight must be updated, so the structure has to be mutable.

In [5]:
mutable struct singleNeuronModel
    W::Float64
end

We make this structure callable: applied to an input, it returns the prediction of the neuron, that is, the product of the weight and the input.

In [6]:
(m::singleNeuronModel)(x) = m.W * x

Let's instantiate a single neuron with a random weight.

In [20]:
model = singleNeuronModel(rand())

singleNeuronModel(0.297172742555168)

Let's test this neuron with some dummy data and compute the error.

In [21]:
x = 5
ξ = 2
δ = model(x)- ξ
ϵ = δ^2
println(model(x))
println(δ)
println(ϵ)

1.48586371277584
-0.5141362872241599
0.26433612184064387


Of course, as the weight is random, so is the initial error. We now want to train our neuron to reduce the error. To do this, we update the weight iteratively with gradient descent. In the train! function, we denote the input $x\equiv x_\mathrm{train}$ and the right answer $\xi\equiv y_\mathrm{train}$, as these are the training data from which the neuron learns.

In [22]:
function train!(model, xtrain, ytrain, α, iteration)
    for i in 1:iteration
        y = model(xtrain)             # prediction of the model
        δ = y - ytrain                # difference between the prediction and the expected answer
        ϵ = δ^2                       # squared error
        model.W -= δ * xtrain * α     # gradient-descent update of the weight
        @show ϵ
    end
end

train! (generic function with 1 method)

This function takes as arguments our neuron, an input and the corresponding answer but also a learning rate $\alpha$ and the number of times that we want the weight to be updated. Let's see how the error evolves at each iteration.

In [23]:
train!(model, x, ξ, 0.01, 500)

ϵ = 0.26433612184064387
ϵ = 0.14868906853536223
ϵ = 0.08363760105114122
ϵ = 0.04704615059126694
ϵ = 0.026463459707587688
ϵ = 0.014885696085518062
ϵ = 0.00837320404810389
ϵ = 0.004709927277058445
ϵ = 0.002649334093345364
ϵ = 0.0014902504275067844
ϵ = 0.0008382658654725663
ϵ = 0.00047152454932831365
ϵ = 0.00026523255899717647
ϵ = 0.00014919331443591176
ϵ = 8.392123937020036e-5
ϵ = 4.72056971457377e-5
ϵ = 2.65532046444786e-5
ϵ = 1.493617761252093e-5
ϵ = 8.401599907042379e-6
ϵ = 4.7258999477115795e-6
ϵ = 2.6583187205874014e-6
ϵ = 1.4953042803304133e-6
ϵ = 8.411086576858576e-7
ϵ = 4.731236199482949e-7
ϵ = 2.6613203622103043e-7
ϵ = 1.4969927037437257e-7
ϵ = 8.4205839585649e-8
ϵ = 4.736578476692756e-8
ϵ = 2.6643253931324267e-8
ϵ = 1.4986830336369898e-8
ϵ = 8.430092064208068e-9
ϵ = 4.741926786101748e-9
ϵ = 2.667333817193701e-9
ϵ = 1.5003752721757573e-9
ϵ = 8.439610905924128e-10
ϵ = 4.747281134582323e-10
ϵ = 2.670345638184414e-10
ϵ = 1.5020694215059464e-10
ϵ = 8.44914049617505e-11
ϵ = 4.7526415

The error decreases as expected! Of course, a single neuron is unable to approximate complex functions. To get more flexibility, we need more neurons and some sort of non-linearity.
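Before moving on, the role of the learning rate in the run above can be probed with a small experiment. The sketch below is not part of the original notebook: `final_error` is a hypothetical helper that applies the same update rule as train! but returns the last squared error instead of printing every step, and the starting weight `W0 = 0.3`, the 20 iterations and the list of learning rates are arbitrary choices made for illustration.

```julia
# Hypothetical helper (assumption, not in the original notebook):
# same update rule as train!, but returns the final squared error.
function final_error(W, x, ξ, α, iterations)
    for _ in 1:iterations
        δ = W * x - ξ      # difference between prediction and target
        W -= α * δ * x     # gradient-descent update
    end
    return (W * x - ξ)^2   # squared error after the last update
end

# Same dummy data as above, with an arbitrary fixed starting weight
x, ξ, W0 = 5.0, 2.0, 0.3

# Compare several learning rates after 20 updates
for α in (0.01, 0.04, 0.07, 0.09)
    println("α = ", α, "  final ϵ = ", final_error(W0, x, ξ, α, 20))
end
```

With $x=5$, learning rates below $2/x^2 = 0.08$ should reduce the error, while $\alpha = 0.09$ overshoots the minimum at every step and makes the error grow.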