# Semester Project: Do It Yourself

## Introduction

The goal of this project is to:
* Write a machine learning algorithm from scratch
* Apply this algorithm to a training set, in this case MNIST[1]
* Examine how well the algorithm functions

I expanded this scope a bit by also creating my own <i>MNIST-style</i> training set and applying the algorithm to it.

## The Network

Neural net: simple to implement, with low reported error rates (< 2 %) on the MNIST set [1].

<img src="media/img/neuralnet.png" alt="drawing" width="60%"/>

* sigmoid, tanh, ReLU or linear activation

```C
typedef struct {
    PyObject_HEAD

    Layer *layers; /* array of layer objects */

    int numLayers; /* number of layers */
    int *numNeurons; /* number of neurons in each layer */

    double (*activationFunc)(double); /* activation function */
    double (*activationFuncGradient)(double); /* gradient of activation function */

    int softmax;    /* enable softmax? */
} NetworkObject;
```

```C
typedef struct layerStruct {
    Node *nodes; /* Nodes in layer */

    void (*initLayer)(struct layerStruct *self,
                      int numberOfNodes,
                      int numberOfPreviousNodes);
} Layer;
```

```C
typedef struct nodeStruct {
    double o; /* Node output */
    double z; /* Net Input */

    double *w; /* Input weights */
    double *grad_w; /* Input weight gradients */
    double b; /* Input bias */
    double grad_b; /* Input bias gradient */

    void (*initNode)(struct nodeStruct *self, int numberOfPreviousNodes);
} Node;
```

```C
for ( sampleInBatch = 0; sampleInBatch < batchsize; sampleInBatch++ )
{
    sample = sampleInBatch + batch * batchsize; // batch offset
    ...
    forwardfeed(self);
    if (self->softmax) softmax(self);
    backpropagation(self, outputs[sample]);
}
```

## Forward Feed
```C
void forwardfeed(NetworkObject *self)
```

$
\begin{align*}
o_j^l = \sigma \left( \sum_{k=0}^{n_{l-1} - 1} o_k^{l-1} w_{jk}^l + b_j^l \right)
= \sigma \left( \mathbf{o^{l-1}} \cdot \mathbf{w_j^l} + b_j^l \right)
\end{align*}
$

```C
int l, j;
for ( l = 1; l < self->numLayers; l++ )
{
    for ( j = 0; j < self->numNeurons[l]; j++ )
    {
        /* net input */            
        self->layers[l].z[j] = dot(self->layers[l].w[j],
                                   self->layers[l-1].o,
                                   self->numNeurons[l-1]);
        self->layers[l].z[j] += self->layers[l].b[j];
        self->layers[l].o[j] = self->activationFunc(self->layers[l].z[j]);
    }
}
```

```C
double (*activationFunc)(double);
```

<div style="column-count: 3;">
<div style="width:10%; display: inline-block;">
sigmoid:
</div>
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma (x) = \frac{1}{1 + e^{-x}}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma'(x) = \sigma(x) ( 1 - \sigma(x))
\end{align*}
$
</div>
</div>

<div style="column-count: 2;">
<div style="width:10%; display: inline-block;">
tanh:
</div>
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma (x) = \frac{1 + \tanh(x)}{2}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma'(x) = \frac{1 - \tanh^2(x)}{2} = 2 \sigma(x) ( 1 - \sigma(x) )
\end{align*}
$
</div>
</div>

<div style="column-count: 2;">
<div style="width:10%; display: inline-block;">
ReLU:
</div>
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma (x) =
\begin{cases}
    x, & \text{if}\ x > 0 \\
    0, & \text{otherwise}
\end{cases}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma'(x) =
\begin{cases}
    1, & \text{if}\ x > 0 \\
    0, & \text{otherwise}
\end{cases}
\end{align*}
$
</div>
</div>

<div style="column-count: 2;">
<div style="width:10%; display: inline-block;">
linear:
</div>
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma (x) = x
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\sigma'(x) = 1
\end{align*}
$
</div>
</div>

## Squared Error Loss

The squared error loss is defined as

$
\begin{align*}
E = \frac{1}{2} \sum_k ( y_k - o_k )^2 
\end{align*}
$

where $\mathbf{y}$ is our target and $\mathbf{o}$ is our predicted output.

Its derivative with respect to an output $o_i$ follows directly:

$
\begin{align*}
\frac{\partial E}{\partial o_i} = o_i - y_i
\end{align*}
$
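As a small illustration, the loss and its gradient can be computed on plain arrays. The function names and the flat-array layout are assumptions made for this sketch.

```C
/* Hypothetical helper: E = 1/2 * sum_k (y_k - o_k)^2 */
double squared_error(const double *y, const double *o, int n)
{
    double e = 0.0;
    int k;
    for ( k = 0; k < n; k++ )
    {
        double d = y[k] - o[k];
        e += d * d;
    }
    return 0.5 * e;
}

/* Hypothetical helper: grad[i] = dE/do_i = o_i - y_i */
void squared_error_gradient(const double *y, const double *o, int n,
                            double *grad)
{
    int i;
    for ( i = 0; i < n; i++ )
    {
        grad[i] = o[i] - y[i];
    }
}
```

For the target `{0, 1, 0}` and the prediction `{0.2, 0.7, 0.1}` the loss is 0.5 · (0.04 + 0.09 + 0.01) = 0.07.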

## Softmax and Cross Entropy Loss

Softmax normalizes an unnormalized vector into a probability distribution.

$
\begin{align*}
p_j = \frac{e^{o_j}}{\sum_{k=0}^{n-1} e^{o_k}}
\end{align*}
$
for $j = 0,\dots,n-1$, where $n$ is the number of neurons in the output layer.

```C
void softmax(NetworkObject *self)
{
    ...
    for ( j = 0; j < self->numNeurons[l]; j++)
    {
        z += exp(self->layers[l].o[j]);
    }

    for ( j = 0; j < self->numNeurons[l]; j++)
    {
        self->layers[l].o[j] = exp(self->layers[l].o[j]) / z;
    }
}
```

### Entropy

In information theory, entropy is the minimum average number of bits needed to encode samples drawn from a known distribution (probability mass function) $y_i$, when the logarithm is taken to base 2.

$
\begin{align*}
H(y) = -\sum_i y_i \log(y_i)
\end{align*}
$

### Cross Entropy

Cross entropy is the average number of bits needed to encode samples drawn from $y_i$ using a code built for a different, suboptimal distribution $p_i$.

$
\begin{align*}
H(y,p) = - \sum_i y_i \log(p_i)
\end{align*}
$

We want to <b>minimize</b> this <b>loss function</b> so as to reach the optimal encoding $p_i = y_i$.

### Gradient of Cross Entropy Loss

$
\begin{align*}
\frac{\partial H}{\partial o_i} &= \sum_k \frac{\partial H}{\partial p_k} \cdot \frac{\partial p_k}{\partial o_i} \\
&= -\sum_k y_k \frac{1}{p_k} \cdot \frac{\partial p_k}{\partial o_i}
\end{align*}
$

Using the derivative

$
\begin{align*}
\frac{\partial p_k}{\partial o_i} = p_k ( \delta_{ki} - p_i )
\end{align*}
$

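of the softmax function, and the fact that the target distribution sums to one ($\sum_k y_k = 1$), we get

$
\begin{align*}
\frac{\partial H}{\partial o_i} &= -\sum_k y_k ( \delta_{ki} - p_i ) \\
&= p_i \sum_k y_k - y_i \\
&= p_i - y_i
\end{align*}
$

which has the same form as the squared error gradient $\frac{\partial E}{\partial o_i} = o_i - y_i$, with the softmax output $p_i$ taking the place of $o_i$.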

## Backpropagation

<div style="column-count: 2;">
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\frac{\partial E}{\partial w^l_{jk}} &= \color{red}{\frac{\partial E}{\partial o^l_j} \frac{\partial o^l_j}{\partial z^l_j}} \color{blue}{\frac{\partial z^l_j}{\partial w^l_{jk}}}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\frac{\partial E}{\partial b^l_{j}} &= \color{red}{\frac{\partial E}{\partial o^l_j} \frac{\partial o^l_j}{\partial z^l_j}} \color{blue}{\frac{\partial z^l_j}{\partial b^l_{j}}}
\end{align*}
$
</div>
</div>

<div style="column-count: 2;">
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\color{blue}{\frac{\partial z^l_j}{\partial w^l_{jk}} = o^{l-1}_k}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\color{blue}{\frac{\partial z^l_j}{\partial b^l_{j}} = 1}
\end{align*}
$
</div>
</div>

$
\begin{align*}
\color{red}{\delta^l_j = \frac{\partial E}{\partial o^l_j} \frac{\partial o^l_j}{\partial z^l_j}} = \sigma'(z^l_j) \frac{\partial E}{\partial o^l_j} =
\begin{cases}
    \sigma'(z^l_j) (o^l_j - y_j), & \text{if}\ l = N \\
    \sigma'(z^l_j) \sum_i \delta^{l+1}_i w^{l+1}_{ij}, & \text{if}\ l < N
\end{cases}
\end{align*}
$

Here $N$ is the index of the output layer, and the $l = N$ case assumes the squared error loss.

<div style="column-count: 2;">
<div style="width:40%; display: inline-block;">
$
\begin{align*}
\Delta w^l_{jk} &= - \eta \color{red}{\delta^l_j} \color{blue}{o^{l-1}_k}
\end{align*}
$
</div>

<div style="width:40%; display: inline-block;">
$
\begin{align*}
\Delta b^l_j &= - \eta \color{red}{\delta^l_j}
\end{align*}
$
</div>
</div>
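The two update rules translate into one pass over the weights of a layer. The function below is a sketch under assumptions: the name `update_layer` and the row-major flat weight array are hypothetical, and the deltas are taken as already computed.

```C
/* Hypothetical helper: one gradient-descent step for a single layer.
   w is a row-major (n x nPrev) matrix, delta holds delta_j for the
   n nodes of this layer, prev holds the outputs of the previous layer,
   and eta is the learning rate. */
void update_layer(double *w, double *b, const double *delta,
                  const double *prev, int n, int nPrev, double eta)
{
    int j, k;
    for ( j = 0; j < n; j++ )
    {
        for ( k = 0; k < nPrev; k++ )
        {
            /* w_jk += -eta * delta_j * o_k of the previous layer */
            w[j * nPrev + k] -= eta * delta[j] * prev[k];
        }
        /* b_j += -eta * delta_j */
        b[j] -= eta * delta[j];
    }
}
```

With mini-batches, the products `delta[j] * prev[k]` would typically be accumulated over the batch first and applied once per batch.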

$
\begin{align*}
\delta^l_j = 
\begin{cases}
    \sigma'(z^l_j) (o^l_j - y_j), & \text{if}\ l = N \\
    \sigma'(z^l_j) \sum_i \delta^{l+1}_i w^{l+1}_{ij}, & \text{if}\ l < N
\end{cases}
\end{align*}
$

```C
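/* Recursively computes delta^l_j; output holds the target vector y. */
double delta(NetworkObject *self, int l, int j, double *output)
{
    if ( l < self->numLayers-1) {
        /* hidden layer: weighted sum of the deltas of layer l+1 */
        double sum = 0;
        int k;
        for ( k = 0; k < self->numNeurons[l+1]; k++ )
        {
            sum += delta(self, l+1, k, output) * self->layers[l+1].w[k][j];
        }
        /* NOTE: the gradient is evaluated at the stored output o, not at z,
           so activationFuncGradient is presumably written in terms of o */
        return sum * self->activationFuncGradient(self->layers[l].o[j]);
    } else {
        /* output layer: gradient of the squared error loss */
        return self->activationFuncGradient(self->layers[l].o[j])
                                            * (self->layers[l].o[j] - output[j]);
    }
}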
```
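The recursive `delta` above re-evaluates the deltas of all later layers on every call. One alternative is to compute the deltas once per layer, starting at the output, and store them. The sketch below is not the project's code: the function names and the flat row-major arrays are assumptions, and `grad[j]` is assumed to already hold the activation gradient of node `j`.

```C
/* Hypothetical helper: deltas of the output layer for squared error loss,
   delta_j = sigma'_j * (o_j - y_j). */
void output_deltas(const double *o, const double *y, const double *grad,
                   int n, double *delta)
{
    int j;
    for ( j = 0; j < n; j++ )
    {
        delta[j] = grad[j] * (o[j] - y[j]);
    }
}

/* Hypothetical helper: deltas of layer l from the stored deltas of
   layer l+1. wNext is row-major (nNext x n): wNext[i * n + j] is the
   weight from node j in layer l to node i in layer l+1. */
void hidden_deltas(const double *wNext, const double *deltaNext,
                   const double *grad, int n, int nNext, double *delta)
{
    int i, j;
    for ( j = 0; j < n; j++ )
    {
        double sum = 0.0;
        for ( i = 0; i < nNext; i++ )
        {
            sum += deltaNext[i] * wNext[i * n + j];
        }
        delta[j] = grad[j] * sum;
    }
}
```

A backward pass would call `output_deltas` once for the last layer and then `hidden_deltas` for each earlier layer in reverse order, so every delta is computed exactly once.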

## Literature

[1] Y. LeCun et al. The MNIST database of handwritten digits. Available at http://yann.lecun.com/exdb/mnist/ (Jan. 2019)