# 1. A Simple Neural Network

## (a)

Denote the weighted inputs to the hidden layer and the output layer by $z^{[1]}$ and $z^{[2]}$ respectively, and the activation of the hidden layer by $a^{[1]}$. For a sample $x= (x_1,x_2)$ we then have

$$\begin{align*}
z^{[1]}_j &= w^{[1]}_{0,j} + \sum_{i=1}^2 w^{[1]}_{i,j}x_i &\text{for } j\in \{1,2,3\}\\
a^{[1]}_j &= g(z^{[1]}_j)&\text{for } j\in \{1,2,3\}\\
z^{[2]} &= w^{[2]}_{0} + \sum_{j=1}^3 w^{[2]}_{j}a^{[1]}_j\\
o &= g(z^{[2]})
\end{align*}$$
where $g(u)=1/(1+\exp(-u))$ is the sigmoid function.
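
The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not part of the original solution: the array layout (`W1[i-1, j-1]` holding $w^{[1]}_{i,j}$, `b1[j-1]` holding $w^{[1]}_{0,j}$, and so on) and all numeric values are assumptions chosen for the example. The output layer sums over all three hidden units.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid activation g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W1, b1, w2, b2):
    """Forward pass for one sample x of shape (2,).

    Assumed layout (hypothetical, for illustration):
      W1[i-1, j-1] = w^{[1]}_{i,j},  b1[j-1] = w^{[1]}_{0,j},
      w2[j-1]      = w^{[2]}_{j},    b2      = w^{[2]}_{0}.
    """
    z1 = b1 + x @ W1      # weighted input to the hidden layer, shape (3,)
    a1 = sigmoid(z1)      # hidden activations, shape (3,)
    z2 = b2 + a1 @ w2     # weighted input to the output neuron (scalar)
    o = sigmoid(z2)       # network output
    return z1, a1, z2, o

# Hypothetical weights and input, chosen only to exercise the function.
x = np.array([1.0, 2.0])
W1 = np.array([[0.1, -0.2, 0.3],
               [0.4, 0.5, -0.6]])
b1 = np.array([0.0, 0.1, -0.1])
w2 = np.array([0.7, -0.8, 0.9])
b2 = 0.05

z1, a1, z2, o = forward(x, W1, b1, w2, b2)
print(o)
```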

The loss of a single example $x$ with label $y$ is $(o-y)^2$.

By the chain rule, the derivative of this loss with respect to $w^{[1]}_{1,2}$ is given by

$$ \begin{align*}
\frac{\partial (o-y)^2}{\partial w^{[1]}_{1,2}} &=\frac{\partial (o-y)^2}{\partial o} \cdot \frac{\partial o}{\partial z^{[2]}}\cdot \frac {\partial z^{[2]}} {\partial a^{[1]}_2} \cdot\frac {\partial a^{[1]}_2}{\partial z^{[1]}_2}\cdot\frac {\partial z^{[1]}_2}{\partial w^{[1]}_{1,2}}\\
&=2(o-y)\cdot g'(z^{[2]})\cdot w^{[2]}_2 \cdot g'(z^{[1]}_2) \cdot x_1
\end{align*}$$

Averaging over the $m$ training examples gives the cost
$$ l=\frac 1m \sum_{i=1}^m(o^{(i)}-y^{(i)})^2,$$
whose derivative with respect to $w^{[1]}_{1,2}$ is

$$\frac{\partial l}{\partial w^{[1]}_{1,2}} = \frac 2m\sum_{i=1}^m (o^{(i)}-y^{(i)})\cdot g'(z^{[2](i)})\cdot w^{[2]}_2 \cdot g'(z^{[1](i)}_2) \cdot x^{(i)}_1,$$

which gives the gradient descent update rule for $w^{[1]}_{1,2}$ with learning rate $\alpha$:
$$ w^{[1]}_{1,2} \leftarrow w^{[1]}_{1,2} - \alpha \frac{\partial l}{\partial w^{[1]}_{1,2}}$$
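
As a sanity check, the analytic derivative can be compared against a central finite difference. The sketch below uses $g'(u)=g(u)(1-g(u))$ for the sigmoid; the random data, the array layout (`W1[0, 1]` standing for $w^{[1]}_{1,2}$, `w2[1]` for $w^{[2]}_2$), and the learning rate `alpha = 0.1` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cost(X, y, W1, b1, w2, b2):
    """Mean squared error l over all m rows of X."""
    a1 = sigmoid(b1 + X @ W1)   # hidden activations, shape (m, 3)
    o = sigmoid(b2 + a1 @ w2)   # outputs, shape (m,)
    return np.mean((o - y) ** 2)

def grad_w1_12(X, y, W1, b1, w2, b2):
    """Analytic dl/dw^{[1]}_{1,2}, using g'(u) = g(u) * (1 - g(u))."""
    a1 = sigmoid(b1 + X @ W1)
    o = sigmoid(b2 + a1 @ w2)
    per_example = ((o - y) * o * (1 - o)         # (o - y) * g'(z2)
                   * w2[1]                       # w^{[2]}_2
                   * a1[:, 1] * (1 - a1[:, 1])   # g'(z1_2)
                   * X[:, 0])                    # x_1
    return 2.0 * np.mean(per_example)

# Hypothetical random data and weights, used only for the check.
rng = np.random.default_rng(0)
m = 20
X = rng.normal(size=(m, 2))
y = rng.integers(0, 2, size=m).astype(float)
W1 = rng.normal(size=(2, 3))
b1 = rng.normal(size=3)
w2 = rng.normal(size=3)
b2 = rng.normal()

analytic = grad_w1_12(X, y, W1, b1, w2, b2)

# Central finite difference in the single coordinate w^{[1]}_{1,2}.
eps = 1e-5
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (cost(X, y, W_plus, b1, w2, b2)
           - cost(X, y, W_minus, b1, w2, b2)) / (2 * eps)
print(analytic, numeric)

# One gradient descent step on this weight (alpha is an assumed value).
alpha = 0.1
W1_new = W1.copy()
W1_new[0, 1] -= alpha * analytic
```

The two printed numbers should agree to several decimal places if the formula is implemented correctly.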

## (b)

Reading off the plot, a sample $x=(x_1,x_2)$ is in class $y=0$ iff it satisfies the following three inequalities:
$$\begin{align*} 
0.5 &< x_1, \\
0.5 &< x_2, \\
3.5 &> x_1+x_2,
\end{align*}$$
i.e. iff $x$ lies strictly inside the triangle bounded by the lines $x_1=0.5$, $x_2=0.5$, and $x_1+x_2=3.5$.

We will define the weights between the input layer and the hidden layer in such a way that neuron $h_1$ activates iff $0.5 \geq x_1$, neuron $h_2$ activates iff $0.5 \geq x_2$ and neuron $h_3$ activates iff $3.5 \leq x_1 +x_2$.

Consequently, $x$ should be put into class $y=1$ iff at least one of the three neurons in the hidden layer activates.

Therefore we want the output neuron $o$ to activate iff $a^{[1]}_1 + a^{[1]}_2 + a^{[1]}_3 \geq 1$.

The following weights achieve all of this and give perfect accuracy:

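    w = {}

    # hidden_layer_i_j: weight from input i (i = 0 is the bias) to hidden neuron h_j

    # h_1 activates iff 0.5 - x_1 >= 0
    w['hidden_layer_0_1'] = 0.5
    w['hidden_layer_1_1'] = -1
    w['hidden_layer_2_1'] = 0
    # h_2 activates iff 0.5 - x_2 >= 0
    w['hidden_layer_0_2'] = 0.5
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = -1
    # h_3 activates iff -3.5 + x_1 + x_2 >= 0
    w['hidden_layer_0_3'] = -3.5
    w['hidden_layer_1_3'] = 1
    w['hidden_layer_2_3'] = 1

    # o activates iff -1 + a_1 + a_2 + a_3 >= 0
    w['output_layer_0'] = -1
    w['output_layer_1'] = 1
    w['output_layer_2'] = 1
    w['output_layer_3'] = 1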
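
These weights can be checked with a short script. The sketch below assumes a step activation $f(u)=1$ if $u\geq 0$ and $0$ otherwise, which is what the inequalities above suggest, and it reads the key `hidden_layer_i_j` as the weight from input $i$ (with $i=0$ the bias) to hidden neuron $h_j$. The helper names `step` and `predict` and the test points are hypothetical; the points are not taken from the dataset.

```python
def step(u):
    """Assumed step activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0

def predict(w, x1, x2):
    """Predicted class for the point (x1, x2) under the weights in w."""
    hidden = [
        step(w[f'hidden_layer_0_{j}']
             + w[f'hidden_layer_1_{j}'] * x1
             + w[f'hidden_layer_2_{j}'] * x2)
        for j in (1, 2, 3)
    ]
    z_out = w['output_layer_0'] + sum(
        w[f'output_layer_{j}'] * hidden[j - 1] for j in (1, 2, 3)
    )
    return step(z_out)

# Same weights as in the solution above.
w = {
    'hidden_layer_0_1': 0.5, 'hidden_layer_1_1': -1, 'hidden_layer_2_1': 0,
    'hidden_layer_0_2': 0.5, 'hidden_layer_1_2': 0, 'hidden_layer_2_2': -1,
    'hidden_layer_0_3': -3.5, 'hidden_layer_1_3': 1, 'hidden_layer_2_3': 1,
    'output_layer_0': -1, 'output_layer_1': 1,
    'output_layer_2': 1, 'output_layer_3': 1,
}

# Hypothetical test points: inside the triangle (class 0) and outside (class 1).
inside = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
outside = [(0.2, 1.0), (1.0, 0.3), (2.0, 2.0), (0.0, 0.0), (3.0, 3.0)]

for x1, x2 in inside + outside:
    print((x1, x2), predict(w, x1, x2))
```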

## (c)

Using $f(x)=x$ in the hidden layer implies that the map $x\mapsto a^{[1]}$ is an affine map from $\mathbb R^2$ to $\mathbb R^3$.

So to achieve perfect accuracy we would have to be able to separate the images of class 0 and class 1 in $\mathbb R^3$ by an affine hyperplane.

To see that this is impossible, note that class 0 is contained in the convex hull of class 1 and that affine maps preserve this property. An affine hyperplane that has all images of class 1 on one side also has their convex hull on that side, so it cannot separate the two classes.
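
The same conclusion can be spelled out directly in terms of $z^{[2]}$. The following is a sketch under the assumption that the output neuron predicts class 1 iff $z^{[2]}\geq 0$; the symbols $c$, $v_i$, and $\lambda_k$ are introduced here for illustration only. With $f(x)=x$ we have $a^{[1]}_j = z^{[1]}_j$, so

$$\begin{align*}
z^{[2]}(x) &= w^{[2]}_0 + \sum_{j=1}^3 w^{[2]}_j\Big(w^{[1]}_{0,j} + \sum_{i=1}^2 w^{[1]}_{i,j}x_i\Big) = c + v_1x_1 + v_2x_2,\\
c &= w^{[2]}_0 + \sum_{j=1}^3 w^{[2]}_j w^{[1]}_{0,j}, \qquad v_i = \sum_{j=1}^3 w^{[2]}_j w^{[1]}_{i,j}.
\end{align*}$$

Now let $x$ be a point of class 0 written as a convex combination $x=\sum_k \lambda_k x^{(k)}$ of points $x^{(k)}$ of class 1, with $\lambda_k\geq 0$ and $\sum_k\lambda_k=1$. Since $z^{[2]}$ is affine in $x$,

$$ z^{[2]}(x) = \sum_k \lambda_k\, z^{[2]}(x^{(k)}) \geq 0 \quad\text{whenever } z^{[2]}(x^{(k)})\geq 0 \text{ for all } k,$$

so any choice of weights that classifies every $x^{(k)}$ correctly misclassifies $x$.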