**Hidden layer:** $\quad\boldsymbol{z}=\mathbf{W}\boldsymbol{x}+\boldsymbol{b}_1\quad$ followed by $\quad\boldsymbol{h}=\boldsymbol{\sigma}(\boldsymbol{z})$
$$\mathbf{W}=\begin{pmatrix}1&1\\-1&-1\\1&1\end{pmatrix}\qquad\boldsymbol{b}_1=\begin{pmatrix}1\\-1\\1\end{pmatrix}$$

**Output layer:** $\quad\boldsymbol{a}=\mathbf{V}\boldsymbol{h}+\boldsymbol{b}_2\quad$ followed by $\quad\hat{\boldsymbol{y}}=\boldsymbol{\sigma}(\boldsymbol{a})$
$$\mathbf{V}=\begin{pmatrix}1&1&1\\-1&-1&-1\end{pmatrix}\qquad\boldsymbol{b}_2=\begin{pmatrix}1\\-1\end{pmatrix}$$

**Squared-error loss (for a single input-output pair):** $\quad\mathcal{L}=\frac{1}{2}\lVert\boldsymbol{y}-\hat{\boldsymbol{y}}\rVert_2^2$

**Input-output pair:** $\quad\boldsymbol{x}=(1,1)^t\qquad\boldsymbol{y}=(1,0)^t$

**Forward:** $\;$ pre-activations, activations, and loss

In [1]:
import numpy as np
np.set_printoptions(precision=4)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic sigmoid, elementwise
# Input-output pair and initial parameters
x = np.array([1, 1]); y = np.array([1, 0])
W = np.array([[1, 1], [-1, -1], [1, 1]]); b1 = np.array([1, -1, 1])
V = np.array([[1, 1, 1], [-1, -1, -1]]); b2 = np.array([1, -1])
# Hidden layer: pre-activation z and activation h
z = W @ x + b1; print("z =", z)
h = sigmoid(z); print("h =", h)
# Output layer: pre-activation a and prediction haty
a = V @ h + b2; print("a =", a)
haty = sigmoid(a); print("haty =", haty)
# Squared-error loss for this single pair
loss = .5 * np.square(y-haty).sum(); print("loss =", round(loss, 4))

z = [ 3 -3  3]
h = [0.9526 0.0474 0.9526]
a = [ 2.9526 -2.9526]
haty = [0.9504 0.0496]
loss = 0.0025
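
The printed values can be reproduced by hand. The following worked arithmetic is only a check, rounded to four decimals, and it uses the target $\boldsymbol{y}=(1,0)^t$ set in the code cell:

$$\begin{align*}
\boldsymbol{z}&=\mathbf{W}\boldsymbol{x}+\boldsymbol{b}_1=(2,-2,2)^t+(1,-1,1)^t=(3,-3,3)^t\\
\boldsymbol{h}&=\boldsymbol{\sigma}(\boldsymbol{z})\approx(0.9526,\,0.0474,\,0.9526)^t\\
\boldsymbol{a}&=\mathbf{V}\boldsymbol{h}+\boldsymbol{b}_2\approx(1.9526,\,-1.9526)^t+(1,-1)^t=(2.9526,\,-2.9526)^t\\
\hat{\boldsymbol{y}}&=\boldsymbol{\sigma}(\boldsymbol{a})\approx(0.9504,\,0.0496)^t\\
\mathcal{L}&=\tfrac{1}{2}\left[(1-0.9504)^2+(0-0.0496)^2\right]\approx 0.0025
\end{align*}$$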


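**Backward:** $\;$ Jacobians of the loss with respect to activations, pre-activations, and parameters. The row vector $\boldsymbol{u}^t$ is overwritten at each step with the Jacobian accumulated so far.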

* *Prediction (output-layer activation):* $\quad\boldsymbol{u}^t=\dfrac{\partial\mathcal{L}}{\partial\hat{\boldsymbol{y}}}=(\hat{\boldsymbol{y}}-\boldsymbol{y})^t$
* *Output-layer pre-activation:* $\quad\boldsymbol{u}^t=\boldsymbol{u}^t\dfrac{\partial\hat{\boldsymbol{y}}}{\partial\boldsymbol{a}}=\boldsymbol{u}^t\operatorname{diag}(\boldsymbol{\sigma}'(\boldsymbol{a}))$
* *Output-layer parameters:* $\quad\boldsymbol{g}_{\mathbf{V}}=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{a}}{\partial\mathbf{V}}=\boldsymbol{h}\boldsymbol{u}^t\quad\boldsymbol{g}_{\boldsymbol{b}_2}=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{a}}{\partial\boldsymbol{b}_2}=\boldsymbol{u}^t$
* *Hidden-layer activation:* $\quad\boldsymbol{u}^t=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{a}}{\partial\boldsymbol{h}}=\boldsymbol{u}^t\mathbf{V}$
* *Hidden-layer pre-activation:* $\quad\boldsymbol{u}^t=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{h}}{\partial\boldsymbol{z}}=\boldsymbol{u}^t\operatorname{diag}(\boldsymbol{\sigma}'(\boldsymbol{z}))$
* *Hidden-layer parameters:* $\quad\boldsymbol{g}_{\mathbf{W}}=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{z}}{\partial\mathbf{W}}=\boldsymbol{x}\boldsymbol{u}^t\quad\boldsymbol{g}_{\boldsymbol{b}_1}=\boldsymbol{u}^t\dfrac{\partial\boldsymbol{z}}{\partial\boldsymbol{b}_1}=\boldsymbol{u}^t$
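
The backward code cell evaluates $\boldsymbol{\sigma}'$ as `sigmoid(t) * sigmoid(-t)`. A short derivation of why this equals the sigmoid derivative, for a scalar argument $t$ (the symbol $t$ is introduced here only for illustration):

$$\begin{align*}
\sigma(t)&=\frac{1}{1+e^{-t}}\\
\sigma'(t)&=\frac{e^{-t}}{(1+e^{-t})^2}=\sigma(t)\,\frac{e^{-t}}{1+e^{-t}}=\sigma(t)\,(1-\sigma(t))\\
1-\sigma(t)&=\frac{e^{-t}}{1+e^{-t}}=\frac{1}{1+e^{t}}=\sigma(-t)
\quad\Longrightarrow\quad\sigma'(t)=\sigma(t)\,\sigma(-t)
\end{align*}$$

Because the sigmoid acts elementwise, its Jacobian is diagonal, so multiplying $\boldsymbol{u}^t$ by $\operatorname{diag}(\boldsymbol{\sigma}'(\cdot))$ reduces to an elementwise product.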

In [2]:
# Backward pass; sigmoid'(t) is computed as sigmoid(t) * sigmoid(-t)
# Output layer
J_haty = haty-y;                         print("J_haty =", J_haty)
J_a = J_haty * sigmoid(a) * sigmoid(-a); print("J_a =", J_a)
J_V = np.outer(h, J_a);                  print("J_V =", J_V)
J_b2 = J_a;                              print("J_b2 =", J_b2)
# Hidden layer
J_h = J_a @ V;                           print("J_h =", J_h)
J_z = J_h * sigmoid(z) * sigmoid(-z);    print("J_z =", J_z)
J_W = np.outer(x, J_z);                  print("J_W =", J_W)
J_b1 = J_z;                              print("J_b1 =", J_b1)

J_haty = [-0.0496  0.0496]
J_a = [-0.0023  0.0023]
J_V = [[-0.0022  0.0022]
 [-0.0001  0.0001]
 [-0.0022  0.0022]]
J_b2 = [-0.0023  0.0023]
J_h = [-0.0047 -0.0047 -0.0047]
J_z = [-0.0002 -0.0002 -0.0002]
J_W = [[-0.0002 -0.0002 -0.0002]
 [-0.0002 -0.0002 -0.0002]]
J_b1 = [-0.0002 -0.0002 -0.0002]
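
One way to gain confidence in the analytic Jacobians is to compare them against central finite differences. The sketch below is self-contained and repeats the forward and backward passes with float arrays; the helper names `loss_fn` and `numerical_grad` and the step size `eps=1e-6` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_fn(W, b1, V, b2, x, y):
    """Forward pass followed by the squared-error loss."""
    h = sigmoid(W @ x + b1)
    haty = sigmoid(V @ h + b2)
    return 0.5 * np.square(y - haty).sum()

def numerical_grad(f, P, eps=1e-6):
    """Central-difference estimate of df/dP, with the same shape as P."""
    G = np.zeros_like(P, dtype=float)
    for idx in np.ndindex(P.shape):
        Pp = P.copy(); Pp[idx] += eps
        Pm = P.copy(); Pm[idx] -= eps
        G[idx] = (f(Pp) - f(Pm)) / (2 * eps)
    return G

# Same pair and parameters as above, stored as floats so that +/- eps works
x = np.array([1.0, 1.0]); y = np.array([1.0, 0.0])
W = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]); b1 = np.array([1.0, -1.0, 1.0])
V = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]); b2 = np.array([1.0, -1.0])

# Analytic backward pass (same formulas as the cell above)
z = W @ x + b1; h = sigmoid(z)
a = V @ h + b2; haty = sigmoid(a)
J_a = (haty - y) * sigmoid(a) * sigmoid(-a)
J_V = np.outer(h, J_a); J_b2 = J_a
J_z = (J_a @ V) * sigmoid(z) * sigmoid(-z)
J_W = np.outer(x, J_z); J_b1 = J_z

# Numerical estimates; J_V and J_W are transposed relative to V and W
num_W = numerical_grad(lambda P: loss_fn(P, b1, V, b2, x, y), W)
num_b1 = numerical_grad(lambda P: loss_fn(W, P, V, b2, x, y), b1)
num_V = numerical_grad(lambda P: loss_fn(W, b1, P, b2, x, y), V)
num_b2 = numerical_grad(lambda P: loss_fn(W, b1, V, P, x, y), b2)

err_W = np.abs(num_W - J_W.T).max()
err_b1 = np.abs(num_b1 - J_b1).max()
err_V = np.abs(num_V - J_V.T).max()
err_b2 = np.abs(num_b2 - J_b2).max()
print("max abs error:", err_W, err_b1, err_V, err_b2)
```

If the backward formulas are right, all four errors should be many orders of magnitude smaller than the gradients themselves.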


**Parameter update:** $\quad\mathbf{V}=\mathbf{V}-\eta\boldsymbol{g}_{\mathbf{V}}^t\qquad\boldsymbol{b}_2=\boldsymbol{b}_2-\eta\boldsymbol{g}_{\boldsymbol{b}_2}^t\qquad\mathbf{W}=\mathbf{W}-\eta\boldsymbol{g}_{\mathbf{W}}^t\qquad\boldsymbol{b}_1=\boldsymbol{b}_1-\eta\boldsymbol{g}_{\boldsymbol{b}_1}^t$


In [4]:
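# One gradient-descent step with learning rate 1.0.
# J_W and J_V are stored transposed relative to W and V, hence the .T
W  = W  - 1.0 * J_W.T; print("W =", W)
b1 = b1 - 1.0 * J_b1;  print("b1 =", b1)
V  = V  - 1.0 * J_V.T; print("V =", V)
b2 = b2 - 1.0 * J_b2;  print("b2 =", b2)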

W = [[ 1.0002  1.0002]
 [-0.9998 -0.9998]
 [ 1.0002  1.0002]]
b1 = [ 1.0002 -0.9998  1.0002]
V = [[ 1.0022  1.0001  1.0022]
 [-1.0022 -1.0001 -1.0022]]
b2 = [ 1.0023 -1.0023]
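
Repeating forward, backward, and update gives plain gradient descent on this single pair. The sketch below is self-contained and uses the same learning rate of 1.0 as the update cell; the number of steps (100) and the names `eta`, `n_steps`, and `losses` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Same pair and initial parameters as above, stored as floats
x = np.array([1.0, 1.0]); y = np.array([1.0, 0.0])
W = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]); b1 = np.array([1.0, -1.0, 1.0])
V = np.array([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]); b2 = np.array([1.0, -1.0])

eta = 1.0      # learning rate, as in the update cell
n_steps = 100  # hypothetical number of iterations
losses = []

for step in range(n_steps):
    # Forward pass
    z = W @ x + b1; h = sigmoid(z)
    a = V @ h + b2; haty = sigmoid(a)
    losses.append(0.5 * np.square(y - haty).sum())

    # Backward pass
    J_a = (haty - y) * sigmoid(a) * sigmoid(-a)
    J_V = np.outer(h, J_a); J_b2 = J_a
    J_z = (J_a @ V) * sigmoid(z) * sigmoid(-z)
    J_W = np.outer(x, J_z); J_b1 = J_z

    # Parameter update (all gradients were computed before any parameter changes)
    W = W - eta * J_W.T; b1 = b1 - eta * J_b1
    V = V - eta * J_V.T; b2 = b2 - eta * J_b2

print("first loss =", losses[0])
print("last loss  =", losses[-1])
```

The gradients are small here because both sigmoid layers are close to saturation at the initial parameters, so each step changes the loss only slightly.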
