In [186]:
import os
import numpy as np
import pandas as pd

Intended for personal future references & teaching. If you are looking into learning this on your own, check out (https://www.youtube.com/watch?v=K5P-Ac31Hc4)

# Theory

Initialize Data

In [187]:
X1=[1,0,1,0,1,0,1,0]
X2=[0,1,1,0,0,0,1,1]
X3=[0,0,0,0,1,1,1,1]
y =[1,0,1,0,0,0,1,1]
dict_={"X1":X1, "X2":X2, "X3":X3, "y":y}
df=pd.DataFrame(dict_)
df

Unnamed: 0,X1,X2,X3,y
0,1,0,0,1
1,0,1,0,0
2,1,1,0,1
3,0,0,0,0
4,1,0,1,0
5,0,0,1,0
6,1,1,1,1
7,0,1,1,1


Input layer (A) -> Hidden layer (B) -> output (C)

1. Compute $B=WA$ where W is the weights ($B_i=W_{ij}A_j$)

In [188]:
A=np.array([X1[0],X2[0],X3[0]])
W=-np.random.randn(3,4)

In [189]:
B=np.dot(A,W)
B

array([ 1.39501653,  0.72739444,  1.52188177, -0.27303581])

2. feed B to activation function

In [190]:
def sigmoid(x):
    return 1.0/(1.0+np.exp(-x))

In [191]:
B_act=sigmoid(B)
B_act

array([0.8013919 , 0.67423324, 0.82081541, 0.43216196])

3. Compute C=VB

In [192]:
V=np.random.randn(4,1)
C=np.dot(B_act,V)

In [193]:
C_act=sigmoid(C)
C_act

array([0.79673635])

4. Calculate Loss

In [195]:
L=0.5*(y[0]-C_act)**2
L

array([0.02065806])

Begin of back propagation

5. Reduce Loss  
$W\leftarrow W-\frac{\partial L}{\partial W}$  
$V\leftarrow V-\frac{\partial L}{\partial W}$  

Given that   
$B=b(A,W)$  
$B_{act}=act_b(B)$  
$C=c(B_{act},V)$  
$C_{act}=act_c(C)$  
$L=l(C_{act})$

$\implies L=l(act_c(    c(act_b(  b(A) ,W )   ),V) )$

By chain rule:  
$\frac{\partial l}{\partial V}=\frac{\partial l}{\partial C_{act}} \frac{\partial act_c}{\partial C}  \frac{\partial c}{\partial V}$ where

1. $\frac{\partial l(C_{act})}{\partial C_{act}}\\
=\frac{1}{2}(y-C_{act})^2\\
=C_{act}-y$

In [196]:
dl_dC_act=C_act=y[0]

2. $\frac{\partial act_c(C)}{\partial C}= \frac{\partial}{\partial C}\frac{1}{1+e^{-C}}\\
=-(1+e^{-C})^{-2}\frac{\partial}{\partial C}(1+e^{-C})\\
=\frac{1}{e^C(1+e^{-C})^2}$  

here note $1-act_c(C)=\frac{1}{e^C+1}$  
Thus $\frac{\partial act_c(C)}{\partial C}=act_c(C)(1-act_c(C))$

In [197]:
dact_c_dC=sigmoid(C)*(1-sigmoid(C))

3. $\frac{\partial c}{\partial V}=\frac{\partial }{\partial V}(V_{ij}C_j)$  
here i=1, thus  
$=\frac{\partial}{\partial V_k}(V_{1j}B_j)$  
but $\frac{\partial V_i}{\partial V_j}=1$ if $i=j,$ 0 else. Hence j=k  
$=B_j$

In [198]:
dc_dV=np.copy(B)

Altogether:

In [175]:
dl_dV=dl_dC_act*dact_c_dC*B
dl_dV

array([-0.05089695, -0.01993249,  0.3749755 ,  0.05379077])

also get

$\frac{\partial l}{\partial W}=\frac{\partial l}{\partial C_{act}} 
\frac{\partial act_c}{\partial C} \frac{\partial c}{\partial B_{act}}
\frac{\partial act_b}{\partial B} \frac{\partial b}{\partial W}$

where
the first two values are calculated from above, and  
1. $\frac{\partial c(B_{act},V)}{\partial B_{act}}=\frac{\partial}{\partial B_{act}}(V_{ij}B_{act,j})\\
= V_{ij}\frac{\partial}{\partial B_{act}}B_{act,j}\\
=V_{ij}$

In [200]:
dc_dB_act=np.copy(V)

2. $\frac{\partial act_{b}(B)}{\partial B}= act_b(B_i)(1-act_b(B_i))$ from above results

In [202]:
dact_b_dB=sigmoid(B)*(1-sigmoid(B))

3. $\frac{\partial b}{\partial W}= A_i$ from above result as well

altogether

In [203]:
np.dot(   np.dot(dl_dC_act,  dact_c_dC*dc_dB_act  ).T   ,dact_b_dB)

array([0.0782113])

In [204]:
np.array([A,A,A,A])

array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])

In [206]:
dl_dW=   np.dot(   np.dot(dl_dC_act,  dact_c_dC*dc_dB_act  ).T   ,dact_b_dB)*   np.array([A,A,A,A])

In [207]:
W=W-dl_dW.T
W

array([[ 1.31680524,  0.64918314,  1.44367047, -0.3512471 ],
       [-1.59576699,  0.82262018,  0.62092393,  0.56616541],
       [-0.67375054,  0.39160322, -1.05787226,  0.89798339]])

In [208]:
dl_dV.shape

(4,)

In [212]:
V=V.T-dl_dV
V

array([[-0.10723884,  2.09341734, -0.37149823,  0.15881946]])

# Train