<a href="https://colab.research.google.com/github/gibranfp/CursoAprendizajeProfundo/blob/master/notebooks/1c_retropropagacion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retropropagación

En este *notebook* programaremos con NumPy una red neuronal densa y la entrenaremos para aproximar la operación XOR usando del gradiente descedente con el algoritmo de retropropagación. Recordemos que la operación XOR ($\otimes$) está de la siguiente manera:

| $x_1$ | $x_2$ | $y$
| ------------- |:-------------:| -----:|
|0 |0 |0|
|0 |1 |1|
|1 |0 |1|
|1 |1 |0|


In [1]:
import numpy as np
import cupy as cp

Nuestra red neuronal densa está compuesta por una capa de 2 entradas ($x_1$ y $x_2$), una capa oculta con 10 neuronas con función de activación sigmoide y una capa de salida con una sola neurona con función de activación sigmoide. Esta función de activación se define como:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

In [2]:
def sigmoide(z):
    return 1 / (1 + np.exp(-z))

La función sigmoide tiene una derivada que está expresada en términos de la misma función, esto es, 

$$
\frac{\partial \sigma (z)}{\partial z} = \sigma(z) (1 - \sigma(z))
$$

In [3]:
def derivada_sigmoide(x):
    return np.multiply(sigmoide(x), (1.0 - sigmoide(x)))

Podemos ver la operación XOR como una tarea de clasificación binaria a partir de 2 entradas. Por lo tanto, usaremos la función de pérdida de entropía cruzada binaria:

$$
ECB(\mathbf{y}, \mathbf{\hat{y}})  = -\sum_{i=1}^N \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]
$$

In [4]:
def entropia_cruzada_binaria(y, p):
    p[p == 0] = np.nextafter(0., 1.)
    p[p == 1] = np.nextafter(1., 0.)
    return -(np.log(p[y == 1]).sum() + np.log(1 - p[y == 0]).sum())

Asimismo, calcularemos la exactitud para medir el rendimiento del modelo aprendido por la red neuronal densa:

$$
exactitud = \frac{correctos}{total}
$$

In [5]:
def exactitud(y, y_predicha):
    return (y == y_predicha).mean() * 100

Ahora, definimos la función que propaga hacia adelante una entrada $\mathbf{x}^{i}$. Como la red está compuesta de 2 capas densas (1 oculta y 1 de salida), tenemos 2 matrices de pesos con sus correspondientes vectores de sesgos $\{\mathbf{W}^{\{1\}}, \mathbf{b}^{\{1\}}\}$ y $\{\mathbf{W}^{\{2\}}, \mathbf{b}^{\{2\}}\}$ de la capa oculta y la capa de salida respectivamente. Así, podemos llevar a cabo la propagación hacia adelante en esta red de la siguiente manera:

$$
	\begin{split}
				\mathbf{a}^{\{1\}} & =  \mathbf{x}^{(i)} \\
				\mathbf{z}^{\{2\}} & =  \mathbf{W}^{\{1\}} \cdot \mathbf{a}^{\{1\}} + \mathbf{b}^{\{1\}}\\
				\mathbf{a}^{\{2\}} & =  \sigma(\mathbf{z}^{\{2\}}) \\
				\mathbf{z}^{\{3\}} & =  \mathbf{W}^{\{2\}} \cdot \mathbf{a}^{\{2\}}  + \mathbf{b}^{\{2\}}\\
				\mathbf{a}^{\{3\}} & =  \sigma(\mathbf{z}^{\{3\}})\\
				\hat{y}^{(i)} & =  \mathbf{a}^{\{3\}}
			\end{split}
      $$

In [6]:
def hacia_adelante(x, W1, b1, W2, b2):
    z2 = np.dot(W1.T, x[:, np.newaxis]) + b1
    a2 = sigmoide(z2)
    z3 = np.dot(W2.T, a2) + b2
    y_hat = sigmoide(z3)
  
    return z2, a2, z3, y_hat

Finalmente, definimos la función para entrenar nuestra red neuronal usando gradiente descendente. Para calcular el gradiente de la función de pérdida respecto a los pesos y sesgos en cada capa empleamos el algoritmo de retropropagación.



In [7]:
def retropropagacion(X, y, alpha = 0.01, n_epocas = 100, n_ocultas = 10):
    n_ejemplos = X.shape[0]
    n_entradas = X.shape[1]
    
    # Inicialización de las matrices de pesos W y V
    W1 = np.sqrt(1.0 / n_entradas) * np.random.randn(n_entradas, n_ocultas)
    b1 = np.zeros((n_ocultas, 1))
    
    W2 = np.sqrt(1.0 / n_ocultas) * np.random.randn(n_ocultas, 1)
    b2 = np.zeros((1, 1))
    
    perdidas = np.zeros((n_epocas))
    exactitudes = np.zeros((n_epocas))
    y_predicha = np.zeros((y.shape))
    for i in range(n_epocas):
        for j in range(n_ejemplos):
            z2, a2, z3, y_hat = hacia_adelante(X[j], W1, b1, W2, b2)

            # cálculo de gradiente para W2 por retropropagación
            delta3 = (y_hat - y[j]) * derivada_sigmoide(z3) 
            W2 = W2 - alpha * np.outer(a2, delta3)
            b2 = b2 - alpha * delta3

            # cálculo de gradiente para W1 por retropropagación
            delta2 = np.dot(W2, delta3) * derivada_sigmoide(z2)
            W1 = W1 - alpha * np.outer(X[j], delta2)
            b1 = b1 - alpha * delta2

            y_predicha[j] = y_hat
            
        # calcula la pérdida en la época
        perdidas[i] = entropia_cruzada_binaria(y, y_predicha)
        exactitudes[i] = exactitud(y, np.round(y_predicha))
        print('Epoch {0}: Pérdida = {1} Exactitud = {2}'.format(i, 
                                                              perdidas[i], 
                                                              exactitudes[i]))

    return W1, W2, perdidas, exactitudes

Para probar nuestra red, generamos los ejemplos correspondientes a la operación XOR.

In [8]:
# ejemplo (XOR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0, 1, 1, 0]]).T

Finalmente, entrenamos nuestra red con estos ejemplos por 200 épocas usando una tasa de aprendizaje $\alpha = 1.0$.

In [9]:
np.random.seed(0)
W1, W2, perdidas, exactitudes = retropropagacion(X, 
                                                 y, 
                                                 alpha = 1.0, 
                                                 n_epocas = 200,
                                                 n_ocultas = 10)

[[0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]
 [0.5]]
[[0.53311749]
 [0.73566063]
 [0.62971697]
 [0.52416677]
 [0.57227128]
 [0.56329754]
 [0.74239054]
 [0.46492639]
 [0.55152145]
 [0.35028535]]
[[0.7774543 ]
 [0.57063245]
 [0.66688212]
 [0.83023907]
 [0.7895173 ]
 [0.3345627 ]
 [0.66261142]
 [0.47384355]
 [0.48224681]
 [0.57204407]]
[[0.78605751]
 [0.79127772]
 [0.77840211]
 [0.84040919]
 [0.84362753]
 [0.38112402]
 [0.85033667]
 [0.43707982]
 [0.54794098]
 [0.43123322]]
Epoch 0: Pérdida = 3.4737521571212318 Exactitud = 25.0
[[0.50296395]
 [0.50101852]
 [0.5008878 ]
 [0.50150058]
 [0.50031669]
 [0.50265438]
 [0.5014267 ]
 [0.50119235]
 [0.49960912]
 [0.4990332 ]]
[[0.53406601]
 [0.73707204]
 [0.63162112]
 [0.52510798]
 [0.57600481]
 [0.5659117 ]
 [0.74387175]
 [0.46637189]
 [0.55163048]
 [0.34944419]]
[[0.78031027]
 [0.57249319]
 [0.66825357]
 [0.83137404]
 [0.79018537]
 [0.3380944 ]
 [0.66450255]
 [0.47555352]
 [0.48148024]
 [0.57062196]]
[[0.78663511]
 [0.7930523 ]


 [0.60736859]]
[[0.82549432]
 [0.90895833]
 [0.89829819]
 [0.84370604]
 [0.96630296]
 [0.56138508]
 [0.93310924]
 [0.5560886 ]
 [0.6484074 ]
 [0.44093327]]
Epoch 65: Pérdida = 3.491876218919175 Exactitud = 25.0
[[0.60785501]
 [0.61530621]
 [0.61027425]
 [0.56942771]
 [0.61106452]
 [0.59232763]
 [0.60714254]
 [0.57274371]
 [0.58400284]
 [0.53545839]]
[[0.45438629]
 [0.83649375]
 [0.76752189]
 [0.4727279 ]
 [0.83114047]
 [0.66984641]
 [0.83615694]
 [0.54263083]
 [0.63883223]
 [0.36108252]]
[[0.9083749 ]
 [0.75420448]
 [0.80461087]
 [0.89513445]
 [0.89706752]
 [0.48899534]
 [0.80562944]
 [0.58974332]
 [0.58625114]
 [0.60813879]]
[[0.82697286]
 [0.91024179]
 [0.89960621]
 [0.84402006]
 [0.96709739]
 [0.56415748]
 [0.93413432]
 [0.55797801]
 [0.65008832]
 [0.44117972]]
Epoch 66: Pérdida = 3.4891911065517616 Exactitud = 25.0
[[0.6106255 ]
 [0.61682556]
 [0.61168249]
 [0.57103603]
 [0.61138366]
 [0.59373878]
 [0.6085166 ]
 [0.57387378]
 [0.58529327]
 [0.53600739]]
[[0.45365433]
 [0.83765629]


 [0.39441956]]
Epoch 127: Pérdida = 2.968925623935632 Exactitud = 50.0
[[0.76983575]
 [0.68782352]
 [0.67598758]
 [0.69005678]
 [0.55443333]
 [0.66459702]
 [0.66509915]
 [0.63541948]
 [0.65308168]
 [0.55316499]]
[[0.40242029]
 [0.8858819 ]
 [0.83790308]
 [0.46624817]
 [0.89520802]
 [0.79828914]
 [0.88976354]
 [0.6193697 ]
 [0.69237623]
 [0.29303819]]
[[0.98408944]
 [0.8694566 ]
 [0.88637493]
 [0.96120185]
 [0.93900645]
 [0.59401933]
 [0.90065917]
 [0.69076379]
 [0.70091808]
 [0.64130312]]
[[0.92020438]
 [0.96015331]
 [0.95212064]
 [0.90183969]
 [0.99136329]
 [0.73433798]
 [0.97457371]
 [0.67505853]
 [0.73909909]
 [0.39135846]]
Epoch 128: Pérdida = 2.955637323757989 Exactitud = 50.0
[[0.77143697]
 [0.68874476]
 [0.67683149]
 [0.69164547]
 [0.55321988]
 [0.66525858]
 [0.66581573]
 [0.63627933]
 [0.65399149]
 [0.55301704]]
[[0.40103652]
 [0.88639871]
 [0.83863741]
 [0.46646324]
 [0.89578605]
 [0.80058065]
 [0.89034242]
 [0.62057456]
 [0.69310507]
 [0.28985423]]
[[0.98452704]
 [0.87070211]

 [0.55381286]]
[[0.28306253]
 [0.90704687]
 [0.86943191]
 [0.42690959]
 [0.91960797]
 [0.90203853]
 [0.9143353 ]
 [0.67158586]
 [0.73072723]
 [0.09023636]]
[[0.9956207 ]
 [0.91483398]
 [0.92333088]
 [0.98232303]
 [0.96139145]
 [0.49390586]
 [0.93988304]
 [0.75526864]
 [0.76569412]
 [0.70427483]]
[[0.95041849]
 [0.97551027]
 [0.97020594]
 [0.93373081]
 [0.99682816]
 [0.81507419]
 [0.98674781]
 [0.75341486]
 [0.79763353]
 [0.18038162]]
Epoch 181: Pérdida = 1.7989431569565686 Exactitud = 100.0
[[0.81579764]
 [0.7281419 ]
 [0.71533931]
 [0.73799297]
 [0.48960534]
 [0.64625413]
 [0.69733038]
 [0.674493  ]
 [0.69298348]
 [0.55363686]]
[[0.28053068]
 [0.90728   ]
 [0.86981788]
 [0.42513521]
 [0.91966522]
 [0.90344059]
 [0.91457781]
 [0.67212958]
 [0.73122157]
 [0.08766745]]
[[0.99571084]
 [0.91527362]
 [0.923705  ]
 [0.98253515]
 [0.96146652]
 [0.48939411]
 [0.94022013]
 [0.75598935]
 [0.76638934]
 [0.70708104]]
[[0.95064579]
 [0.97565792]
 [0.97039242]
 [0.93390614]
 [0.99686369]
 [0.8154937

Graficamos el valor de la pérdida y la exactitud en cada época para ver el comportamiento de nuestra red durante el entrenamiento:

In [10]:
import matplotlib.pyplot as plt
plt.plot(np.arange(perdidas.size), perdidas, label='ECB')
plt.plot(np.arange(exactitudes.size), exactitudes, label='Exactitud')
plt.legend()
plt.grid(True)
plt.show()

<Figure size 640x480 with 1 Axes>

## Inicializando los pesos con zeros
Como se mencionó anteriormente, las matrices de pesos $\mathbf{W^{\{1\}}}$ y $\mathbf{W^{\{2\}}}$ se initializan con valores aleatorios pequeños mientras que los vectores de sesgo $\mathbf{b^{\{1\}}}$ y $\mathbf{b^{\{1\}}}$ con zeros. Examinemos qué pasa si inicializamos las matrices de pesos con zeros. Observa los valores de los pesos en cada época.

In [11]:
def retropropagacion_zeros(X, y, alpha = 0.1, n_epocas = 100, n_ocultas = 10):
    n_ejemplos = X.shape[0]
    n_entradas = X.shape[1]
    
    # Inicializa matrices de pesos W1 y W2 y vectores de sesgos b1 y b2
    W1 = np.zeros((n_entradas, n_ocultas))
    b1 = np.zeros((n_ocultas, 1)) 
    W2 = np.zeros((n_ocultas, 1))
    b2 = np.zeros((1, 1))
    
    perdidas = np.zeros((n_epocas))
    exactitudes = np.zeros((n_epocas))
    y_predicha = np.zeros((y.shape))
    for i in range(n_epocas):
        for j in range(n_ejemplos):
            z2, a2, z3, y_hat = hacia_adelante(X[j], W1, b1, W2, b2)

            # cálculo de gradiente para W2 por retropropagación
            delta3 = (y[j] - y_hat) * derivada_sigmoide(z3)
            W2 = W2 - alpha * np.outer(a2, delta3)
            b2 = b2 - alpha * delta3
            
            # calculo de gradiente para W1 por retropropagación
            delta2 = np.dot(W2, delta3) * derivada_sigmoide(z2)
            W1 = W1 - alpha * np.outer(X[j], delta2)
            b1 = b1 - alpha * delta2
            
            y_predicha[j] = y_hat
            
        # calcula la pérdida en época
        perdidas[i] = entropia_cruzada_binaria(y, y_predicha)
        exactitudes[i] = exactitud(y, np.round(y_predicha))
        print('Epoch {0}: Pérdida = {1} Exactitud = {2}'.format(i, 
                                                              perdidas[i], 
                                                              exactitudes[i]))
        print('W1 = {0}'.format(W1))
        print('W2 = {0}'.format(W2))
            
    return W1, W2, perdidas, exactitudes

In [12]:
W1, W2, perdidas, exactitudes = retropropagacion_zeros(X, 
                                                       y, 
                                                       alpha = 1.0,
                                                       n_epocas = 5,
                                                       n_ocultas = 10)

Epoch 0: Pérdida = 2.383938269247035 Exactitud = 100.0
W1 = [[ 0.0015006   0.0015006   0.0015006   0.0015006   0.0015006   0.0015006
   0.0015006   0.0015006   0.0015006   0.0015006 ]
 [-0.00013937 -0.00013937 -0.00013937 -0.00013937 -0.00013937 -0.00013937
  -0.00013937 -0.00013937 -0.00013937 -0.00013937]]
W2 = [[0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]
 [0.00877009]]
Epoch 1: Pérdida = 2.3940150790588524 Exactitud = 75.0
W1 = [[3.07843345e-03 3.07843345e-03 3.07843345e-03 3.07843345e-03
  3.07843345e-03 3.07843345e-03 3.07843345e-03 3.07843345e-03
  3.07843345e-03 3.07843345e-03]
 [6.32326402e-05 6.32326402e-05 6.32326402e-05 6.32326402e-05
  6.32326402e-05 6.32326402e-05 6.32326402e-05 6.32326402e-05
  6.32326402e-05 6.32326402e-05]]
W2 = [[0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]
 [0.03001629]]
Epoch 2: Pérdida = 2.45