<img style="float: left;" src='https://github.com/gdesirena/Procesamiento_Natural_del_Lenguaje_2024/blob/main/Modulo%20III/Figures/alinco.png?raw=1' />

# Module III: Word Vectors (Word Embeddings) and CBOW 02


We will see how to prepare the data in order to apply:


- Forward propagation.

- Cross-entropy loss.

- Backpropagation.

- Gradient descent.


In [1]:
import numpy as np

## Forward propagation


<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://github.com/gdesirena/Procesamiento_Natural_del_Lenguaje_2024/blob/main/Modulo%20III/Figures/cbow_model_dimensions_single_input.png?raw=1' alt="alternate text" width="width" height="height" style="width:839;height:349;" /> Figure 2 </div>

In [2]:
N = 3  # dimension of the word embeddings (size of the hidden layer)
V = 5  # vocabulary size

### Weight and bias initialization

In [3]:
W1 = np.array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
               [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
               [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])


W2 = np.array([[-0.22182064, -0.43008631,  0.13310965],
               [ 0.08476603,  0.08123194,  0.1772054 ],
               [ 0.1871551 , -0.06107263, -0.1790735 ],
               [ 0.07055222, -0.02015138,  0.36107434],
               [ 0.33480474, -0.39423389, -0.43959196]])


b1 = np.array([[ 0.09688219],
               [ 0.29239497],
               [-0.27364426]])


b2 = np.array([[ 0.0352008 ],
               [-0.36393384],
               [-0.12775555],
               [-0.34802326],
               [-0.07017815]])

In [None]:
# W1 = np.random.normal(size=(N, V))  # randomly generated N x V matrix
# W2 = np.random.normal(size=(V, N))  # randomly generated V x N matrix
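The fixed values above keep the walkthrough reproducible. For an actual training run the parameters would normally start from random values. The following is a minimal sketch of that, assuming a hypothetical helper `initialize_model` and a uniform range of $[-0.5, 0.5)$; neither comes from the notebook.

```python
import numpy as np

def initialize_model(N, V, seed=0):
    """Hypothetical helper: random parameters with the shapes used by CBOW."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(N, V))  # hidden-layer weights, N x V
    W2 = rng.uniform(-0.5, 0.5, size=(V, N))  # output-layer weights, V x N
    b1 = rng.uniform(-0.5, 0.5, size=(N, 1))  # hidden-layer bias, N x 1
    b2 = rng.uniform(-0.5, 0.5, size=(V, 1))  # output-layer bias, V x 1
    return W1, W2, b1, b2

W1_rand, W2_rand, b1_rand, b2_rand = initialize_model(N=3, V=5)
print(W1_rand.shape, W2_rand.shape, b1_rand.shape, b2_rand.shape)
# (3, 5) (5, 3) (3, 1) (5, 1)
```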


Add the functions covered in the previous notebooks.

In [6]:
def get_dict(data):
    words = sorted(list(set(data)))
    n = len(words)
    idx = 0
    # return these correctly
    word2Ind = {}
    Ind2word = {}
    for k in words:
        word2Ind[k] = idx
        Ind2word[idx] = k
        idx += 1
    return word2Ind, Ind2word

def get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
        yield context_words, center_word
        i += 1

def word_to_one_hot_vector(word, word2Ind, V):
    one_hot_vector = np.zeros(V)
    one_hot_vector[word2Ind[word]] = 1
    return one_hot_vector

def context_words_to_vector(context_words, word2Ind, V):
    context_words_vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]
    context_words_vectors = np.mean(context_words_vectors, axis=0)
    return context_words_vectors

def get_training_example(words, C, word2Ind, V):
    for context_words, center_word in get_windows(words, C):
        yield context_words_to_vector(context_words, word2Ind, V), word_to_one_hot_vector(center_word, word2Ind, V)
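To see what the sliding window produces, the sketch below restates `get_windows` as an equivalent `for` loop (so the block runs on its own) and applies it to the sentence used in this notebook with a half-window of `C = 2`.

```python
def get_windows(words, C):
    """Yield (context_words, center_word) for a window of C words on each side."""
    for i in range(C, len(words) - C):
        yield words[i - C:i] + words[i + 1:i + C + 1], words[i]

words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
windows = list(get_windows(words, 2))
for context_words, center_word in windows:
    print(context_words, '->', center_word)
# ['i', 'am', 'because', 'i'] -> happy
# ['am', 'happy', 'i', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> i
```

Only words with `C` neighbours on both sides can be center words, so a 7-word sentence yields 3 training examples.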


In [4]:
words = ['i', 'am', 'happy','because', 'i', 'am', 'learning']

In [8]:
word2Ind, Ind2word = get_dict(words)


In [9]:
word2Ind, Ind2word

({'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4},
 {0: 'am', 1: 'because', 2: 'happy', 3: 'i', 4: 'learning'})

## Training data

In [10]:
training_examples = get_training_example(words, 2, word2Ind, V)

In [11]:
training_examples

<generator object get_training_example at 0x7ca35a0bceb0>

In [12]:
x_array, y_array = next(training_examples)

In [13]:
x_array

array([0.25, 0.25, 0.  , 0.5 , 0.  ])

In [14]:
y_array

array([0., 0., 1., 0., 0.])
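The first input vector can be reproduced by hand. The context of `happy` is `i, am, because, i`, and averaging the one-hot vectors amounts to taking the relative frequency of each word in the context. The sketch below makes that explicit; `x_check` is a hypothetical name used only for this check.

```python
import numpy as np
from collections import Counter

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
context_words = ['i', 'am', 'because', 'i']  # context of 'happy' for C = 2

# Relative frequency of each context word, placed at the word's index.
x_check = np.zeros(len(word2Ind))
for word, count in Counter(context_words).items():
    x_check[word2Ind[word]] = count / len(context_words)

print(x_check.tolist())  # [0.25, 0.25, 0.0, 0.5, 0.0]
```

The word `i` appears twice among the four context words, which is why its entry is 0.5.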

In [16]:
x = x_array.copy()

In [17]:
x.reshape(V,1)

array([[0.25],
       [0.25],
       [0.  ],
       [0.5 ],
       [0.  ]])

In [18]:
x.shape=(V,1)

In [19]:
x

array([[0.25],
       [0.25],
       [0.  ],
       [0.5 ],
       [0.  ]])
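Note that `reshape` returns a reshaped array and leaves the original untouched, which is why a separate step is needed to turn `x` itself into a column vector. A small illustration with a stand-in array `a` (a hypothetical name), using reassignment as an alternative to setting the `shape` attribute:

```python
import numpy as np

a = np.array([0.25, 0.25, 0.0, 0.5, 0.0])

a.reshape(5, 1)      # result is discarded, a is unchanged
print(a.shape)       # (5,)

a = a.reshape(5, 1)  # reassigning keeps the column vector
print(a.shape)       # (5, 1)
```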

In [20]:
y = y_array.copy()

In [22]:
y.shape= (V,1)

Activation functions

In [23]:
def relu(z):
    result = z.copy()
    result[result<0] = 0
    return result

def softmax(z):
    e_z = np.exp(z)
    sum_e_z = np.sum(e_z)
    return e_z / sum_e_z
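The `softmax` above works for the small scores in this notebook, but `np.exp` overflows for large inputs. A common workaround, sketched here under the hypothetical name `softmax_stable`, is to subtract the maximum score before exponentiating; the result is mathematically unchanged because softmax is invariant to adding a constant to every score.

```python
import numpy as np

def softmax_stable(z):
    """Hypothetical variant of softmax: shift by the maximum to avoid overflow."""
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

z_small = np.array([[1.0], [2.0], [3.0]])
z_large = z_small + 1000.0  # np.exp(1003.0) would overflow to inf

print(np.allclose(softmax_stable(z_small), softmax_stable(z_large)))  # True
```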

## Forward

### Hidden layer values

\begin{align}
\mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \\
\mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1})
\end{align}

In [24]:
z1 = np.dot(W1,x) + b1

In [25]:
z1

array([[ 0.36483875],
       [ 0.63710329],
       [-0.3236647 ]])

In [26]:
h = relu(z1)
h

array([[0.36483875],
       [0.63710329],
       [0.        ]])

### Output layer values

In [27]:
z2 = np.dot(W2, h)+b2
z2

array([[-0.31973737],
       [-0.28125477],
       [-0.09838369],
       [-0.33512159],
       [-0.19919612]])

\begin{align}
\mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2} \\
\mathbf{\hat{y}} &= \mathrm{softmax}(\mathbf{z_2})
\end{align}

In [29]:
y_hat = softmax(z2)
y_hat

array([[0.18519074],
       [0.19245626],
       [0.23107446],
       [0.18236353],
       [0.20891502]])
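The forward steps above can be gathered into a single function. This is only a sketch: the name `forward_prop`, the random demo parameters, and the max-shift inside the softmax are assumptions, and it handles one column vector at a time.

```python
import numpy as np

def forward_prop(x, W1, W2, b1, b2):
    """Hypothetical wrapper: forward pass for a single column vector x (V x 1)."""
    z1 = W1 @ x + b1               # hidden pre-activation, N x 1
    h = np.maximum(z1, 0)          # ReLU
    z2 = W2 @ h + b2               # output scores, V x 1
    e_z = np.exp(z2 - np.max(z2))  # softmax with a max-shift for stability
    y_hat = e_z / np.sum(e_z)
    return z1, h, z2, y_hat

rng = np.random.default_rng(0)
N_demo, V_demo = 3, 5
x_demo = np.array([[0.25], [0.25], [0.0], [0.5], [0.0]])
z1_demo, h_demo, z2_demo, y_hat_demo = forward_prop(
    x_demo,
    rng.normal(size=(N_demo, V_demo)),
    rng.normal(size=(V_demo, N_demo)),
    np.zeros((N_demo, 1)),
    np.zeros((V_demo, 1)),
)
print(y_hat_demo.shape, np.isclose(np.sum(y_hat_demo), 1.0))  # (5, 1) True
```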

In [30]:
y

array([[0.],
       [0.],
       [1.],
       [0.],
       [0.]])

### Cross-entropy loss

$$ J = -\sum\limits_{k=1}^{V} y_k \log{\hat{y}_k}$$

In [31]:
def cross_entropy_loss(y_predicted, y_actual):
    loss = np.sum(-np.log(y_predicted)*y_actual)
    return loss

In [32]:
cross_entropy_loss(y_hat, y)


1.4650152923611106
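Because `y` is one-hot, every term of the sum except one is zero, so the loss reduces to $-\log \hat{y}_k$ for the index $k$ of the true center word. The sketch below checks this with the rounded `y_hat` printed above; `full_sum`, `true_index`, and `shortcut` are hypothetical names.

```python
import numpy as np

y_hat = np.array([[0.18519074], [0.19245626], [0.23107446],
                  [0.18236353], [0.20891502]])
y = np.array([[0.0], [0.0], [1.0], [0.0], [0.0]])

full_sum = np.sum(-np.log(y_hat) * y)     # the formula as written
true_index = int(np.argmax(y))            # position of the 1 in y
shortcut = -np.log(y_hat[true_index, 0])  # only the true class contributes

print(true_index, np.isclose(full_sum, shortcut))  # 2 True
```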

### Backpropagation

The formulas we need to implement backpropagation are:


\begin{align}
 \frac{\partial J}{\partial \mathbf{W_1}} &= \mathrm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7}\\
 \frac{\partial J}{\partial \mathbf{W_2}} &= (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8}\\
 \frac{\partial J}{\partial \mathbf{b_1}} &= \mathrm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9}\\
 \frac{\partial J}{\partial \mathbf{b_2}} &= \mathbf{\hat{y}} - \mathbf{y} \tag{10}
\end{align}


Compute the partial derivative of the loss function with respect to $\mathbf{b_2}$ and store the result in `grad_b2`.


$$\frac{\partial J}{\partial \mathbf{b_2}} = \mathbf{\hat{y}} - \mathbf{y} \tag{10}$$

In [34]:
grad_b2 = y_hat - y

Compute the partial derivative of the loss function with respect to $\mathbf{W_2}$ and store it in `grad_w2`.

$$\frac{\partial J}{\partial \mathbf{W_2}} = (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \tag{8}$$


In [35]:
grad_w2 = np.dot((y_hat - y), h.T)
grad_w2

array([[ 0.06756476,  0.11798563,  0.        ],
       [ 0.0702155 ,  0.12261452,  0.        ],
       [-0.28053384, -0.48988499,  0.        ],
       [ 0.06653328,  0.1161844 ,  0.        ],
       [ 0.07622029,  0.13310045,  0.        ]])

**Now compute the partial derivative with respect to $\mathbf{b_1}$ and store the result in `grad_b1`.**

$$\frac{\partial J}{\partial \mathbf{b_1}} = \mathrm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) \tag{9}$$

In [36]:
grad_b1 = relu(np.dot(W2.T, (y_hat - y)))
grad_b1

array([[0.        ],
       [0.        ],
       [0.17045858]])

**Finally, compute the partial derivative of the loss with respect to $\mathbf{W_1}$ and store it in `grad_w1`.**

$$\frac{\partial J}{\partial \mathbf{W_1}} = \mathrm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \tag{7}$$

In [37]:
grad_w1 = np.dot(relu(np.dot(W2.T, grad_b2)), x.T)
grad_w1

array([[0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.04261464, 0.04261464, 0.        , 0.08522929, 0.        ]])
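Analytic gradients can be sanity-checked against finite differences. The sketch below is an independent check, not part of the notebook. It assumes the same `W1`, `W2`, `b1`, `b2`, `x`, and `y` as above and compares two candidates for $\partial J / \partial \mathbf{b_1}$: the ReLU form of formula (9), and the chain-rule form that multiplies $\mathbf{W_2^\top}(\mathbf{\hat{y}} - \mathbf{y})$ element-wise by the indicator $\mathbf{z_1} > 0$. The names `loss_fn`, `numerical_grad_b1`, `grad_b1_relu`, and `grad_b1_mask` are hypothetical.

```python
import numpy as np

W1 = np.array([[0.41687358, 0.08854191, -0.23495225, 0.28320538, 0.41800106],
               [0.32735501, 0.22795148, -0.23951958, 0.4117634, -0.23924344],
               [0.26637602, -0.23846886, -0.37770863, -0.11399446, 0.34008124]])
W2 = np.array([[-0.22182064, -0.43008631, 0.13310965],
               [0.08476603, 0.08123194, 0.1772054],
               [0.1871551, -0.06107263, -0.1790735],
               [0.07055222, -0.02015138, 0.36107434],
               [0.33480474, -0.39423389, -0.43959196]])
b1 = np.array([[0.09688219], [0.29239497], [-0.27364426]])
b2 = np.array([[0.0352008], [-0.36393384], [-0.12775555],
               [-0.34802326], [-0.07017815]])
x = np.array([[0.25], [0.25], [0.0], [0.5], [0.0]])
y = np.array([[0.0], [0.0], [1.0], [0.0], [0.0]])

def loss_fn(b1_value):
    """Cross-entropy loss as a function of b1, all other parameters fixed."""
    h = np.maximum(W1 @ x + b1_value, 0)
    z2 = W2 @ h + b2
    y_hat = np.exp(z2) / np.sum(np.exp(z2))
    return -np.sum(y * np.log(y_hat))

def numerical_grad_b1(eps=1e-6):
    """Central finite differences, one component of b1 at a time."""
    grad = np.zeros_like(b1)
    for i in range(b1.shape[0]):
        step = np.zeros_like(b1)
        step[i] = eps
        grad[i] = (loss_fn(b1 + step) - loss_fn(b1 - step)) / (2 * eps)
    return grad

z1 = W1 @ x + b1
z2 = W2 @ np.maximum(z1, 0) + b2
y_hat = np.exp(z2) / np.sum(np.exp(z2))
delta = W2.T @ (y_hat - y)

grad_b1_relu = np.maximum(delta, 0)  # formula (9) as stated above
grad_b1_mask = delta * (z1 > 0)      # chain rule: derivative of ReLU is 1 where z1 > 0
numerical = numerical_grad_b1()

print(np.allclose(numerical, grad_b1_mask, atol=1e-6))  # True
print(np.allclose(numerical, grad_b1_relu, atol=1e-6))  # False
```

For this example only the masked form agrees with the numerical gradient, so it may be worth using the mask when training beyond this walkthrough.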

## Gradient descent

During the gradient descent phase, you will update the weights and biases by subtracting $\alpha$ times the gradient from the original matrices and vectors, using the following formulas.


\begin{align}
 \mathbf{W_1} &:= \mathbf{W_1} - \alpha \frac{\partial J}{\partial \mathbf{W_1}} \tag{11}\\
 \mathbf{W_2} &:= \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\
 \mathbf{b_1} &:= \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\
 \mathbf{b_2} &:= \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}
\end{align}


In [38]:
alpha = 0.03

In [39]:
W1_n = W1 - alpha*grad_w1
W2_n = W2 - alpha*grad_w2
b1_n = b1 - alpha*grad_b1
b2_n = b2 - alpha*grad_b2

In [40]:
W1_n

array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
       [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
       [ 0.26509758, -0.2397473 , -0.37770863, -0.11655134,  0.34008124]])

In [41]:
W1

array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
       [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
       [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

In [42]:
W1_n - W1

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.00127844, -0.00127844,  0.        , -0.00255688,  0.        ]])
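The difference printed above is just $-\alpha$ times the gradient. The sketch below reproduces the third row of the update from the rounded values shown in this notebook; `gradient_step` is a hypothetical helper, not a function defined elsewhere in the course material.

```python
import numpy as np

def gradient_step(param, grad, alpha):
    """Hypothetical helper: one gradient descent update, returned as a new array."""
    return param - alpha * grad

alpha = 0.03
W1_row = np.array([0.26637602, -0.23846886, -0.37770863, -0.11399446, 0.34008124])
grad_row = np.array([0.04261464, 0.04261464, 0.0, 0.08522929, 0.0])

W1_row_new = gradient_step(W1_row, grad_row, alpha)
print(np.round(W1_row_new, 8))
print(np.round(W1_row_new - W1_row, 8))
```

Entries whose gradient is zero are left unchanged, which is why only three values in the row move.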

**The new values of $\mathbf{W_2}$ (stored in `W2_n`), $\mathbf{b_1}$ (in `b1_n`), and $\mathbf{b_2}$ (in `b2_n`) are computed in the same way:**

\begin{align}
 \mathbf{W_2} &:= \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \tag{12}\\
 \mathbf{b_1} &:= \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \tag{13}\\
 \mathbf{b_2} &:= \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \tag{14}
\end{align}

## Option 1: extract the embeddings from W1

In [44]:
W1_n

array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
       [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
       [ 0.26509758, -0.2397473 , -0.37770863, -0.11655134,  0.34008124]])

In [45]:
for i in range(V):
  print(Ind2word[i])

am
because
happy
i
learning


In [46]:
word2Ind

{'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
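Since $\mathbf{W_1}$ has shape $N \times V$ and multiplies a vector indexed by vocabulary position, column $j$ of $\mathbf{W_1}$ can be read as the embedding of word $j$. A minimal sketch, using the rounded `W1_n` printed above and a hypothetical dictionary `embeddings_W1`:

```python
import numpy as np

W1_n = np.array([[0.41687358, 0.08854191, -0.23495225, 0.28320538, 0.41800106],
                 [0.32735501, 0.22795148, -0.23951958, 0.4117634, -0.23924344],
                 [0.26509758, -0.2397473, -0.37770863, -0.11655134, 0.34008124]])
Ind2word = {0: 'am', 1: 'because', 2: 'happy', 3: 'i', 4: 'learning'}

# One N-dimensional vector per word: the column at the word's index.
embeddings_W1 = {Ind2word[j]: W1_n[:, j] for j in range(W1_n.shape[1])}

print(embeddings_W1['happy'].tolist())
# [-0.23495225, -0.23951958, -0.37770863]
```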

## Option 2: extract the embeddings from W2

In [47]:
W2_n

array([[-0.22384758, -0.43362588,  0.13310965],
       [ 0.08265956,  0.0775535 ,  0.1772054 ],
       [ 0.19557112, -0.04637608, -0.1790735 ],
       [ 0.06855622, -0.02363691,  0.36107434],
       [ 0.33251813, -0.3982269 , -0.43959196]])
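Here the orientation is transposed: $\mathbf{W_2}$ has shape $V \times N$, so row $i$ is the vector associated with word $i$. A minimal sketch, using the rounded `W2_n` printed above and a hypothetical dictionary `embeddings_W2`:

```python
import numpy as np

W2_n = np.array([[-0.22384758, -0.43362588, 0.13310965],
                 [0.08265956, 0.0775535, 0.1772054],
                 [0.19557112, -0.04637608, -0.1790735],
                 [0.06855622, -0.02363691, 0.36107434],
                 [0.33251813, -0.3982269, -0.43959196]])
Ind2word = {0: 'am', 1: 'because', 2: 'happy', 3: 'i', 4: 'learning'}

# One N-dimensional vector per word: the row at the word's index.
embeddings_W2 = {Ind2word[i]: W2_n[i, :] for i in range(W2_n.shape[0])}

print(embeddings_W2['happy'].tolist())
# [0.19557112, -0.04637608, -0.1790735]
```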

## Option 3: extract the embeddings from W1 and W2

In [49]:
w3 = (W1 + W2.T)/3

In [50]:
w3

array([[ 0.06501765,  0.05776931, -0.01593238,  0.1179192 ,  0.25093527],
       [-0.03424377,  0.10306114, -0.1001974 ,  0.13053734, -0.21115911],
       [ 0.13316189, -0.02042115, -0.18559404,  0.08235996, -0.03317024]])
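If the intent is the element-wise mean of the two representations, the divisor would be 2. Dividing by 3 rescales every vector by the same factor, so directions and cosine similarities are unaffected. The sketch below shows the mean on toy matrices; `average_embeddings`, `W1_toy`, and `W2_toy` are hypothetical names and values chosen so the result is easy to verify by hand.

```python
import numpy as np

def average_embeddings(W1, W2):
    """Hypothetical helper: element-wise mean of W1 (N x V) and W2.T (N x V)."""
    return (W1 + W2.T) / 2

W1_toy = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])      # N = 3, V = 2
W2_toy = np.array([[1.0, 1.0, 1.0],
                   [0.0, 2.0, 4.0]])  # V = 2, N = 3

avg = average_embeddings(W1_toy, W2_toy)
print(avg.tolist())  # [[1.0, 1.0], [2.0, 3.0], [3.0, 5.0]]
```

As in option 1, each column of the result is the embedding of the word at that index.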

In [None]:
def gradient_descent(xtrain, ytrain, N, V, numiter, alpha):
    raise NotImplementedError("training loop not implemented in this notebook")
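A training loop of this kind would repeat the forward pass, the loss, backpropagation, and the parameter update for a number of iterations. The following is a sketch only, under several assumptions that do not come from the notebook: the name `train_cbow`, full-batch updates with one example per column, random normal initialization, gradients averaged over the batch, and the ReLU derivative handled with the mask $\mathbf{z_1} > 0$ rather than formulas (7) and (9). The data are the three training examples produced by the sentence used above with `C = 2`.

```python
import numpy as np

def train_cbow(X, Y, N, V, num_iters, alpha, seed=0):
    """Hypothetical full-batch trainer; X and Y are V x m, one example per column."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(N, V))
    W2 = rng.normal(scale=0.1, size=(V, N))
    b1 = np.zeros((N, 1))
    b2 = np.zeros((V, 1))
    m = X.shape[1]
    losses = []
    for _ in range(num_iters):
        # Forward propagation
        z1 = W1 @ X + b1
        h = np.maximum(z1, 0)
        z2 = W2 @ h + b2
        e_z = np.exp(z2 - z2.max(axis=0, keepdims=True))
        y_hat = e_z / e_z.sum(axis=0, keepdims=True)
        # Mean cross-entropy loss over the batch
        losses.append(-np.sum(Y * np.log(y_hat)) / m)
        # Backpropagation (assumption: ReLU derivative as the mask z1 > 0)
        d2 = (y_hat - Y) / m
        d1 = (W2.T @ d2) * (z1 > 0)
        # Gradient descent updates
        W2 -= alpha * (d2 @ h.T)
        b2 -= alpha * d2.sum(axis=1, keepdims=True)
        W1 -= alpha * (d1 @ X.T)
        b1 -= alpha * d1.sum(axis=1, keepdims=True)
    return W1, W2, b1, b2, losses

# Columns: contexts of 'happy', 'because', 'i' (vocabulary order: am, because, happy, i, learning)
X = np.array([[0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25],
              [0.00, 0.25, 0.25],
              [0.50, 0.25, 0.00],
              [0.00, 0.00, 0.25]])
Y = np.array([[0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

W1_t, W2_t, b1_t, b2_t, losses = train_cbow(X, Y, N=3, V=5, num_iters=200, alpha=0.1)
print(losses[-1] < losses[0])  # True
```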

