# **Understanding CBOW (*continuous bag of words*) Word Embeddings Step-by-step**

In this notebook, we'll try out a small example to create word embeddings.

There are two main parts: data preparation, and the continuous bag-of-words (CBOW) model.

To get started, import and initialize all the libraries we will need.

In [1]:
import sys
!{sys.executable} -m pip install emoji

Collecting emoji
  Downloading https://files.pythonhosted.org/packages/40/8d/521be7f0091fe0f2ae690cc044faf43e3445e0ff33c574eae752dd7e39fa/emoji-0.5.4.tar.gz (43kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-0.5.4-cp36-none-any.whl size=42176 sha256=2b6487fb4342cb2cb56d56c710426057114e6ec5c4755ec8a175786b80ff72c3
  Stored in directory: /root/.cache/pip/wheels/2a/a9/0a/4f8e8cce8074232aba240caca3fade315bb49fac68808d1a9c
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.5.4


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
! cp '/content/drive/My Drive/Colab Notebooks/NLP-with-probabilistic-models/cbow-word-embeddings/utils2.py' '/content'

In [4]:
import re
import nltk
from nltk.tokenize import word_tokenize
import emoji
import numpy as np

from utils2 import get_dict

# download pre-trained Punkt tokenizer for English
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## **Data preparation**

In the data preparation phase, we will start with a corpus of text, and:

- Clean and tokenize the corpus.

- Extract the pairs of context words and center word that will make up the training data set for the CBOW model. The context words are the features that will be fed into the model, and the center words are the target values that the model will learn to predict.

- Create simple vector representations of the context words (features) and center words (targets) that can be used by the neural network of the CBOW model.

#### **Cleaning and tokenization**

To understand the cleaning and tokenization process, we will consider a small example (corpus) that contains emojis and various punctuation marks.

In [5]:
corpus = 'Who ❤️ "word embeddings" in 2020? I do!!!'

First, we will replace all interrupting punctuation marks (such as commas and exclamation marks) with periods.

In [6]:
print(f'Corpus:  {corpus}')
data = re.sub(r'[,!?;-]+', '.', corpus)
print(f'After cleaning punctuation:  {data}')

Corpus:  Who ❤️ "word embeddings" in 2020? I do!!!
After cleaning punctuation:  Who ❤️ "word embeddings" in 2020. I do.
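
Two side effects of this pattern are worth keeping in mind: a run of several punctuation marks collapses into a single period, and because the character class contains the hyphen, hyphenated words are split by periods as well. The sketch below illustrates this on a made-up sample string (an assumption for illustration, not part of the notebook's corpus).

```python
import re

# Hypothetical sample string, chosen only to illustrate the pattern's behavior.
sample = 'continuous-bag-of-words, yes!!!'

# Same pattern as above: one or more of , ! ? ; - are replaced by a single period.
cleaned = re.sub(r'[,!?;-]+', '.', sample)
print(cleaned)  # continuous.bag.of.words. yes.
```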


Next, we will use NLTK's tokenization engine to split the corpus into individual tokens.

In [7]:
print(f'Initial string:  {data}')
data = nltk.word_tokenize(data)
print(f'After tokenization:  {data}')

Initial string:  Who ❤️ "word embeddings" in 2020. I do.
After tokenization:  ['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']


Finally, we will get rid of numbers and punctuation other than periods, and convert all the remaining tokens to lowercase.

In [8]:
print(f'Initial list of tokens:  {data}')
data = [ ch.lower() for ch in data
         if ch.isalpha()
         or ch == '.'
         or emoji.get_emoji_regexp().search(ch)
       ]
print(f'After cleaning:  {data}')

Initial list of tokens:  ['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']
After cleaning:  ['who', '❤️', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']


Note that the heart emoji is considered as a token just like any normal word.

Now let's streamline the cleaning and tokenization process by wrapping the previous steps in a function.

In [9]:
def tokenize(corpus):
    '''
    Returns the cleaned and tokenized list of input data.
    '''
    data = re.sub(r'[,!?;-]+', '.', corpus)
    data = nltk.word_tokenize(data)  # tokenize string to words
    data = [ ch.lower() for ch in data
             if ch.isalpha()
             or ch == '.'
             or emoji.get_emoji_regexp().search(ch)
           ]
    return data

Let's apply this function to the corpus that we'll be working on in the rest of this notebook: "I am happy because I am learning".

In [76]:
corpus = 'I am happy because I am learning'
print(f'Corpus:  {corpus}')
words = tokenize(corpus)
print(f'Words (tokens):  {words}')

Corpus:  I am happy because I am learning
Words (tokens):  ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']


#### **Sliding window of words**

Now that we have transformed the corpus into a list of clean tokens, we can slide a window of words across this list. For each window we can extract a center word and the context words.

The `get_windows` function in the next cell helps us with that.

In [12]:
def get_windows(words, C):
    i = C
    while i < len(words) - C:
        center_word = words[i]
        context_words = words[(i - C):i] + words[(i+1):(i+C+1)]
        yield context_words, center_word
        i += 1

The first argument of this function is a list of words (or tokens). The second argument, `C`, is the context half-size. For a given center word, the context words are made of `C` words to the left and `C` words to the right of the center word.

Below is an illustration of how we can use this function to extract context words and center words from a list of tokens. These context and center words will make up the training set that we will use to train the CBOW model.

In [77]:
for x, y in get_windows(
            ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'],
            2
        ):
    print(f'{x}\t{y}')

['i', 'am', 'because', 'i']	happy
['am', 'happy', 'i', 'am']	because
['happy', 'because', 'am', 'learning']	i


The first example of the training set is made of:

- the context words "i", "am", "because", "i",

- and the center word to be predicted: "happy".


In [78]:
# let's try and change the context half-size
for x, y in get_windows(tokenize("I am happy because I am learning"), 1):
    print(f'{x}\t{y}')

['i', 'happy']	am
['am', 'because']	happy
['happy', 'i']	because
['because', 'am']	i
['i', 'learning']	am


#### **Transforming words into vectors for the training set**

To finish preparing the training set, we need to transform the context words and center words into vectors.

#### Mapping words to indices and indices to words

The center words will be represented as one-hot vectors, and the vectors that represent context words are also based on one-hot vectors.

To create one-hot word vectors, we can start by mapping each unique word to a unique integer (or index). We will use a helper function, `get_dict` from **utils2.py**, that returns two Python dictionaries: one that maps words to integers, and one that maps integers back to words.

In [79]:
word2Ind, Ind2word = get_dict(words)

Here's the dictionary that maps words to numeric indices.

In [80]:
word2Ind

{'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}

We can use this dictionary to get the index of a word.

In [81]:
print("Index of the word 'i':  ",word2Ind['i'])

Index of the word 'i':   3


And conversely, here's the dictionary that maps indices to words.

In [82]:
Ind2word

{0: 'am', 1: 'because', 2: 'happy', 3: 'i', 4: 'learning'}

In [83]:
print("Word which has index 2:  ",Ind2word[2] )

Word which has index 2:   happy


Finally, in order to get the size of the vocabulary of our corpus (the number of different words making up the corpus), we can simply get the length of either of these dictionaries.

In [84]:
V = len(word2Ind)
print("Size of vocabulary: ", V)

Size of vocabulary:  5
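
The source of `get_dict` in **utils2.py** is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below uses a hypothetical stand-in, `build_vocab`, that sorts the unique tokens and numbers them. This is an assumption, although it does reproduce the mappings printed above for this corpus.

```python
def build_vocab(tokens):
    # Hypothetical stand-in for get_dict: sort the unique tokens, then number them.
    vocab = sorted(set(tokens))
    word_to_index = {word: index for index, word in enumerate(vocab)}
    index_to_word = {index: word for index, word in enumerate(vocab)}
    return word_to_index, index_to_word

tokens = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
word_to_index, index_to_word = build_vocab(tokens)

print(word_to_index)
print(index_to_word)
print('Size of vocabulary:', len(word_to_index))
```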


#### Getting one-hot word vectors

We can easily convert an integer, $n$, into a one-hot vector by assigning **1** to the element at index $n$ and leaving all the other elements at **0**.

Consider the word "happy". First, retrieve its numeric index.

In [85]:
n = word2Ind['happy']
n

2

Now create a vector with the size of the vocabulary, and fill it with zeros.

In [86]:
center_word_vector = np.zeros(V)
center_word_vector

array([0., 0., 0., 0., 0.])

We can confirm that the vector has the right size.

In [87]:
len(center_word_vector) == V

True

Next, replace the 0 of the $n$-th element with a 1.

In [88]:
center_word_vector[n] = 1

And we have our one-hot word vector.

In [89]:
center_word_vector

array([0., 0., 1., 0., 0.])

We will now group all of these steps in a convenient function, which takes as parameters: a word to be encoded, a dictionary that maps words to indices, and the size of the vocabulary.

In [26]:
def word_to_one_hot_vector(word, word2Ind, V):
    one_hot_vector = np.zeros(V)
    one_hot_vector[word2Ind[word]] = 1
    
    return one_hot_vector

Let's check that it works as intended.

In [90]:
word_to_one_hot_vector('happy', word2Ind, V)

array([0., 0., 1., 0., 0.])
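
As an alternative sketch, the same one-hot vector can be read off as a row of the identity matrix. The index and vocabulary size are hard-coded here from the values shown above so that the cell runs on its own.

```python
import numpy as np

V = 5               # vocabulary size, as computed above
index_of_happy = 2  # index of 'happy' in the word-to-index mapping shown above

# Row n of the V x V identity matrix is the one-hot vector for index n.
identity = np.eye(V)
one_hot = identity[index_of_happy]
print(one_hot)
```

This builds a full $V \times V$ matrix, so it is only practical for small vocabularies like this one.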

#### Getting context word vectors

To create the vectors that represent context words, we will simply calculate the average of the one-hot vectors representing the individual words.

Let's start with a list of context words.

In [91]:
context_words = ['i', 'am', 'because', 'i']

Using Python's list comprehension construct and the `word_to_one_hot_vector` function that we created above, we can create a list of one-hot vectors representing each of the context words.

In [92]:
context_words_vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]
context_words_vectors

[array([0., 0., 0., 1., 0.]),
 array([1., 0., 0., 0., 0.]),
 array([0., 1., 0., 0., 0.]),
 array([0., 0., 0., 1., 0.])]

And we can now simply get the average of these vectors using numpy's `mean` function, to get the vector representation of the context words.

In [93]:
np.mean(context_words_vectors, axis=0)

array([0.25, 0.25, 0.  , 0.5 , 0.  ])

Now, we will create a `context_words_to_vector` function that takes in a list of context words, a word-to-index dictionary, and a vocabulary size, and outputs the vector representation of the context words.

In [31]:
def context_words_to_vector(context_words, word2Ind, V):
    context_words_vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]
    context_words_vectors = np.mean(context_words_vectors, axis=0)
    
    return context_words_vectors

Let's check that we obtain the same output as the manual approach above.

In [94]:
context_words_to_vector(['i', 'am', 'because', 'i'], word2Ind, V)

array([0.25, 0.25, 0.  , 0.5 , 0.  ])
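
Averaging one-hot vectors amounts to counting how often each word appears in the context and dividing by the number of context words. The sketch below computes the same vector that way; the mapping is copied from the output above, and `collections.Counter` is a swapped-in technique rather than what the notebook's function does.

```python
from collections import Counter

import numpy as np

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2Ind)
context_words = ['i', 'am', 'because', 'i']

# Count occurrences of each context word, then normalize by the context length.
counts = Counter(context_words)
context_vector = np.zeros(V)
for word, count in counts.items():
    context_vector[word2Ind[word]] = count / len(context_words)

print(context_vector)
```

The word "i" appears twice among four context words, which is where the 0.5 at index 3 comes from.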

## **Building the training set**

We can now combine the functions that we created in the previous sections, to build a training set for the CBOW model, starting from the following tokenized corpus.

In [95]:
words

['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

To do this, we need to use the sliding window function (`get_windows`) to extract the context words and center words, and then convert these sets of words into a basic vector representation using `word_to_one_hot_vector` and `context_words_to_vector`.

In [96]:
for context_words, center_word in get_windows(words, 2):
    print(f'Context words:  {context_words} -> {context_words_to_vector(context_words, word2Ind, V)}')
    print(f'Center word:  {center_word} -> {word_to_one_hot_vector(center_word, word2Ind, V)}')
    print()

Context words:  ['i', 'am', 'because', 'i'] -> [0.25 0.25 0.   0.5  0.  ]
Center word:  happy -> [0. 0. 1. 0. 0.]

Context words:  ['am', 'happy', 'i', 'am'] -> [0.5  0.   0.25 0.25 0.  ]
Center word:  because -> [0. 1. 0. 0. 0.]

Context words:  ['happy', 'because', 'am', 'learning'] -> [0.25 0.25 0.25 0.   0.25]
Center word:  i -> [0. 0. 0. 1. 0.]



Here, we have simply printed each training example in turn. In a real-life scenario, we would train the CBOW model over several iterations, using batches of examples.
Below is a glimpse of how we would use a Python generator function to make it easier to iterate over a set of examples.

In [97]:
def get_training_example(words, C, word2Ind, V):
    for context_words, center_word in get_windows(words, C):
        yield context_words_to_vector(context_words, word2Ind, V), word_to_one_hot_vector(center_word, word2Ind, V)

The output of this function can be iterated on to get successive context word vectors and center word vectors, as demonstrated in the next cell.

In [98]:
for context_words_vector, center_word_vector in get_training_example(words, 2, word2Ind, V):
    print(f'Context words vector:  {context_words_vector}')
    print(f'Center word vector:  {center_word_vector}')
    print()

Context words vector:  [0.25 0.25 0.   0.5  0.  ]
Center word vector:  [0. 0. 1. 0. 0.]

Context words vector:  [0.5  0.   0.25 0.25 0.  ]
Center word vector:  [0. 1. 0. 0. 0.]

Context words vector:  [0.25 0.25 0.25 0.   0.25]
Center word vector:  [0. 0. 0. 1. 0.]
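
To give an idea of what a batch could look like, the sketch below stacks the three examples into two matrices with one column per example. The column-per-example layout and the names `X` and `Y` are assumptions made for illustration, not something this notebook defines.

```python
import numpy as np

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
V = len(word2Ind)
words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']
C = 2

def one_hot(word):
    # One-hot vector of length V for the given word.
    vector = np.zeros(V)
    vector[word2Ind[word]] = 1
    return vector

X_columns, Y_columns = [], []
for i in range(C, len(words) - C):
    context = words[i - C:i] + words[i + 1:i + C + 1]
    X_columns.append(np.mean([one_hot(w) for w in context], axis=0))
    Y_columns.append(one_hot(words[i]))

# Stack the per-example vectors as columns: both matrices are V x m, with m examples.
X = np.stack(X_columns, axis=1)
Y = np.stack(Y_columns, axis=1)
print(X.shape, Y.shape)
```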



Our training set is ready!

We can now move on to the CBOW model itself.

## **The continuous bag-of-words model**

The CBOW model is based on a neural network, the architecture of which looks like the figure below:

<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://angqpdfr.coursera-apps.org/notebooks/Week4/cbow_model_architecture.png?1' alt="alternate text" width="width" height="height" style="width:917;height:337;" /> </div>

Next up, we will see:

- The two activation functions used in the neural network we will design.

- Forward propagation.

- Cross-entropy loss.

- Backpropagation.

- Gradient descent.

- Extracting the word embedding vectors from the weight matrices once the neural network has been trained.

#### **Activation functions**

Let's start by implementing the activation functions, ReLU and softmax.

#### ReLU

ReLU is used to calculate the values of the hidden layer, in the following formulas:

\begin{align}
 \mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1} \\
 \mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1}) \\
\end{align}


Let's fix a value for $\mathbf{z_1}$ as a working example.

In [37]:
np.random.seed(10)
z_1 = 10*np.random.rand(5, 1)-5
z_1

array([[ 2.71320643],
       [-4.79248051],
       [ 1.33648235],
       [ 2.48803883],
       [-0.01492988]])

To get the ReLU of this vector, we want all the negative values to become zeros.

First we will create a copy of this vector.

In [38]:
h = z_1.copy()

Now we will determine which of its values are negative.

In [39]:
h < 0

array([[False],
       [ True],
       [False],
       [False],
       [ True]])

We can now simply set all of the values which are negative to 0.

In [None]:
h[h < 0] = 0

And that's it, we have the ReLU of $\mathbf{z_1}$!

In [40]:
h

array([[2.71320643],
       [0.        ],
       [1.33648235],
       [2.48803883],
       [0.        ]])

Now we will implement ReLU as a function.

In [41]:
def relu(z):
    result = z.copy()
    result[result < 0] = 0
    
    return result

In [42]:
z = np.array([[-1.25459881], [ 4.50714306], [ 2.31993942], [ 0.98658484], [-3.4398136 ]])
relu(z)

array([[0.        ],
       [4.50714306],
       [2.31993942],
       [0.98658484],
       [0.        ]])
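
An equivalent way to write ReLU is with numpy's element-wise `np.maximum`, which returns a new array and so avoids the explicit copy. This is offered as an alternative sketch; the name `relu_maximum` is introduced here only to avoid clashing with the function above.

```python
import numpy as np

def relu_maximum(z):
    # Element-wise maximum of z and 0; the input array is left unmodified.
    return np.maximum(z, 0)

z = np.array([[-1.25459881], [4.50714306], [2.31993942], [0.98658484], [-3.4398136]])
h = relu_maximum(z)
print(h)
```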

#### Softmax

The second activation function that we will be using is softmax. This function is used to calculate the values of the output layer of the neural network, using the following formulas:

\begin{align}
 \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2}  \\
 \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2}) \\
\end{align}

To calculate softmax of a vector $\mathbf{z}$, the $i$-th component of the resulting vector is given by:

$$ \textrm{softmax}(\textbf{z})_i = \frac{e^{z_i} }{\sum\limits_{j=1}^{V} e^{z_j} }   $$

Let's work through an example.

In [43]:
z = np.array([9, 8, 11, 10, 8.5])
z

array([ 9. ,  8. , 11. , 10. ,  8.5])

We'll need to calculate the exponentials of each element, both for the numerator and for the denominator.

In [44]:
e_z = np.exp(z)
e_z

array([ 8103.08392758,  2980.95798704, 59874.1417152 , 22026.46579481,
        4914.7688403 ])

The denominator is equal to the sum of these exponentials.

In [45]:
sum_e_z = np.sum(e_z)
sum_e_z

97899.41826492078

And the value of the first element of $\textrm{softmax}(\textbf{z})$ is given by:

In [46]:
e_z[0]/sum_e_z

0.08276947985173956

This is for one element. We will use numpy's vectorized operations to calculate the values of all the elements of the $\textrm{softmax}(\textbf{z})$ vector in one go.

Let's implement the softmax function.

In [47]:
def softmax(z):
    e_z = np.exp(z)
    sum_e_z = np.sum(e_z)
    
    return e_z / sum_e_z

In [48]:
softmax([9, 8, 11, 10, 8.5])

array([0.08276948, 0.03044919, 0.61158833, 0.22499077, 0.05020223])
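
One practical caveat is that `np.exp` overflows for large inputs. A common remedy, sketched below as an optional variant rather than something this notebook uses, is to subtract the maximum of $\mathbf{z}$ before exponentiating. The result is unchanged because the factor $e^{-\max(\mathbf{z})}$ cancels between numerator and denominator. The name `softmax_stable` is hypothetical.

```python
import numpy as np

def softmax_stable(z):
    # Shift by the maximum so the largest exponent is exp(0) = 1, which cannot overflow.
    z = np.asarray(z, dtype=float)
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

probabilities = softmax_stable([9, 8, 11, 10, 8.5])
print(probabilities)

# Inputs this large would overflow in the plain version, but stay finite here.
large = softmax_stable([1000.0, 1001.0])
print(large)
```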

### Dimensions: 1-D arrays vs 2-D column vectors

Before moving on to implement forward propagation, backpropagation, and gradient descent, let's have a look at the dimensions of the vectors we've been handling until now.

We will start with creating a vector of length $V$ filled with zeros.

In [99]:
x_array = np.zeros(V)
x_array

array([0., 0., 0., 0., 0.])

This is a 1-dimensional array, as revealed by the `.shape` property of the array.

In [100]:
x_array.shape

(5,)

To perform matrix multiplication in the next steps, we actually need our column vectors to be represented as a matrix with one column. In numpy, this matrix is represented as a 2-dimensional array.

One straightforward way to convert a 1D vector to a 2D column matrix is to set its `.shape` property to the number of rows and one column, as shown in the next cell.

In [101]:
x_column_vector = x_array.copy()
x_column_vector.shape = (V, 1)  # alternatively ... = (x_array.shape[0], 1)
x_column_vector

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.]])

The shape of the resulting "vector" is:

In [102]:
x_column_vector.shape

(5, 1)

So now we have a 5x1 matrix that we can use to perform standard matrix multiplication.
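
As an alternative sketch, the same conversion can be done without assigning to `.shape`, using `reshape` or `np.newaxis`. Both leave the original 1D array untouched.

```python
import numpy as np

V = 5
x_array = np.zeros(V)

# -1 asks numpy to infer the number of rows from the array's size.
x_column_vector = x_array.reshape(-1, 1)

# Indexing with np.newaxis adds a second axis of length 1.
x_also_column = x_array[:, np.newaxis]

print(x_array.shape, x_column_vector.shape, x_also_column.shape)
```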

#### **Forward propagation**

Let's dive into the neural network itself, which is shown below with all the dimensions and formulas we need.

<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://angqpdfr.coursera-apps.org/notebooks/Week4/cbow_model_dimensions_single_input.png?2' alt="alternate text" width="width" height="height" style="width:839;height:349;" /></div>

We will set $N$ equal to 3. $N$ is a hyperparameter of the CBOW model that represents the size of the word embedding vectors, as well as the size of the hidden layer.

In [53]:
N = 3

### Initialization of the weights and biases

Before we start training the neural network, we need to initialize the weight matrices and bias vectors with random values.

In [103]:
W1 = np.array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
               [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
               [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

W2 = np.array([[-0.22182064, -0.43008631,  0.13310965],
               [ 0.08476603,  0.08123194,  0.1772054 ],
               [ 0.1871551 , -0.06107263, -0.1790735 ],
               [ 0.07055222, -0.02015138,  0.36107434],
               [ 0.33480474, -0.39423389, -0.43959196]])

b1 = np.array([[ 0.09688219],
               [ 0.29239497],
               [-0.27364426]])

b2 = np.array([[ 0.0352008 ],
               [-0.36393384],
               [-0.12775555],
               [-0.34802326],
               [-0.07017815]])

Let's check if the dimensions of these matrices match those shown in the figure above.

In [104]:
print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')
print(f'size of W1: {W1.shape} (NxV)')
print(f'size of b1: {b1.shape} (Nx1)')
print(f'size of W2: {W2.shape} (VxN)')
print(f'size of b2: {b2.shape} (Vx1)')

V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of W1: (3, 5) (NxV)
size of b1: (3, 1) (Nx1)
size of W2: (5, 3) (VxN)
size of b2: (5, 1) (Vx1)
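
The matrices above are hard-coded so that the results are reproducible. As a sketch of how such values might be generated, the hypothetical helper below draws them uniformly from $[-0.5, 0.5)$ with a fixed seed. Both the helper name `initialize_model` and the sampling range are assumptions; the notebook does not say how its values were produced.

```python
import numpy as np

def initialize_model(N, V, random_seed=1):
    # Hypothetical helper: the uniform range [-0.5, 0.5) is an assumption.
    rng = np.random.default_rng(random_seed)
    W1 = rng.uniform(-0.5, 0.5, size=(N, V))  # hidden-layer weights, N x V
    W2 = rng.uniform(-0.5, 0.5, size=(V, N))  # output-layer weights, V x N
    b1 = rng.uniform(-0.5, 0.5, size=(N, 1))  # hidden-layer bias, N x 1
    b2 = rng.uniform(-0.5, 0.5, size=(V, 1))  # output-layer bias, V x 1
    return W1, W2, b1, b2

W1_init, W2_init, b1_init, b2_init = initialize_model(N=3, V=5)
print(W1_init.shape, W2_init.shape, b1_init.shape, b2_init.shape)
```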


### Training example

Our training example is made of the vector representing the context words "i am because i", and the target which is the one-hot vector representing the center word "happy".

In [105]:
training_examples = get_training_example(words, 2, word2Ind, V)

> `get_training_example`, which uses the `yield` keyword, is known as a generator function. When called, it builds an iterator, which is a special type of object that we can iterate on (using a `for` loop for instance), to retrieve the successive values that the function generates.
>
> In this case `get_training_example` `yield`s training examples, and iterating on `training_examples` will return the successive training examples.

In [106]:
x_array, y_array = next(training_examples)

> `next` is a built-in function, which gets the next available value from an iterator. Here, we'll get the very first value, which is the first training example. If we run this cell again, we'll get the next value, and so on until the iterator runs out of values to return.


The vector representing the context words, which will be fed into the neural network, is:

In [107]:
x_array

array([0.25, 0.25, 0.  , 0.5 , 0.  ])

The one-hot vector representing the center word to be predicted is:

In [108]:
y_array

array([0., 0., 1., 0., 0.])

Now we will convert these vectors into matrices (or 2D arrays) to be able to perform matrix multiplication on the right types of objects.

In [109]:
x = x_array.copy()
x.shape = (V, 1)
print('x')
print(x)
print()

y = y_array.copy()
y.shape = (V, 1)
print('y')
print(y)

x
[[0.25]
 [0.25]
 [0.  ]
 [0.5 ]
 [0.  ]]

y
[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]


### Values of the hidden layer

Now that we have initialized all the variables that we need for forward propagation, we can calculate the values of the hidden layer using the following formulas:

\begin{align}
 \mathbf{z_1} &= \mathbf{W_1}\mathbf{x} + \mathbf{b_1}  \\
 \mathbf{h} &= \mathrm{ReLU}(\mathbf{z_1})  \\
\end{align}

First, we can calculate the value of $\mathbf{z_1}$.

In [110]:
z1 = np.dot(W1, x) + b1

> `np.dot` is numpy's function for matrix multiplication.

As expected we get an $N$ by 1 matrix, or column vector with $N$ elements, where $N$ is equal to the embedding size, which is 3 in this example.

In [111]:
z1

array([[ 0.36483875],
       [ 0.63710329],
       [-0.3236647 ]])

We can now take the ReLU of $\mathbf{z_1}$ to get $\mathbf{h}$, the vector with the values of the hidden layer.

In [112]:
h = relu(z1)
h

array([[0.36483875],
       [0.63710329],
       [0.        ]])

Applying ReLU means that the negative element of $\mathbf{z_1}$ has been replaced with a zero.

### Values of the output layer

Here are the formulas we need to calculate the values of the output layer, represented by the vector $\mathbf{\hat y}$:

\begin{align}
 \mathbf{z_2} &= \mathbf{W_2}\mathbf{h} + \mathbf{b_2}   \\
 \mathbf{\hat y} &= \mathrm{softmax}(\mathbf{z_2})   \\
\end{align}

First, we will calculate $\mathbf{z_2}$.

In [113]:
z2 = np.dot(W2, h) + b2
z2

array([[-0.31973737],
       [-0.28125477],
       [-0.09838369],
       [-0.33512159],
       [-0.19919612]])

This is a $V$ by 1 matrix, where $V$ is the size of the vocabulary, which is 5 in this example.

Now we will calculate the value of $\mathbf{\hat y}$.

In [114]:
y_hat = softmax(z2)
y_hat

array([[0.18519074],
       [0.19245626],
       [0.23107446],
       [0.18236353],
       [0.20891502]])

As we've performed the calculations with random matrices and vectors (apart from the input vector), the output of the neural network is essentially random at this point. The learning process will adjust the weights and biases to match the actual targets better.
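
The forward pass can be gathered into a single function. The sketch below does so under a hypothetical name, `forward_prop`, and repeats the weights, biases, and input from the cells above so that it runs on its own.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e_z = np.exp(z)
    return e_z / np.sum(e_z)

def forward_prop(x, W1, W2, b1, b2):
    # Hypothetical wrapper around the two steps above: hidden layer, then output layer.
    h = relu(np.dot(W1, x) + b1)
    y_hat = softmax(np.dot(W2, h) + b2)
    return y_hat, h

# Values copied from the cells above.
W1 = np.array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
               [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
               [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])
W2 = np.array([[-0.22182064, -0.43008631,  0.13310965],
               [ 0.08476603,  0.08123194,  0.1772054 ],
               [ 0.1871551 , -0.06107263, -0.1790735 ],
               [ 0.07055222, -0.02015138,  0.36107434],
               [ 0.33480474, -0.39423389, -0.43959196]])
b1 = np.array([[0.09688219], [0.29239497], [-0.27364426]])
b2 = np.array([[0.0352008], [-0.36393384], [-0.12775555], [-0.34802326], [-0.07017815]])
x = np.array([[0.25], [0.25], [0.0], [0.5], [0.0]])

y_hat, h = forward_prop(x, W1, W2, b1, b2)
print(y_hat)
```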

#### **Cross-entropy loss**

Now that we have the network's prediction, we can calculate the cross-entropy loss to determine how accurate the prediction was compared to the actual target.

> As we are working on a single training example, and not on a batch of examples, we are using *loss* and not *cost*, which is the generalized form of loss.

Remember that the prediction of the network is:

In [115]:
y_hat

array([[0.18519074],
       [0.19245626],
       [0.23107446],
       [0.18236353],
       [0.20891502]])

And the actual target value is:

In [116]:
y

array([[0.],
       [0.],
       [1.],
       [0.],
       [0.]])

The formula for cross-entropy loss is:

$$ J=-\sum\limits_{k=1}^{V}y_k\log{\hat{y}_k} $$

We will now implement the cross-entropy loss function.

In [117]:
def cross_entropy_loss(y_predicted, y_actual):
    # use the function's arguments rather than the global y_hat and y
    loss = np.sum(-np.log(y_predicted) * y_actual)
    
    return loss

Now let's use this function to calculate the loss with the actual values of $\mathbf{y}$ and $\mathbf{\hat y}$.

In [118]:
cross_entropy_loss(y_hat, y)

1.4650152923611106

This value is close to $\ln 5 \approx 1.61$, the loss that a uniform guess over the 5 words of the vocabulary would give. This is expected, as the neural network hasn't learned anything yet.
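
Because $\mathbf{y}$ is a one-hot vector, every term of the sum vanishes except the one for the target word, so the loss reduces to $-\log \hat{y}_k$ where $k$ is the index of the center word. The sketch below checks this against the value above, using the predicted probability for "happy" copied from the printed `y_hat` (rounded to 8 decimals, so the match is approximate).

```python
import math

# Predicted probability of the target word 'happy' (index 2), copied from y_hat above.
y_hat_target = 0.23107446

# With a one-hot target, cross-entropy is minus the log of the target's probability.
loss = -math.log(y_hat_target)
print(loss)
```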

The actual learning will start during the next phase: backpropagation.

#### **Backpropagation**

The formulas that we will implement for backpropagation are the following:

\begin{align}
 \frac{\partial J}{\partial \mathbf{W_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top \\
 \frac{\partial J}{\partial \mathbf{W_2}} &= (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} \\
 \frac{\partial J}{\partial \mathbf{b_1}} &= \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\\
 \frac{\partial J}{\partial \mathbf{b_2}} &= \mathbf{\hat{y}} - \mathbf{y} 
\end{align}


Let's start with the easiest one.

We will calculate the partial derivative of the loss function with respect to $\mathbf{b_2}$, and store the result in `grad_b2`.

$$\frac{\partial J}{\partial \mathbf{b_2}} = \mathbf{\hat{y}} - \mathbf{y} $$

In [119]:
grad_b2 = y_hat - y
grad_b2

array([[ 0.18519074],
       [ 0.19245626],
       [-0.76892554],
       [ 0.18236353],
       [ 0.20891502]])

Next, we will calculate the partial derivative of the loss function with respect to $\mathbf{W_2}$, and store the result in `grad_W2`.

$$\frac{\partial J}{\partial \mathbf{W_2}} = (\mathbf{\hat{y}} - \mathbf{y})\mathbf{h^\top} $$

In [120]:
grad_W2 = np.dot(y_hat - y, h.T)
grad_W2

array([[ 0.06756476,  0.11798563,  0.        ],
       [ 0.0702155 ,  0.12261452,  0.        ],
       [-0.28053384, -0.48988499,  0.        ],
       [ 0.06653328,  0.1161844 ,  0.        ],
       [ 0.07622029,  0.13310045,  0.        ]])
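
The two output-layer gradients computed so far can be sanity-checked numerically by nudging each parameter by a small amount and measuring the change in the loss (central differences). The sketch below does this on small random values rather than the notebook's matrices. The helper name `loss_fn`, the seed, and the step size `eps` are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e_z = np.exp(z - np.max(z))
    return e_z / np.sum(e_z)

def loss_fn(W2, b2, h, y):
    # Cross-entropy loss of the output layer for a fixed hidden vector h.
    y_hat = softmax(np.dot(W2, h) + b2)
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(0)
V, N = 5, 3
W2 = rng.uniform(-0.5, 0.5, size=(V, N))
b2 = rng.uniform(-0.5, 0.5, size=(V, 1))
h = rng.uniform(0.0, 1.0, size=(N, 1))
y = np.zeros((V, 1))
y[2] = 1

# Analytic gradients, using the formulas above.
y_hat = softmax(np.dot(W2, h) + b2)
grad_b2 = y_hat - y
grad_W2 = np.dot(y_hat - y, h.T)

# Numerical gradients by central differences.
eps = 1e-6
numeric_b2 = np.zeros_like(b2)
numeric_W2 = np.zeros_like(W2)
for k in range(V):
    step = np.zeros_like(b2)
    step[k] = eps
    numeric_b2[k] = (loss_fn(W2, b2 + step, h, y) - loss_fn(W2, b2 - step, h, y)) / (2 * eps)
    for j in range(N):
        step_W = np.zeros_like(W2)
        step_W[k, j] = eps
        numeric_W2[k, j] = (loss_fn(W2 + step_W, b2, h, y) - loss_fn(W2 - step_W, b2, h, y)) / (2 * eps)

print(np.max(np.abs(grad_b2 - numeric_b2)))
print(np.max(np.abs(grad_W2 - numeric_W2)))
```

Both printed differences should be tiny compared with the size of the gradients themselves.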

Now we will calculate the partial derivative with respect to $\mathbf{b_1}$ and store the result in `grad_b1`.

$$\frac{\partial J}{\partial \mathbf{b_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right ) $$

In [121]:
grad_b1 = relu(np.dot(W2.T, y_hat - y))
grad_b1

array([[0.        ],
       [0.        ],
       [0.17045858]])

Finally, we will calculate the partial derivative of the loss with respect to $\mathbf{W_1}$, and store it in `grad_W1`.

$$\frac{\partial J}{\partial \mathbf{W_1}} = \rm{ReLU}\left ( \mathbf{W_2^\top} (\mathbf{\hat{y}} - \mathbf{y})\right )\mathbf{x}^\top $$

In [122]:
grad_W1 = np.dot(relu(np.dot(W2.T, y_hat - y)), x.T)
grad_W1

array([[0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.04261464, 0.04261464, 0.        , 0.08522929, 0.        ]])

In [123]:
print(f'V (vocabulary size): {V}')
print(f'N (embedding size / size of the hidden layer): {N}')
print(f'size of grad_W1: {grad_W1.shape} (NxV)')
print(f'size of grad_b1: {grad_b1.shape} (Nx1)')
print(f'size of grad_W2: {grad_W2.shape} (VxN)')
print(f'size of grad_b2: {grad_b2.shape} (Vx1)')

V (vocabulary size): 5
N (embedding size / size of the hidden layer): 3
size of grad_W1: (3, 5) (NxV)
size of grad_b1: (3, 1) (Nx1)
size of grad_W2: (5, 3) (VxN)
size of grad_b2: (5, 1) (Vx1)


#### **Gradient descent**

During the gradient descent phase, we will update the weights and biases by subtracting $\alpha$ times the gradient from the original matrices and vectors, using the following formulas.

\begin{align}
 \mathbf{W_1} &:= \mathbf{W_1} - \alpha \frac{\partial J}{\partial \mathbf{W_1}} \\
 \mathbf{W_2} &:= \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \\
 \mathbf{b_1} &:= \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \\
 \mathbf{b_2} &:= \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \\
\end{align}

First, let's set a value for $\alpha$.

In [124]:
alpha = 0.03

The updated weight matrix $\mathbf{W_1}$ will be:

In [125]:
W1_new = W1 - alpha * grad_W1

Let's compare the previous and new values of $\mathbf{W_1}$:

In [126]:
print('old value of W1:')
print(W1)
print()
print('new value of W1:')
print(W1_new)

old value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26637602 -0.23846886 -0.37770863 -0.11399446  0.34008124]]

new value of W1:
[[ 0.41687358  0.08854191 -0.23495225  0.28320538  0.41800106]
 [ 0.32735501  0.22795148 -0.23951958  0.4117634  -0.23924344]
 [ 0.26509758 -0.2397473  -0.37770863 -0.11655134  0.34008124]]


The difference is very subtle (only the last row changes), which is why it takes a fair number of iterations to train the neural network until it reaches optimal weights and biases starting from random values.

Now we will calculate the new values of $\mathbf{W_2}$ (to be stored in `W2_new`), $\mathbf{b_1}$ (in `b1_new`), and $\mathbf{b_2}$ (in `b2_new`).

\begin{align}
 \mathbf{W_2} &:= \mathbf{W_2} - \alpha \frac{\partial J}{\partial \mathbf{W_2}} \\
 \mathbf{b_1} &:= \mathbf{b_1} - \alpha \frac{\partial J}{\partial \mathbf{b_1}} \\
 \mathbf{b_2} &:= \mathbf{b_2} - \alpha \frac{\partial J}{\partial \mathbf{b_2}} \\
\end{align}

In [127]:
W2_new = W2 - alpha * grad_W2
b1_new = b1 - alpha * grad_b1
b2_new = b2 - alpha * grad_b2

print('W2_new')
print(W2_new)
print()
print('b1_new')
print(b1_new)
print()
print('b2_new')
print(b2_new)

W2_new
[[-0.22384758 -0.43362588  0.13310965]
 [ 0.08265956  0.0775535   0.1772054 ]
 [ 0.19557112 -0.04637608 -0.1790735 ]
 [ 0.06855622 -0.02363691  0.36107434]
 [ 0.33251813 -0.3982269  -0.43959196]]

b1_new
[[ 0.09688219]
 [ 0.29239497]
 [-0.27875802]]

b2_new
[[ 0.02964508]
 [-0.36970753]
 [-0.10468778]
 [-0.35349417]
 [-0.0764456 ]]
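
Since all four updates have the same form, they can be applied in one go. The sketch below uses a hypothetical helper, `gradient_descent_step`, that works on dictionaries of parameters and gradients. It is demonstrated on $\mathbf{b_1}$ only, with values copied from the cells above.

```python
import numpy as np

def gradient_descent_step(parameters, gradients, alpha):
    # Hypothetical helper: apply theta := theta - alpha * gradient to every parameter.
    return {name: parameters[name] - alpha * gradients[name] for name in parameters}

parameters = {'b1': np.array([[0.09688219], [0.29239497], [-0.27364426]])}
gradients = {'b1': np.array([[0.0], [0.0], [0.17045858]])}

updated = gradient_descent_step(parameters, gradients, alpha=0.03)
print(updated['b1'])
```

Returning a new dictionary, rather than updating in place, keeps the old values available for comparison, as done with `W1` and `W1_new` above.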


#### **Extracting word embedding vectors**

Once the neural network has been trained, we have three options to get word embedding vectors for the words of our vocabulary, based on the weight matrices $\mathbf{W_1}$ and/or $\mathbf{W_2}$. For illustration, the cells below use the matrices `W1` and `W2` defined earlier.

### Option 1: extract embedding vectors from $\mathbf{W_1}$

The first option is to take the columns of $\mathbf{W_1}$ as the embedding vectors of the words of the vocabulary, using the same order of the words as for the input and output vectors.

For example, suppose $\mathbf{W_1}$ is this matrix:

In [128]:
W1

array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
       [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
       [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])

The first column, which is a 3-element vector, is the embedding vector of the first word of our vocabulary. The second column is the word embedding vector for the second word, and so on.

The first, second, etc. words are ordered as follows.

In [129]:
for i in range(V):
    print(Ind2word[i])

am
because
happy
i
learning


So the word embedding vectors corresponding to each word are:

In [130]:
# loop through each word of the vocabulary
for word in word2Ind:
    # extract the column corresponding to the index of the word in the vocabulary
    word_embedding_vector = W1[:, word2Ind[word]]
    
    print(f'{word}: {word_embedding_vector}')

am: [0.41687358 0.32735501 0.26637602]
because: [ 0.08854191  0.22795148 -0.23846886]
happy: [-0.23495225 -0.23951958 -0.37770863]
i: [ 0.28320538  0.4117634  -0.11399446]
learning: [ 0.41800106 -0.23924344  0.34008124]


### Option 2: extract embedding vectors from $\mathbf{W_2}$

The second option is to take the transpose of $\mathbf{W_2}$, and take its columns as the word embedding vectors just like we did for $\mathbf{W_1}$.

In [131]:
W2.T

array([[-0.22182064,  0.08476603,  0.1871551 ,  0.07055222,  0.33480474],
       [-0.43008631,  0.08123194, -0.06107263, -0.02015138, -0.39423389],
       [ 0.13310965,  0.1772054 , -0.1790735 ,  0.36107434, -0.43959196]])

In [132]:
# loop through each word of the vocabulary
for word in word2Ind:
    # extract the column corresponding to the index of the word in the vocabulary
    word_embedding_vector = W2.T[:, word2Ind[word]]
    
    print(f'{word}: {word_embedding_vector}')

am: [-0.22182064 -0.43008631  0.13310965]
because: [0.08476603 0.08123194 0.1772054 ]
happy: [ 0.1871551  -0.06107263 -0.1790735 ]
i: [ 0.07055222 -0.02015138  0.36107434]
learning: [ 0.33480474 -0.39423389 -0.43959196]


### Option 3: extract embedding vectors from $\mathbf{W_1}$ and $\mathbf{W_2}$

The third option uses the average of $\mathbf{W_1}$ and $\mathbf{W_2^\top}$.

In [133]:
W3 = (W1+W2.T)/2
W3

array([[ 0.09752647,  0.08665397, -0.02389858,  0.1768788 ,  0.3764029 ],
       [-0.05136565,  0.15459171, -0.15029611,  0.19580601, -0.31673866],
       [ 0.19974284, -0.03063173, -0.27839106,  0.12353994, -0.04975536]])

Extracting the word embedding vectors works just like the two previous options, by taking the columns of the matrix we have just created.

In [135]:
# loop through each word of the vocabulary
for word in word2Ind:
    # extract the column corresponding to the index of the word in the vocabulary
    word_embedding_vector = W3[:, word2Ind[word]]    
    print(f'{word}: {word_embedding_vector}')

am: [ 0.09752647 -0.05136565  0.19974284]
because: [ 0.08665397  0.15459171 -0.03063173]
happy: [-0.02389858 -0.15029611 -0.27839106]
i: [0.1768788  0.19580601 0.12353994]
learning: [ 0.3764029  -0.31673866 -0.04975536]
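
For later use, the embeddings can be collected in a dictionary keyed by word. The sketch below does this for the third option with a hypothetical helper, `extract_embeddings`. The matrices and the mapping are copied from the cells above so that it runs on its own.

```python
import numpy as np

def extract_embeddings(W1, W2, word2Ind):
    # Hypothetical helper: average W1 and the transpose of W2, then take one column per word.
    W3 = (W1 + W2.T) / 2
    return {word: W3[:, index] for word, index in word2Ind.items()}

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
W1 = np.array([[ 0.41687358,  0.08854191, -0.23495225,  0.28320538,  0.41800106],
               [ 0.32735501,  0.22795148, -0.23951958,  0.4117634 , -0.23924344],
               [ 0.26637602, -0.23846886, -0.37770863, -0.11399446,  0.34008124]])
W2 = np.array([[-0.22182064, -0.43008631,  0.13310965],
               [ 0.08476603,  0.08123194,  0.1772054 ],
               [ 0.1871551 , -0.06107263, -0.1790735 ],
               [ 0.07055222, -0.02015138,  0.36107434],
               [ 0.33480474, -0.39423389, -0.43959196]])

embeddings = extract_embeddings(W1, W2, word2Ind)
print(embeddings['happy'])
```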
