## Embeddings

In our previous example, we worked with high-dimensional bag-of-words vectors of length `vocab_size`, and we explicitly transformed low-dimensional positional representation vectors into sparse one-hot representations. This one-hot representation is not memory-efficient. Moreover, each word is treated as completely independent, meaning one-hot encoded vectors fail to capture semantic relationships between words.

In this unit, we will continue analyzing the **News AG** dataset. To start, let's load the data and retrieve some definitions from the previous unit.


In [2]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

ds_train, ds_test = tfds.load('ag_news_subset').values()

### What's an embedding?

The concept of **embedding** involves representing words as lower-dimensional dense vectors that capture the semantic meaning of the word. We'll discuss how to create meaningful word embeddings later, but for now, think of embeddings as a way to reduce the dimensionality of a word vector.

An embedding layer takes a word as input and outputs a vector of a specified `embedding_size`. In a way, it is similar to a `Dense` layer, but instead of requiring a one-hot encoded vector as input, it can directly take a word number.
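The similarity to a `Dense` layer can be checked numerically. The following is a minimal NumPy sketch, not the Keras implementation: the matrix `E`, its sizes, and the token index are made-up illustration values. It suggests that multiplying a one-hot vector by a weight matrix simply selects one row, which is what an index lookup does directly.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration
vocab_size, embedding_size = 6, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_size))  # embedding matrix

token_id = 4

# Dense-style computation: one-hot vector times the weight matrix
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
dense_result = one_hot @ E

# Embedding-style computation: direct row lookup, no one-hot vector needed
lookup_result = E[token_id]

same = np.allclose(dense_result, lookup_result)
print(same)
```

The lookup avoids building a `vocab_size`-long vector for every token, which is where the memory saving comes from.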

By using an embedding layer as the first layer in our network, we can transition from a bag-of-words model to an **embedding bag** model. In this approach, we first convert each word in the text into its corresponding embedding and then apply an aggregate function over all those embeddings, such as `sum`, `average`, or `max`.

![Image showing an embedding classifier for five sequence words.](../../../../../translated_images/embedding-classifier-example.b77f021a7ee67eeec8e68bfe11636c5b97d6eaa067515a129bfb1d0034b1ac5b.en.png)
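To make the aggregation step concrete, here is a small NumPy sketch under stated assumptions: `token_embeddings` is a random stand-in for the embeddings of a five-token text, and the embedding size of 4 is arbitrary. Each aggregate collapses the token axis into a single vector.

```python
import numpy as np

# Hypothetical embeddings for a 5-token text, embedding size 4
rng = np.random.default_rng(1)
token_embeddings = rng.normal(size=(5, 4))  # shape (n, embedding_size)

# Each aggregate collapses the token axis, leaving one vector per text
bag_sum = token_embeddings.sum(axis=0)
bag_mean = token_embeddings.mean(axis=0)
bag_max = token_embeddings.max(axis=0)

# Like bag-of-words, the result does not depend on word order
reversed_mean = token_embeddings[::-1].mean(axis=0)

print(bag_sum.shape, bag_mean.shape, bag_max.shape)
```

Because the aggregates ignore token order, an embedding bag keeps the order-insensitivity of bag-of-words while working with much smaller vectors.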

Our neural network classifier is composed of the following layers:

* A `TextVectorization` layer, which takes a string as input and outputs a tensor of token numbers. We'll define a reasonable vocabulary size `vocab_size` and ignore less frequently used words. The input shape will be 1, and the output shape will be $n$, as we'll obtain $n$ tokens, each represented by a number from 0 to `vocab_size - 1`.
* An `Embedding` layer, which takes $n$ numbers and maps each number to a dense vector of a specified length (100 in our example). As a result, the input tensor of shape $n$ will be transformed into an $n\times 100$ tensor.
* An aggregation layer, which computes the average of this tensor along the token axis (`axis=1` in the code, because axis 0 is the minibatch dimension). In other words, it calculates the average of the $n$ embedding vectors corresponding to the different words. This layer will be implemented using a `Lambda` layer, where we'll pass a function to compute the average. The output will have a shape of 100, and it serves as the numeric representation of the entire input sequence.
* A final `Dense` linear classifier.


In [3]:
vocab_size = 30000
batch_size = 128

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,    
    keras.layers.Embedding(vocab_size,100),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, None, 100)         3000000   
                                                                 
 lambda (Lambda)             (None, 100)               0         
                                                                 
 dense (Dense)               (None, 4)                 404       
                                                                 
Total params: 3,000,404
Trainable params: 3,000,404
Non-trainable params: 0
_________________________________________________________________


In the `summary` printout, in the **output shape** column, the first tensor dimension `None` represents the minibatch size, and the second represents the length of the token sequence. Each token sequence in the minibatch has a different length. We'll cover how to handle this in the next section.
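The parameter counts in the printout can be reproduced by hand. This short sketch only redoes the arithmetic for the sizes used above (30000 tokens, 100-dimensional embeddings, 4 classes); the variable names are illustrative.

```python
vocab_size = 30000
embedding_size = 100
num_classes = 4

# Embedding: one vector of length embedding_size per vocabulary entry
embedding_params = vocab_size * embedding_size

# Dense: one weight per (input, class) pair plus one bias per class
dense_params = embedding_size * num_classes + num_classes

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding matrix, so the vocabulary size and embedding size dominate the model's memory footprint.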

Now let's train the network:


In [4]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

print("Training vectorizer")
vectorizer.adapt(ds_train.take(500).map(extract_text))

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x22255515100>

**Note** that we are building the vectorizer from a subset of the data. This speeds up the process, but it means that some tokens from our text may not be present in the vocabulary. Those tokens are not represented individually (the vectorizer maps them all to a shared out-of-vocabulary token), which may result in slightly lower accuracy. In practice, however, a subset of the text often gives a good estimate of the vocabulary.
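One way to judge whether a subset is large enough is to measure how many token occurrences the resulting vocabulary covers. The sketch below is a plain-Python illustration on a made-up six-sentence corpus; `build_vocab` and `coverage` are hypothetical helpers, not part of Keras, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the news texts
corpus = [
    "stocks rise as markets rally",
    "team wins the final match",
    "markets fall as oil prices rise",
    "new phone released by tech company",
    "oil company reports record profits",
    "team loses match after late goal",
]

def build_vocab(texts, max_tokens):
    """Keep the max_tokens most frequent lowercase words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word for word, _ in counts.most_common(max_tokens)}

def coverage(texts, vocab):
    """Fraction of word occurrences in texts that are in vocab."""
    words = [word for text in texts for word in text.lower().split()]
    return sum(word in vocab for word in words) / len(words)

subset_vocab = build_vocab(corpus[:3], max_tokens=50)
full_vocab = build_vocab(corpus, max_tokens=50)

subset_coverage = coverage(corpus, subset_vocab)
full_coverage = coverage(corpus, full_vocab)
print(f"subset vocabulary covers {subset_coverage:.0%} of tokens")
print(f"full vocabulary covers {full_coverage:.0%} of tokens")
```

On a corpus this small the gap is exaggerated; with real data, frequent words tend to appear early, so coverage usually rises quickly as the subset grows.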


### Handling variable sequence lengths

Let's explore how training works with minibatches. In the example above, the input tensor has a dimension of 1, and we use minibatches of size 128, making the actual tensor size $128 \times 1$. However, the number of tokens in each sentence varies. When we apply the `TextVectorization` layer to a single input, the number of tokens returned depends on how the text is tokenized:


In [5]:
print(vectorizer('Hello, world!'))
print(vectorizer('I am glad to meet you!'))

tf.Tensor([ 1 45], shape=(2,), dtype=int64)
tf.Tensor([ 112 1271    1    3 1747  158], shape=(6,), dtype=int64)


However, when we apply the vectorizer to several sequences, it has to produce a tensor of rectangular shape, so it fills unused elements with the PAD token (which in our case is zero):


In [6]:
vectorizer(['Hello, world!','I am glad to meet you!'])

<tf.Tensor: shape=(2, 6), dtype=int64, numpy=
array([[   1,   45,    0,    0,    0,    0],
       [ 112, 1271,    1,    3, 1747,  158]], dtype=int64)>

Here we can see the embeddings. Note that all padding positions (token 0) map to the same embedding vector:


In [7]:
model.layers[1](vectorizer(['Hello, world!','I am glad to meet you!'])).numpy()

array([[[ 1.53059261e-02,  6.80514947e-02,  3.14026810e-02, ...,
         -8.92002955e-02,  1.52911525e-04, -5.65562584e-02],
        [ 2.57456154e-01,  2.79364467e-01, -2.03605562e-01, ...,
         -2.07474351e-01,  8.31158683e-02, -2.03911960e-01],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02],
        [ 3.98201384e-02, -8.03454965e-03,  2.39790026e-02, ...,
         -7.18549127e-04,  2.66963355e-02, -4.30646613e-02]],

       [[ 1.89674050e-01,  2.61548996e-01, -3.67433839e-02, ...,
         -2.07366899e-01, -1.05442435e-01, -2.36952081e-01],
        [ 6.16133213e-02,  1.80511594e-01,  9.77298319e-02, ...,
         -5.46628237e-02, -1.07340455e-01, -1.06589

**Note**: To minimize the amount of padding, in some cases it makes sense to sort all sequences in the dataset in the order of increasing length (or, more precisely, number of tokens). This will ensure that each minibatch contains sequences of similar length.
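The effect of sorting on padding can be illustrated without TensorFlow. In this sketch the token sequences are made up, and `pad_batch` and `count_padding` are hypothetical helpers that mimic right-padding each minibatch to its longest sequence.

```python
# Hypothetical token sequences of varying length (token ids are made up)
sequences = [[5, 2], [7, 1, 4, 9, 3, 8], [6], [2, 2, 5, 1, 7], [4, 4, 4], [9, 8, 7, 6]]

def pad_batch(batch, pad_id=0):
    """Right-pad every sequence in the batch to the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]

def count_padding(seqs, batch_size):
    """Total number of PAD tokens added when batching seqs in the given order."""
    total = 0
    for start in range(0, len(seqs), batch_size):
        batch = seqs[start:start + batch_size]
        padded = pad_batch(batch)
        total += sum(len(row) for row in padded) - sum(len(seq) for seq in batch)
    return total

unsorted_padding = count_padding(sequences, batch_size=2)
sorted_padding = count_padding(sorted(sequences, key=len), batch_size=2)
print(unsorted_padding, sorted_padding)  # 9 3
```

With these six sequences, batching in the original order adds 9 PAD tokens, while batching after sorting by length adds only 3.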


## Semantic embeddings: Word2Vec

In our previous example, the embedding layer learned to map words to vector representations, but these representations lacked semantic meaning. It would be ideal to learn a vector representation where similar words or synonyms correspond to vectors that are close to each other based on some vector distance (e.g., Euclidean distance).
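As an illustration of what "close to each other" means, the sketch below compares Euclidean distance and cosine similarity on three invented 3-dimensional vectors. The numbers are assumptions made up for the example, not real embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration
vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means closer)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_close = euclidean_distance(vectors["cat"], vectors["kitten"])
d_far = euclidean_distance(vectors["cat"], vectors["car"])
s_close = cosine_similarity(vectors["cat"], vectors["kitten"])
s_far = cosine_similarity(vectors["cat"], vectors["car"])
print(d_close < d_far, s_close > s_far)
```

In a semantic embedding space, related words such as the hypothetical "cat" and "kitten" would be expected to score as closer than unrelated ones under either measure.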

To achieve this, we need to pretrain our embedding model on a large text corpus using a technique like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec). This method relies on two main architectures to generate distributed word representations:

 - **Continuous bag-of-words** (CBoW), where the model is trained to predict a word based on its surrounding context. Given the n-gram $(W_{-2},W_{-1},W_0,W_1,W_2)$, the model's goal is to predict $W_0$ using $(W_{-2},W_{-1},W_1,W_2)$.
 - **Continuous skip-gram**, which works in the opposite way to CBoW. Here, the model uses the current word $W_0$ to predict the surrounding context words $(W_{-2},W_{-1},W_1,W_2)$.

CBoW is faster, but skip-gram, while slower, performs better at representing rare words.
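The difference between the two architectures is easiest to see in the training examples they consume. The following plain-Python sketch generates such examples for a window of two words on each side; the helper names `cbow_examples` and `skipgram_examples` are hypothetical, and the sketch covers only data preparation, not the neural network training itself.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs: the context predicts the centre word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (centre_word, context_word) pairs: the centre word predicts each neighbour."""
    for context, centre in cbow_examples(tokens, window):
        for neighbour in context:
            yield centre, neighbour

sentence = "the quick brown fox jumps".split()
cbow = list(cbow_examples(sentence))
skipgram = list(skipgram_examples(sentence))

print(cbow[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```

Skip-gram produces one training example per (centre, neighbour) pair, so it generates more examples from the same text than CBoW does.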

![Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.](../../../../../translated_images/example-algorithms-for-converting-words-to-vectors.fbe9207a726922f6f0f5de66427e8a6eda63809356114e28fb1fa5f4a83ebda7.en.png)

To experiment with the Word2Vec embedding pretrained on the Google News dataset, we can use the **gensim** library. Below, we find the words most similar to 'neural'.

> **Note:** When you first create word vectors, downloading them may take some time!


In [8]:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')

In [12]:
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

neuronal -> 0.7804799675941467
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851724624634
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923246383666992
synaptic -> 0.6699118614196777
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688


We can also extract the vector embedding from the word, to be used in training the classification model. The embedding has 300 components, but here we only show the first 20 components of the vector for clarity:


In [13]:
w2v['play'][:20]

array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,
        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,
       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,
       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],
      dtype=float32)

The great thing about semantic embeddings is that you can manipulate the vector encoding based on semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*:


In [14]:
w2v.most_similar(positive=['king','woman'],negative=['man'])[0]

('queen', 0.7118192911148071)

The example above uses some internal gensim magic, but the underlying logic is actually quite simple. An interesting property of embeddings is that you can perform ordinary vector operations on embedding vectors, and the results reflect operations on word **meanings**. The example above can be expressed in terms of vector operations: we calculate the vector corresponding to **KING-MAN+WOMAN** (the operations `+` and `-` are performed on the vector representations of the corresponding words), and then find the word in the dictionary that is closest to that vector:


In [15]:
# get the vector corresponding to king-man+woman
qvec = w2v['king']-1.7*w2v['man']+1.7*w2v['woman']
# find the index of the closest embedding vector 
d = np.sum((w2v.vectors-qvec)**2,axis=1)
min_idx = np.argmin(d)
# find the corresponding word
w2v.index_to_key[min_idx]

'queen'

> **NOTE**: We had to add small coefficients to the *man* and *woman* vectors - try removing them to see what happens.

To identify the closest vector, we use NumPy to calculate the squared Euclidean distances between our vector and all the vectors in the vocabulary, and then use `argmin` to find the index of the word with the smallest distance.
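The same distance-and-`argmin` logic can be tried on a toy vocabulary without downloading any model. In this sketch the 2-dimensional vectors are invented so that one axis loosely plays the role of "royalty" and the other of "gender"; real embeddings are not this tidy.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
words = ["king", "queen", "man", "woman", "apple"]
vectors = np.array([
    [0.9, 0.1],   # king
    [0.9, 0.9],   # queen
    [0.1, 0.1],   # man
    [0.1, 0.9],   # woman
    [0.5, 0.5],   # apple
])
index = {word: i for i, word in enumerate(words)}

# Vector arithmetic on the embeddings: king - man + woman
query = vectors[index["king"]] - vectors[index["man"]] + vectors[index["woman"]]

# Squared Euclidean distance from the query to every vocabulary vector
distances = np.sum((vectors - query) ** 2, axis=1)
closest = words[int(np.argmin(distances))]
print(closest)  # queen
```

Broadcasting subtracts `query` from every row of `vectors` at once, so the whole vocabulary is searched with a single array expression.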


While Word2Vec appears to be an excellent method for capturing word semantics, it has several drawbacks, including the following:

* Both CBoW and skip-gram models are **predictive embeddings**, meaning they only consider local context. Word2Vec does not utilize global context.
* Word2Vec does not account for word **morphology**, i.e., the fact that the meaning of a word can depend on its components, such as the root.

**FastText** aims to address the second limitation by extending Word2Vec. It learns vector representations for each word as well as the character n-grams within each word. During training, these representations are averaged into a single vector at each step. Although this adds significant computational overhead during pretraining, it allows word embeddings to incorporate sub-word information.
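To give a feel for the sub-word units involved, here is a small sketch that extracts character n-grams from a word. The boundary markers `<` and `>` and the default n-gram range of 3 to 6 are assumptions based on the usual description of FastText; `char_ngrams` is a hypothetical helper, not a FastText or gensim API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because words that share a root also share many n-grams, their vectors share components, and even a word unseen during training can be given a vector from its n-grams.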

Another approach, **GloVe**, takes a different route to word embeddings by relying on the factorization of the word-context matrix. It first constructs a large matrix that counts word occurrences across various contexts, then attempts to represent this matrix in lower dimensions while minimizing reconstruction loss.
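The two steps described above (count, then compress) can be sketched with NumPy on a toy corpus. Note the substitution: this example uses a truncated SVD of the log-counts as a stand-in for the low-rank step, whereas GloVe itself optimizes a weighted least-squares objective. The corpus, window size, and rank are arbitrary illustration values.

```python
import numpy as np

# Hypothetical toy corpus; the window size and rank are arbitrary illustration values
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog".split(),
]
vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Step 1: count how often each word appears within `window` positions of another
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[word], index[sentence[j]]] += 1

# Step 2: low-rank approximation of the log-counts via truncated SVD
# (a stand-in for GloVe's weighted least-squares objective)
log_counts = np.log1p(counts)
U, S, Vt = np.linalg.svd(log_counts)

def reconstruction_error(rank):
    """Frobenius norm of the difference between log_counts and its rank-k approximation."""
    approx = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return float(np.linalg.norm(log_counts - approx))

rank = 3
embeddings = U[:, :rank] * np.sqrt(S[:rank])  # one dense vector per word
print(embeddings.shape, reconstruction_error(2), reconstruction_error(rank))
```

Increasing the rank lowers the reconstruction error at the cost of longer word vectors, which is the trade-off any factorization-based embedding has to balance.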

The gensim library supports these word embedding methods, and you can experiment with them by modifying the model loading code above.


## Using pretrained embeddings in Keras

We can adapt the example above to initialize the matrix in our embedding layer with semantic embeddings, such as Word2Vec. The vocabularies of the pretrained embedding and the text corpus likely won't match, so we need to decide which one to use. Here, we explore two possible approaches: using the tokenizer vocabulary or using the vocabulary from the Word2Vec embeddings.

### Using tokenizer vocabulary

When using the tokenizer vocabulary, some words in the vocabulary will have corresponding Word2Vec embeddings, while others will not. Assuming our vocabulary size is `vocab_size` and the Word2Vec embedding vector length is `embed_size`, the embedding layer will be represented by a weight matrix with dimensions `vocab_size`$\times$`embed_size`. We will fill this matrix by iterating through the vocabulary:


In [9]:
embed_size = len(w2v.get_vector('hello'))
print(f'Embedding size: {embed_size}')

vocab = vectorizer.get_vocabulary()
W = np.zeros((vocab_size,embed_size))
print('Populating matrix, this will take some time...',end='')
found, not_found = 0,0
for i,w in enumerate(vocab):
    try:
        W[i] = w2v.get_vector(w)
        found+=1
    except KeyError:
        # W[i] = np.random.normal(0.0,0.3,size=(embed_size,))
        not_found+=1

print(f"Done, found {found} words, {not_found} words missing")

Embedding size: 300
Populating matrix, this will take some time...Done, found 4551 words, 784 words missing


For words that are not present in the Word2Vec vocabulary, we can either leave them as zeroes, or generate a random vector.
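Both options can be compared with a small stand-alone sketch. Here a plain dictionary plays the role of the Word2Vec vectors and a short list plays the role of the vectorizer vocabulary; `build_embedding_matrix` is a hypothetical helper, and the standard deviation of 0.3 is simply the value from the commented-out line in the code above.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, embed_size, random_init=False, seed=0):
    """Build a (len(vocab), embed_size) matrix from a word -> vector mapping.

    Words missing from `pretrained` stay zero, or get a small random vector
    when random_init is True.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((len(vocab), embed_size))
    missing = []
    for i, word in enumerate(vocab):
        if word in pretrained:
            W[i] = pretrained[word]
        else:
            missing.append(word)
            if random_init:
                W[i] = rng.normal(0.0, 0.3, size=embed_size)
    return W, missing

# Hypothetical stand-ins for the vectorizer vocabulary and the Word2Vec vectors
vocab = ["", "[UNK]", "market", "goal", "blockchainz"]
pretrained = {"market": np.array([0.1, 0.2, 0.3]), "goal": np.array([0.4, 0.5, 0.6])}

W_zero, missing = build_embedding_matrix(vocab, pretrained, embed_size=3)
W_rand, _ = build_embedding_matrix(vocab, pretrained, embed_size=3, random_init=True)
print(missing)  # ['', '[UNK]', 'blockchainz']
```

Zero rows contribute nothing to the averaged text vector, while random rows give each missing word a distinct representation; which works better depends on whether the embedding layer is trained further.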

Now we can define an embedding layer with pretrained weights:


In [10]:
emb = keras.layers.Embedding(vocab_size,embed_size,weights=[W],trainable=False)
model = keras.models.Sequential([
    vectorizer, emb,
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])

In [11]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),
          validation_data=ds_test.map(tupelize).batch(batch_size))



<keras.callbacks.History at 0x2220226ef10>

> **Note**: Notice that we set `trainable=False` when creating the `Embedding`, which means that we're not retraining the Embedding layer. This may cause accuracy to be slightly lower, but it speeds up the training.

### Using embedding vocabulary

One issue with the previous approach is that the vocabularies used in the TextVectorization and Embedding are different. To overcome this problem, we can use one of the following solutions:
* Re-train the Word2Vec model on our vocabulary.
* Load our dataset with the vocabulary from the pretrained Word2Vec model. Vocabularies used to load the dataset can be specified during loading.

The latter approach seems easier, so let's implement it. First of all, we will create a `TextVectorization` layer with the specified vocabulary, taken from the Word2Vec embeddings:


In [12]:
vocab = list(w2v.vocab.keys())
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))
vectorizer.set_vocabulary(vocab)

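The gensim word embeddings library contains a convenient method, `get_keras_embedding`, which will automatically create the corresponding Keras embedding layer for you. Note that `w2v.vocab` and `get_keras_embedding` belong to the gensim 3.x API; in gensim 4.0 and later, the vocabulary is exposed as `w2v.key_to_index`, and `get_keras_embedding` is no longer available.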


In [13]:
model = keras.models.Sequential([
    vectorizer, 
    w2v.get_keras_embedding(train_embeddings=False),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x,axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(128),validation_data=ds_test.map(tupelize).batch(128),epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x2220ccb81c0>

One of the reasons we're not seeing higher accuracy is that some words from our dataset are missing from the pretrained Word2Vec vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings based on our dataset.


## Contextual embeddings

One major drawback of traditional pretrained embedding representations like Word2Vec is that, while they can capture some sense of a word's meaning, they fail to distinguish between different meanings. This can lead to issues in downstream models.

For instance, the word 'play' has distinct meanings in these two sentences:
- I went to a **play** at the theater.
- John wants to **play** with his friends.

The pretrained embeddings we mentioned earlier represent both meanings of the word 'play' using the same embedding. To address this limitation, we need to create embeddings based on a **language model**, which is trained on a large text corpus and *understands* how words can be used in various contexts. While discussing contextual embeddings is beyond the scope of this tutorial, we will revisit them when we explore language models in the next unit.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
