<a href="https://colab.research.google.com/github/LolitaSian/Getting-Started-with-Google-BERT/blob/main/Chapter03/3.04.%20Extracting%20embeddings%20from%20all%20encoder%20layers%20of%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting embeddings from all encoder layers of BERT
We learned how to extract embeddings from the pre-trained BERT in the previous section, and we learned that those embeddings come from the final encoder layer. Now the question is: should we consider only the embedding obtained from the final encoder layer (final hidden state), or should we also consider the embeddings obtained from all the encoder layers (all hidden states)? Let's explore this further.

Let us represent the input embedding layer by $h_0$, the first encoder layer (first hidden layer) by $h_1$, the second encoder layer (second hidden layer) by $h_2$, and so on up to the final, twelfth encoder layer $h_{12}$, as shown in the following figure:


![title](https://github.com/LolitaSian/Getting-Started-with-Google-BERT/blob/main/Chapter03/images/4.png?raw=1)


Instead of taking the embeddings (representations) only from the final encoder layer, the researchers behind BERT experimented with taking embeddings from different encoder layers.

For instance, for a named-entity recognition (NER) task, the researchers used the pre-trained BERT to extract features. Instead of using only the embedding from the final encoder layer (final hidden layer) as a feature, they experimented with using embeddings from other encoder layers (other hidden layers) as features and obtained the following F1 scores:


![title](https://github.com/LolitaSian/Getting-Started-with-Google-BERT/blob/main/Chapter03/images/5.png?raw=1)

As we can observe from the preceding table, concatenating the embeddings of the last 4 encoder layers (last 4 hidden layers) gives a higher F1 score, 96.1%, on the NER task than using the final encoder layer alone. Thus, instead of taking the embeddings only from the final encoder layer (final hidden layer), we can also use embeddings from the other encoder layers.

Now, we will learn how to extract the embeddings from all the encoder layers using the transformers library. 

## Extracting the embeddings 
First, let us install the transformers library and import the necessary modules:

In [20]:
!pip install transformers



In [21]:
from transformers import BertModel, BertTokenizer
import torch


Next, download the pre-trained BERT model and tokenizer. Notice that when loading the pre-trained BERT model, we set output_hidden_states = True. Setting this to True lets us obtain the embeddings from all the encoder layers:


In [22]:
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Next, we preprocess the input before feeding it to the model. 

## Preprocess the input
Let's consider the same sentence we saw in the previous section. First, we tokenize the sentence and add the [CLS] token at the beginning and the [SEP] token at the end:


In [23]:
sentence = 'I love Paris'
tokens = tokenizer.tokenize(sentence)
tokens = ['[CLS]'] + tokens + ['[SEP]']


Suppose we need to keep the token length at 7. So, we add two [PAD] tokens and also define the attention mask:

In [24]:
tokens = tokens + ['[PAD]'] + ['[PAD]']
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
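
The padding step above hard-codes two [PAD] tokens. A minimal sketch of a more general version follows. The helper `pad_tokens` and its `max_length` parameter are hypothetical names introduced here for illustration and are not part of the transformers library. The sketch assumes the input is never longer than the target length.

```python
def pad_tokens(tokens, max_length, pad_token='[PAD]'):
    """Right-pad a token list to max_length and build the matching attention mask.

    Hypothetical helper for illustration; it assumes len(tokens) <= max_length.
    """
    if len(tokens) > max_length:
        raise ValueError('sequence is longer than max_length')
    padded = tokens + [pad_token] * (max_length - len(tokens))
    # 1 marks a real token, 0 marks padding that the model should ignore
    mask = [0 if token == pad_token else 1 for token in padded]
    return padded, mask


# The lowercase tokens assume the uncased tokenizer used in this section
example_tokens = ['[CLS]', 'i', 'love', 'paris', '[SEP]']
padded_tokens, example_mask = pad_tokens(example_tokens, max_length=7)
print(padded_tokens)
print(example_mask)
```

In practice, calling the tokenizer directly with `padding='max_length'` and `max_length=7` produces the padded ids and the attention mask in one step.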


Next, we convert the tokens to their token_ids:

In [25]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)




Now, we convert the token_ids and attention_mask to tensors:

In [26]:
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
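
To make the effect of unsqueeze(0) concrete, the sketch below uses a hand-written list of ids as a stand-in for the tokenizer output. The ids are illustrative values assumed to match the bert-base-uncased vocabulary; in the notebook they come from convert_tokens_to_ids.

```python
import torch

# Stand-in ids, assumed to correspond to [CLS] i love paris [SEP] [PAD] [PAD]
example_ids = [101, 1045, 2293, 3000, 102, 0, 0]

ids_1d = torch.tensor(example_ids)  # shape [7]: a single unbatched sequence
ids_2d = ids_1d.unsqueeze(0)        # shape [1, 7]: a batch holding one sequence

print(ids_1d.shape)
print(ids_2d.shape)
```

The model expects inputs shaped [batch_size, sequence_length], which is why the extra leading dimension is added.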



Now that we have preprocessed the input, let's get the embedding. 

## Getting the embedding 
Since we set output_hidden_states = True when loading the model, the model now returns an output object with three fields, as shown below:


In [27]:
output = model(token_ids, attention_mask = attention_mask)
last_hidden_state = output.last_hidden_state
pooler_output = output.pooler_output
hidden_states = output.hidden_states


In the preceding code, the following applies: 

- last_hidden_state contains the representation of all the tokens obtained only from the final encoder layer (encoder 12).
- pooler_output is the representation of the [CLS] token from the final encoder layer, further processed by a linear layer and a tanh activation function.
- hidden_states contains the representation of all the tokens obtained from the input embedding layer and from every encoder layer.

Now, let us take a look into each of these values and understand them in more detail. 

First, let us look at last_hidden_state. As we learned, it holds the representation of all the tokens obtained only from the final encoder layer (encoder 12). Let us print the shape of the last_hidden_state: 


In [28]:
last_hidden_state.shape

torch.Size([1, 7, 768])


The size [1,7,768] indicates [batch_size, sequence_length, hidden_size].

Our batch size is 1. The sequence length is the number of tokens, which is 7 here. The hidden size is the size of the representation (embedding), which is 768 for the BERT-base model. 

We can obtain the embedding of each token as: 
- last_hidden_state[0][0] gives the representation of the first token, which is [CLS]
- last_hidden_state[0][1] gives the representation of the second token, which is 'I' 
- last_hidden_state[0][2] gives the representation of the third token, which is 'love' 

Similarly, we can obtain the representation of all the tokens from the final encoder layer. 

Next, we have pooler_output which contains the representation of the [CLS] token from the final encoder layer which is further processed by a linear and tanh activation function. Let us print the shape of the pooler_output: 


In [29]:
pooler_output.shape

torch.Size([1, 768])


The size [1,768] indicates [batch_size, hidden_size].

We learned that the [CLS] token holds the aggregate representation of the sentence. Thus, we can use the pooler_output as the representation of the given sentence 'I love Paris'. 
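
The "linear and tanh" step that produces pooler_output can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the input is a random stand-in for last_hidden_state, and `dense` is a freshly initialized layer with random weights, whereas the real pooling weights come from the pre-trained checkpoint.

```python
import torch

torch.manual_seed(0)

hidden_size = 768
# Stand-in for last_hidden_state: [batch_size, sequence_length, hidden_size]
fake_last_hidden_state = torch.randn(1, 7, hidden_size)

# Stand-in pooling layer with random weights (assumption for illustration)
dense = torch.nn.Linear(hidden_size, hidden_size)

# [CLS] is the first token of every sequence, so take position 0: [1, 768]
cls_vector = fake_last_hidden_state[:, 0]
fake_pooler_output = torch.tanh(dense(cls_vector))

print(fake_pooler_output.shape)
```

Because of the tanh, every value in the pooled vector lies between -1 and 1, and the sequence dimension disappears since only the [CLS] position is used.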

Finally, we have hidden_states, which contains the representation of all the tokens obtained from every layer. It is a tuple containing 13 values: the output of the input embedding layer $h_0$ followed by the outputs of the 12 encoder layers (hidden layers), up to the final encoder layer $h_{12}$. 


In [30]:
len(hidden_states)

13


As we can notice, it contains 13 values holding the representation of all layers. Thus: 

- hidden_states[0] contains the representation of all the tokens obtained from the input embedding layer 
- hidden_states[1] contains the representation of all the tokens obtained from the first encoder layer 
- hidden_states[2] contains the representation of all the tokens obtained from the second encoder layer 

Similarly, hidden_states[12] contains the representation of all the tokens obtained from the final encoder layer.

Let's explore this more. First, let's print the shape of hidden_states[0], which contains the representation of all the tokens obtained from the input embedding layer $h_0$: 


In [31]:
hidden_states[0].shape

torch.Size([1, 7, 768])


The size [1,7,768] indicates [batch_size, sequence_length, hidden_size].
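
The relationship between the number of encoder layers and the length of hidden_states can be checked without downloading any weights. The sketch below assumes a small, randomly initialized BERT built from a hypothetical BertConfig (2 encoder layers, hidden size 32); the names `tiny_model` and `dummy_ids` are illustrative only. With 2 encoder layers, the tuple has 2 + 1 = 3 entries, by the same logic that gives 12 + 1 = 13 for BERT-base.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical small configuration so that no pre-trained weights are needed
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout so repeated runs give the same output

# Random token ids shaped [batch_size, sequence_length]
dummy_ids = torch.randint(0, config.vocab_size, (1, 7))
with torch.no_grad():
    tiny_output = tiny_model(dummy_ids, output_hidden_states=True)

# One entry for the embedding layer plus one per encoder layer
print(len(tiny_output.hidden_states))
# The last entry holds the same values as last_hidden_state
print(torch.equal(tiny_output.hidden_states[-1], tiny_output.last_hidden_state))
```

The same check applies to the full model: hidden_states[12] and last_hidden_state hold the same values.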

Now, let's print the shape of hidden_states[1], which contains the representation of all tokens obtained from the first encoder layer $h_1$: 


In [33]:
hidden_states[1].shape

torch.Size([1, 7, 768])
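
The concatenation strategy discussed at the start of this section can be sketched on top of a hidden_states tuple. To stay self-contained, the block below uses random tensors as stand-ins for the 13 layer outputs; with the real model, the same two operations would be applied to output.hidden_states. The element-wise sum is shown only as a simple alternative and is an assumption of this sketch, not a result taken from the table.

```python
import torch

# Stand-in for the 13-element hidden_states tuple, each entry [1, 7, 768]
fake_hidden_states = tuple(torch.randn(1, 7, 768) for _ in range(13))

last_four = fake_hidden_states[-4:]

# Concatenate along the feature axis: each token gets a 4 * 768 = 3072-dim vector
concat_last_four = torch.cat(last_four, dim=-1)

# Alternative (assumption): sum the same four layers, keeping 768 dimensions
sum_last_four = torch.stack(last_four, dim=0).sum(dim=0)

print(concat_last_four.shape)
print(sum_last_four.shape)
```

Concatenation keeps each layer's features separate at the cost of a wider vector per token, while summing keeps the original width.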


Thus, in this way, we can obtain the embedding of tokens from all the encoder layers. We learned how to use the pre-trained BERT to extract embeddings. Can we also use the pre-trained BERT for a downstream task such as sentiment analysis? Yes! We will learn about this in the next section. 