# Extracting embeddings with ALBERT
With Hugging Face transformers, we can use the ALBERT model just as we used BERT. Let's explore this with a small example. Suppose we need the contextual embedding of every word in the sentence "Paris is a beautiful city". Let's see how to do that with ALBERT.

Install the transformers library and import the necessary modules:

In [1]:
!pip install transformers==3.5.1



In [1]:
from transformers import AlbertTokenizer, AlbertModel


Download and load the pre-trained ALBERT model and tokenizer. In this tutorial, we use the ALBERT-base model (the albert-base-v2 checkpoint):


In [2]:
model = AlbertModel.from_pretrained('albert-base-v2')
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.bias', 'predictions.LayerNorm.bias', 'predictions.decoder.weight', 'predictions.LayerNorm.weight', 'predictions.dense.bias', 'predictions.dense.weight', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Now, feed the sentence to the tokenizer and get the preprocessed input: 

In [3]:
sentence = "Paris is a beautiful city" 
inputs = tokenizer(sentence, return_tensors="pt")


Let's print the inputs:

In [4]:
print(inputs)

{'input_ids': tensor([[   2, 1162,   25,   21, 1632,  136,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}



Now we feed the inputs to the model and get the result. With the transformers version installed above, the model returns a tuple of two tensors: hidden_rep, which contains the hidden state representation of every token from the final encoder layer, and cls_head, which contains the final-layer representation of the [CLS] token after it has passed through the pooler (a linear layer followed by a tanh activation):


In [5]:
hidden_rep, cls_head = model(**inputs)
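
The tuple unpacking above relies on the default behaviour of the pinned transformers 3.5.1. From version 4.0 onward, models return a named output object by default, and unpacking it directly no longer yields the two tensors. A minimal sketch of a helper that accepts either form is shown below. Note that `unpack_outputs` is a hypothetical name, not part of the library, and the demo uses plain stand-in values rather than real model output so that it runs without downloading anything.

```python
from types import SimpleNamespace


def unpack_outputs(outputs):
    """Return (hidden_rep, cls_head) from a tuple or a named output object."""
    if isinstance(outputs, tuple):
        # Tuple return: (last hidden states, pooled [CLS] output, ...)
        return outputs[0], outputs[1]
    # Named output object exposing the same two values as attributes
    return outputs.last_hidden_state, outputs.pooler_output


# Stand-ins for the two return styles (strings instead of tensors).
tuple_style = ("hidden", "pooled")
named_style = SimpleNamespace(last_hidden_state="hidden", pooler_output="pooled")

print(unpack_outputs(tuple_style))
print(unpack_outputs(named_style))
```

With the real model, this could be called as `hidden_rep, cls_head = unpack_outputs(model(**inputs))`. For albert-base-v2 and the seven-token input printed earlier, hidden_rep should have shape [1, 7, 768] and cls_head should have shape [1, 768].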



We can obtain the contextual embedding of each token in the sentence, just as we did with BERT. The first index selects the sentence in the batch and the second selects the token position:

- hidden_rep[0][0] contains the contextual embedding of the token [CLS]
- hidden_rep[0][1] contains the contextual embedding of the token 'Paris' 
- hidden_rep[0][2] contains the contextual embedding of the token 'is' 

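Continuing in this manner, hidden_rep[0][5] contains the contextual embedding of the token 'city', and hidden_rep[0][6] contains the contextual embedding of the token [SEP]. The input has seven tokens in total, so 6 is the last valid position.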
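
The mapping between positions and tokens can be made explicit by pairing each row of hidden_rep[0] with its token. The sketch below is only an illustration: it uses a small stand-in for hidden_rep (hidden size 2 instead of 768) so that it runs without the model. The token strings are an assumption based on the seven input_ids printed earlier ([CLS], five word tokens, [SEP]), and `pair_tokens_with_vectors` is a hypothetical helper, not part of transformers.

```python
def pair_tokens_with_vectors(tokens, hidden_rep):
    """Pair each token with its vector from the first (only) sentence in the batch."""
    vectors = hidden_rep[0]  # one row per token position
    if len(tokens) != len(vectors):
        raise ValueError("expected one vector per token")
    return list(zip(tokens, vectors))


# Assumed token sequence for "Paris is a beautiful city": 7 positions in total.
tokens = ["[CLS]", "paris", "is", "a", "beautiful", "city", "[SEP]"]

# Stand-in for hidden_rep: batch of 1, 7 positions, row i holds [i, i].
fake_hidden_rep = [[[float(i), float(i)] for i in range(len(tokens))]]

pairs = pair_tokens_with_vectors(tokens, fake_hidden_rep)
for position, (token, vector) in enumerate(pairs):
    print(position, token, vector)
```

With the real model, the tokens could instead be obtained with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())` and passed in together with the actual hidden_rep tensor. The real token strings will look slightly different, since ALBERT's SentencePiece tokenizer lowercases the text and marks word-initial pieces with a leading '▁'.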

In this way, we can use the ALBERT model just as we used the BERT model. We can also fine-tune ALBERT on any downstream task in the same way we fine-tuned BERT. Now that we have learned how ALBERT works, let's explore RoBERTa, another interesting variant of BERT, in the next section.