# Exploring Multilingual BERT models

Until now, we have discussed how BERT works and studied its variants. But note that we have only ever applied BERT to English.
<br><br>
_Did you really think computers only speak English?_
<br><br>
Of course not.
<br><br>
BERT is a deep learning model, and it will work with whichever language it is trained on. Give it volumes of data in English, and it speaks English. Give it volumes of data in Tamil, and BERT will speak Tamil!


<br><br>

**Contents of this Notebook:**

Topic 1: Multilingual BERT (M-BERT)<br>
Topic 2: Understanding XLM-RoBERTa (XLM-R)<br>
Topic 3: Exploring language-specific BERT models
<br><br>

### Topic 1: Multilingual BERT (M-BERT)
<br>
The original BERT model is pretrained on an English corpus, so naturally it provides representations only for English text. Suppose our input text is in, say, Tamil or Hindi. Here, we can use M-BERT to extract embeddings. Just like BERT, M-BERT is trained with the masked language modeling (MLM) and next sentence prediction (NSP) tasks, but on Wikipedia text from 104 languages. Nothing about the original architecture or training objectives is changed; the model is simply trained on text in multiple languages.
<br><br>
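To make the MLM objective a little more concrete, here is a simplified, plain-Python sketch of how training inputs are corrupted. It follows the masking recipe from the original BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The function name `mask_tokens`, the toy vocabulary, and the higher `mask_prob` used in the demo are illustrative assumptions, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using BERT's 80/10/10 masking rule."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected
        # with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(tok)
            labels.append(None)          # not part of the MLM loss
            continue
        labels.append(tok)               # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels

# Toy example (hypothetical vocabulary; mask_prob raised so something gets masked)
tokens = ["[CLS]", "Ich", "lie", "##be", "diese", "Werkstatt", "[SEP]"]
toy_vocab = ["Ich", "lie", "##be", "diese", "Werkstatt", "Haus", "und"]
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

During pretraining, the loss is computed only at positions where `labels` is not `None`.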
Now, let's see M-BERT in action.
<br><br>

You know the drill. Install the transformers library first!

In [1]:
!pip install transformers





Let us now use the Transformers library to extract embeddings from M-BERT for a sentence in German.

In [None]:
from transformers import BertModel, BertTokenizer

# Setting the model
model = BertModel.from_pretrained('bert-base-multilingual-cased')

# Setting the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# The sentence 'I love this workshop' in German
sentence = "Ich liebe diese Werkstatt"

# Tokenizing the sentence, with PyTorch tensor outputs
inputs = tokenizer(sentence, return_tensors="pt")

# The action unfolds here: a forward pass through M-BERT
outputs = model(**inputs)

Time to analyze! Let's see what each command gave us.

In [None]:
# The tokenizer output's first field is the input token IDs.
# These are the token IDs for the German tokens.

inputs['input_ids'].tolist()[0]

[101, 21023, 56147, 11044, 12750, 109871, 102]

In [None]:
# We convert the above list of token IDs to tokens, so you can see
# how the sentence is tokenized. Notice that the word 'liebe' is split
# into two sub-words, 'lie' and '##be'.

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].tolist()[0])
tokens

['[CLS]', 'Ich', 'lie', '##be', 'diese', 'Werkstatt', '[SEP]']
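The `lie` / `##be` split comes from WordPiece, which breaks each word into the longest sub-words found in the vocabulary, scanning left to right. Below is a minimal sketch of that greedy longest-match-first idea. The tiny vocabulary is a made-up assumption for illustration (the real M-BERT vocabulary is far larger), and the real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces get a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                        # no sub-word matched at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"Ich", "lie", "##be", "diese", "Werkstatt"}
sentence = "Ich liebe diese Werkstatt"
print([p for w in sentence.split() for p in wordpiece(w, toy_vocab)])
# ['Ich', 'lie', '##be', 'diese', 'Werkstatt']
```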

In [None]:
#This is the output given by the tokenizer. It has the input_ids,
#attention_mask, and token_type_ids.

inputs

{'input_ids': tensor([[   101,  21023,  56147,  11044,  12750, 109871,    102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
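With a single, unpadded sentence, `token_type_ids` is all zeros and `attention_mask` is all ones, so their purpose is easy to miss. The plain-Python sketch below shows how the three fields would look for a sentence pair padded to a fixed length. The helper `encode_pair` is hypothetical; the IDs 101 and 102 for `[CLS]` and `[SEP]` are taken from the output above, while the padding ID 0 and the IDs of the second sentence are assumptions for illustration.

```python
def encode_pair(ids_a, ids_b=None, max_len=12, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style inputs for one sentence or a sentence pair."""
    input_ids = [cls_id] + list(ids_a) + [sep_id]
    token_type_ids = [0] * len(input_ids)            # segment A
    if ids_b is not None:
        input_ids += list(ids_b) + [sep_id]
        token_type_ids += [1] * (len(ids_b) + 1)     # segment B, including its [SEP]
    attention_mask = [1] * len(input_ids)            # 1 = real token
    padding = max_len - len(input_ids)
    if padding < 0:
        raise ValueError("sequence is longer than max_len")
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding, ignored by attention
    }

german_ids = [21023, 56147, 11044, 12750, 109871]    # token IDs from the cell above
single = encode_pair(german_ids, max_len=7)
pair = encode_pair(german_ids, ids_b=[11, 12], max_len=12)  # 11 and 12 are made-up IDs
print(single)
print(pair)
```

The `single` result mirrors the tokenizer output shown above, just with plain lists instead of tensors.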

In [None]:
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0876, -0.0718,  0.0414,  ...,  0.4946, -0.0276, -0.0142],
         [-0.3443, -0.0649,  0.5519,  ...,  0.9238, -0.2649, -0.3662],
         [-0.3084,  0.3588, -1.3955,  ...,  0.5014, -0.1821,  0.2935],
         ...,
         [-0.2598, -0.4356, -0.7024,  ...,  0.5763,  0.1468, -0.0111],
         [-0.3522,  0.0159, -0.2131,  ...,  0.6841,  0.2620, -0.1083],
         [ 0.0884, -0.2654,  0.5618,  ...,  0.6821,  0.0422,  0.0561]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 3.1423e-01, -8.6470e-02,  3.0596e-01, -2.2895e-01, -7.6184e-02,
          4.4001e-01,  2.4908e-01,  1.9022e-01, -4.7880e-01,  3.7212e-01,
          2.9647e-03, -1.8341e-01, -6.4193e-02, -1.6221e-01,  2.5880e-01,
         -1.8333e-01,  5.9404e-01,  2.1045e-01,  1.4865e-01, -3.9375e-01,
         -9.9999e-01, -2.4089e-01, -3.2763e-01, -8.7056e-02, -3.2658e-01,
          1.3667e-01, -1.7459e-01,  1.5809e-03,  1.8092e-01, -8.387

In [None]:
#The last hidden state output of M-BERT for the sample sentence.
outputs['last_hidden_state']

tensor([[[ 0.0876, -0.0718,  0.0414,  ...,  0.4946, -0.0276, -0.0142],
         [-0.3443, -0.0649,  0.5519,  ...,  0.9238, -0.2649, -0.3662],
         [-0.3084,  0.3588, -1.3955,  ...,  0.5014, -0.1821,  0.2935],
         ...,
         [-0.2598, -0.4356, -0.7024,  ...,  0.5763,  0.1468, -0.0111],
         [-0.3522,  0.0159, -0.2131,  ...,  0.6841,  0.2620, -0.1083],
         [ 0.0884, -0.2654,  0.5618,  ...,  0.6821,  0.0422,  0.0561]]],
       grad_fn=<NativeLayerNormBackward0>)

In [None]:
# Shape of the final hidden state. One 768-dimensional vector for each token.
# 7 tokens, 7 such vectors.

outputs['last_hidden_state'].shape

torch.Size([1, 7, 768])
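If you need a single vector for the whole sentence rather than one per token, one common option (an assumption here, not something M-BERT prescribes) is to average the token vectors while skipping padded positions. The sketch below shows the idea on small plain-Python lists so it runs without a model; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

# Toy input: 3 tokens with 2-dimensional vectors; the last token is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(mean_pool(toy_hidden, toy_mask))  # [2.0, 3.0]

# With the real tensors from this notebook, the same idea could be written as:
#   mask = inputs['attention_mask'].unsqueeze(-1)
#   sentence_vec = (outputs['last_hidden_state'] * mask).sum(1) / mask.sum(1)
```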

### Understanding XLM-R
<br>
XLM-R stands for XLM-RoBERTa! At the time of its release, it was the state of the art for learning cross-lingual representations. XLM-R is trained on a huge 2.5 TB dataset, obtained by filtering the unlabeled text of 100 languages from the CommonCrawl dataset. Think of it as RoBERTa revisited, this time trained as a cross-lingual language model (XLM).
<br><br>




In [None]:
from transformers import AutoTokenizer, AutoModel

# Setting the tokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Setting the model
model = AutoModel.from_pretrained("xlm-roberta-base")

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# The sentence 'I love this workshop' in German
sentence = "Ich liebe diese Werkstatt"

# Tokenizing the sentence, with PyTorch tensor outputs
inputs = tokenizer(sentence, return_tensors="pt")

# The action unfolds here: a forward pass through XLM-R
outputs = model(**inputs)

In [None]:
outputs['last_hidden_state']

tensor([[[ 0.0626,  0.0890,  0.0571,  ..., -0.0692,  0.0796,  0.0133],
         [-0.0579,  0.0133, -0.0094,  ...,  0.1398,  0.0549,  0.2249],
         [-0.0974,  0.0820, -0.0547,  ..., -0.0218,  0.0064,  0.1438],
         ...,
         [-0.0935,  0.0314,  0.0554,  ...,  0.1775,  0.0634,  0.3032],
         [-0.0125, -0.0217,  0.0659,  ...,  0.0942,  0.0067,  0.3934],
         [ 0.0488,  0.0805, -0.0021,  ..., -0.1535,  0.0045,  0.0593]]],
       grad_fn=<NativeLayerNormBackward0>)

In [None]:
outputs['last_hidden_state'].shape

torch.Size([1, 7, 768])

Let's see the XLM-RoBERTa model in action now. We will test it on masked language modelling, the same task on which it was pre-trained.

In [None]:
from transformers import pipeline

# We use a transformers pipeline to apply the model directly to the task of our
# liking.

unmasker = pipeline('fill-mask', model='xlm-roberta-base')


Let's play with it!

Since XLM-R is pre-trained on masked language modelling, the base model (which we just loaded) can be tested on the same task. In the code cell below, the unmasker is given the Hindi sentence for 'Hello I am a good person'. But here's the twist: the word 'good' is masked. That is, the input is 'Hello I am a &#60;mask&#62; person' in Hindi. Find out what the model guesses!

In [None]:
unmasker("नमस्ते मैं एक <mask> इंसान हूँ")

[{'score': 0.3160252571105957,
  'token': 80408,
  'token_str': 'साधारण',
  'sequence': 'नमस्ते मैं एक साधारण इंसान हूँ'},
 {'score': 0.12316010147333145,
  'token': 26046,
  'token_str': 'आम',
  'sequence': 'नमस्ते मैं एक आम इंसान हूँ'},
 {'score': 0.0838639959692955,
  'token': 38338,
  'token_str': 'सामान्य',
  'sequence': 'नमस्ते मैं एक सामान्य इंसान हूँ'},
 {'score': 0.02690257504582405,
  'token': 166498,
  'token_str': 'असल',
  'sequence': 'नमस्ते मैं एक असल इंसान हूँ'},
 {'score': 0.02464522421360016,
  'token': 99832,
  'token_str': 'महान',
  'sequence': 'नमस्ते मैं एक महान इंसान हूँ'}]
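Each `score` above is a probability: the model produces one logit per vocabulary token at the masked position, a softmax turns the logits into probabilities, and the pipeline reports the highest-scoring tokens. The sketch below mimics that last step on a handful of made-up logits. The values and the helper `top_k_from_logits` are purely illustrative assumptions; the real model scores its full vocabulary.

```python
import math

def top_k_from_logits(logits, k=5):
    """Softmax over a {token: logit} dict, then return the k most likely tokens."""
    largest = max(logits.values())
    # Subtracting the largest logit keeps exp() numerically stable.
    exps = {token: math.exp(value - largest) for token, value in logits.items()}
    normaliser = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)[:k]
    return [{"token_str": token, "score": e / normaliser} for token, e in ranked]

# Hypothetical logits for the masked position (not real model outputs)
toy_logits = {"साधारण": 4.0, "आम": 3.0, "सामान्य": 2.6, "महान": 1.4, "बुरा": 0.2}
for prediction in top_k_from_logits(toy_logits, k=3):
    print(prediction)
```

Because the softmax is taken over every candidate token, the scores of the top few predictions add up to less than 1, as in the real output above.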

Now that you've seen it in Hindi, below is the same masked sentence in four more languages, with the Hindi version repeated for reference. Try them and see what the model has to say!
<br><br>

_Vietnamese_ - xin chào tôi là một người &#60;mask&#62;<br>
_Dutch_ - hallo ik ben een &#60;mask&#62; mens<br>
_German_ - Hallo, ich bin eine &#60;mask&#62; Person<br>
_Hindi_ - नमस्ते मैं एक &#60;mask&#62; इंसान हूँ<br>
_Tamil_ - வணக்கம் நான் ஒரு &#60;mask&#62; மனிதர்
<br>
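One way to try all of these in a single cell is to loop over the sentences and keep the top guess for each. The sketch below assumes the `fill-mask` pipeline returns a list of dictionaries sorted best-first, as in the output above. `best_guesses` is a hypothetical helper, and `fake_unmasker` is only a stand-in so the sketch runs without downloading a model.

```python
MASKED_SENTENCES = {
    "Vietnamese": "xin chào tôi là một người <mask>",
    "Dutch": "hallo ik ben een <mask> mens",
    "German": "Hallo, ich bin eine <mask> Person",
    "Hindi": "नमस्ते मैं एक <mask> इंसान हूँ",
    "Tamil": "வணக்கம் நான் ஒரு <mask> மனிதர்",
}

def best_guesses(unmasker, sentences):
    """Map each language to the top-scoring token for its masked sentence."""
    guesses = {}
    for language, sentence in sentences.items():
        results = unmasker(sentence)               # list of dicts, best first
        guesses[language] = results[0]["token_str"]
    return guesses

def fake_unmasker(sentence):
    # Hypothetical stand-in that always "predicts" a question mark.
    return [{"token_str": "?", "score": 1.0,
             "sequence": sentence.replace("<mask>", "?")}]

print(best_guesses(fake_unmasker, MASKED_SENTENCES))
```

In the notebook, pass the `unmasker` pipeline created earlier instead of `fake_unmasker` to see the model's actual guesses.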


### Language-specific models

<br>

If pre-training BERT on English is possible, and pre-training it on many languages at once is possible, then it is obviously possible to pre-train it on just our target language.

Et voilà!


*   FlauBERT for French
*   BETO for Spanish
*   BERTje for Dutch
*   German BERT
*   Chinese BERT
*   Japanese BERT
*   FinBERT for Finnish
*   UmBERTo for Italian (🤌)
*   BERTimbau for Portuguese
*   RuBERT for Russian

<br><br>
Let us try FlauBERT and extract embeddings from it.




In [None]:
from transformers import FlaubertModel, FlaubertTokenizer
import torch

In [None]:
# Setting the model

model = FlaubertModel.from_pretrained('flaubert/flaubert_base_cased')

Some weights of the model checkpoint at flaubert/flaubert_base_cased were not used when initializing FlaubertModel: ['pred_layer.proj.bias', 'pred_layer.proj.weight']
- This IS expected if you are initializing FlaubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing FlaubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Setting the tokenizer

tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_base_cased')

In [None]:
sentence = "Paris est ma ville préférée"

# Tokenizer output
token_ids = tokenizer(sentence, return_tensors = "pt")

In [None]:
# Token IDs are converted back to tokens to visualise how the sentence is split.

tokens = tokenizer.convert_ids_to_tokens(token_ids['input_ids'].tolist()[0])
tokens

['<s>', 'Paris</w>', 'est</w>', 'ma</w>', 'ville</w>', 'préférée</w>', '</s>']

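Notice how the tokens look different here? We don't have a [CLS] and a [SEP] token; the sentence is wrapped in `<s>` and `</s>` instead, and each token ends with `</w>` to mark the end of a word. This is because the FlauBERT tokenizer, unlike vanilla BERT and M-BERT (which use WordPiece), uses a different tokenization algorithm called byte-pair encoding (BPE). Read more about how it works in the FlauBERT documentation.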
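As a rough illustration of byte-pair encoding, the sketch below learns a few merges from a tiny, made-up word-frequency table: it repeatedly finds the most frequent adjacent symbol pair and fuses it into one symbol. The corpus, the number of merges, and the helper names are assumptions for illustration. FlauBERT's real tokenizer applies a large, pre-learned merge table along with its own preprocessing.

```python
from collections import Counter

def to_symbols(word):
    """Split a word into characters, tagging the last one with '</w>'."""
    return tuple(word[:-1]) + (word[-1] + "</w>",)

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word frequencies
corpus = {to_symbols("ville"): 5, to_symbols("villa"): 2, to_symbols("mille"): 3}
merges = []
for _ in range(3):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)
print(corpus)
```

Frequent words end up as a single symbol such as `ville</w>`, which is why common French words in the output above are not split at all.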

In [None]:
# Getting the representation for the tokens.

representation = model(**token_ids)

In [None]:
representation['last_hidden_state']

tensor([[[-1.7865,  0.9655,  0.0715,  ..., -2.1119, -0.6817,  2.0862],
         [-1.8693,  1.2849, -0.0500,  ..., -2.4135, -0.6691,  2.3915],
         [-2.3948,  0.9265, -0.0772,  ..., -2.9410, -0.7402,  2.6822],
         ...,
         [-2.6473,  1.2281, -0.2248,  ..., -2.6515, -0.7952,  2.7135],
         [-2.4451,  1.1875, -0.4932,  ..., -2.4650, -0.6839,  2.8343],
         [-2.0650,  1.2847, -0.5558,  ..., -2.2921, -0.5243,  2.5505]]],
       grad_fn=<MulBackward0>)

In [None]:
# Shape of the representation

representation['last_hidden_state'].shape

torch.Size([1, 7, 768])

<br>
In this notebook, we looked at how to obtain representations of text in other languages using different variants of BERT. The introduction of multilingual BERT models challenged the notion that such technology caters only to the English-speaking population. Now, you can fine-tune these models to build question answering systems, chatbots, and summarizers in your own mother tongue, and help local communities with technologies they never imagined they would use. I encourage you to look further into this and think innovatively.