## Understanding multilingual BERT 

The BERT model we have seen so far gives representations only for English text. Suppose our input text is in a different language, say, French. How can we use BERT to obtain a representation of the French text? This is where we use M-BERT.

Multilingual BERT, or M-BERT for short, is used to obtain representations of text in many languages, not just English. We learned that the BERT model is trained with the masked language modeling and next sentence prediction tasks using the English Wikipedia text and the Toronto BookCorpus. M-BERT is trained with the same two tasks, but instead of using only the English Wikipedia text, it is trained on the Wikipedia text of 104 different languages.

But the amount of Wikipedia text differs from language to language, right? Yes! High-resource languages like English have far more Wikipedia text than low-resource languages like Swahili. If we train our model on this dataset as it is, the high-resource languages dominate training, which leads to the problem of overfitting. To avoid overfitting, we use sampling methods: we apply undersampling to the high-resource languages and oversampling to the low-resource languages.
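The resampling idea can be sketched in a few lines. Google's notes on the multilingual release describe exponentially smoothed weighting: each language's share of the data is raised to a power below 1 (0.7 in their description) and the shares are then renormalized. The sketch below assumes that scheme; the function name and the corpus sizes are made up for illustration and are not the real Wikipedia sizes.

```python
def smoothed_sampling_probs(corpus_sizes, exponent=0.7):
    """Return per-language sampling probabilities after exponential smoothing.

    corpus_sizes: dict mapping language code -> amount of text (any unit).
    exponent: smoothing factor; values below 1 flatten the distribution.
    """
    total = sum(corpus_sizes.values())
    # Raw share of each language in the combined corpus.
    raw = {lang: size / total for lang, size in corpus_sizes.items()}
    # Raise each share to the smoothing exponent, then renormalize.
    powered = {lang: p ** exponent for lang, p in raw.items()}
    norm = sum(powered.values())
    return {lang: value / norm for lang, value in powered.items()}


# Hypothetical corpus sizes, chosen only to show the effect.
corpus_sizes = {"en": 1000, "fr": 300, "sw": 10}
total_size = sum(corpus_sizes.values())
raw = {lang: size / total_size for lang, size in corpus_sizes.items()}
smoothed = smoothed_sampling_probs(corpus_sizes)

for lang in corpus_sizes:
    print(f"{lang}: raw share {raw[lang]:.4f} -> sampling probability {smoothed[lang]:.4f}")
```

With these toy numbers, the sampling probability of English drops below its raw share (undersampling), while that of Swahili rises above its raw share (oversampling).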

Since M-BERT is trained on the Wikipedia text of 104 different languages, it learns the general syntactic structure of different languages. M-BERT uses a single WordPiece vocabulary of 110K tokens that is shared across all 104 languages.

M-BERT understands context across different languages without any paired or language-aligned training data. It is important to note that we have not trained M-BERT with any cross-lingual objective; it is trained just like the BERT model. Even so, M-BERT produces a representation that generalizes across multiple languages for downstream tasks.

The pre-trained M-BERT model is open-sourced by Google and can be downloaded from here - https://github.com/google-research/bert/blob/master/multilingual.md. Google provides the following configurations of the pre-trained M-BERT model:

- BERT-base, Multilingual cased 
- BERT-base, Multilingual uncased

Both of the preceding models consist of 12 encoder layers, 12 attention heads, and a hidden size of 768. Google lists each of them as having a total of 110 million parameters.

The pre-trained M-BERT is also compatible with Hugging Face's transformers library, so we can use it just like we use BERT. Let us see how to use the pre-trained M-BERT model to obtain a sentence representation.

First, let's install the transformers library and import the necessary modules:

In [1]:
%%capture
!pip install transformers==3.5.1

In [2]:
from transformers import BertTokenizer, BertModel


Download and load the pre-trained M-BERT model:

In [3]:
model = BertModel.from_pretrained('bert-base-multilingual-cased')

Download and load the pre-trained M-BERT model's tokenizer: 

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

Define the input sentence. Let us use a French sentence as the input:


In [5]:
sentence = "C'est une si belle journée"


Tokenize the sentence and get the tokens:

In [6]:
inputs = tokenizer(sentence, return_tensors="pt")


Feed the tokens to the model and get the representation: 

In [7]:
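# return_dict=False makes the model return a plain tuple, including on transformers 4.x and later
hidden_rep, cls_head = model(**inputs, return_dict=False)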


The hidden_rep contains the representation of all the tokens in our sentence, and the cls_head contains the pooled representation of the [CLS] token (its final hidden state passed through the model's pooling layer), which holds the aggregate representation of the sentence.
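To make the two outputs more concrete, here is a minimal sketch that inspects their shapes. To avoid a large download, it assumes a tiny, randomly initialized BertModel rather than the pre-trained M-BERT, so the configuration values and the input IDs below are arbitrary and the resulting vectors carry no meaning. It also checks that, in this implementation, the second output is the first token's hidden state passed through the pooler's dense layer and tanh activation.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny illustrative configuration (NOT the M-BERT configuration).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
tiny_model = BertModel(config)
tiny_model.eval()  # disable dropout so the outputs are deterministic

# Arbitrary token IDs standing in for a tokenized sentence of 4 tokens.
input_ids = torch.tensor([[1, 5, 6, 2]])

with torch.no_grad():
    hidden_rep, cls_head = tiny_model(input_ids=input_ids, return_dict=False)

    # Recompute the pooled output by hand from the first token's hidden state.
    first_token = hidden_rep[:, 0]
    manual_pooled = torch.tanh(tiny_model.pooler.dense(first_token))

print(tuple(hidden_rep.shape))  # (batch size, number of tokens, hidden size)
print(tuple(cls_head.shape))    # (batch size, hidden size)
print(torch.allclose(manual_pooled, cls_head, atol=1e-6))
```

With the pre-trained M-BERT model the same shapes apply, with a hidden size of 768 and a token count that depends on how the sentence is tokenized.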


In this way, we can use the pre-trained M-BERT just like other BERT models, and we can fine-tune it for downstream tasks. Now that we have understood how M-BERT works, in the next section, we will evaluate it.