# Fine-tuning a MLM (Masked Language Model) like BERT (base or large) with the library adapter-transformers (notebook version)

- **Credit**: [Hugging Face](https://huggingface.co/) and [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)
- **Author**: [Pierre GUILLOU](https://www.linkedin.com/in/pierreguillou/)
- **Date**: 01/07/2021
- **Blog post**: []()
- **Link to the folder in github with this notebook and all necessary scripts**: [language-modeling with adapters](https://github.com/piegu/language-models/tree/master/adapters/language-modeling/)
- **Link to the adapters in the AdapterHub**: 

## 1. Context

### Objective

The objective here is to **fine-tune a Masked Language Model (MLM) like BERT (base or large) by training adapters (library [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)) instead of the embeddings and transformer layers of the MLM model**, and to compare the results with a BERT model fully fine-tuned on the same task.

The benefit is clear: if you need models for different NLP tasks, instead of fine-tuning and storing one model per task, **you store a single MLM model plus the trained task adapters, each about 3% of the size of the MLM model**. In addition, loading these adapters in production is very easy.

### Content

In this notebook, we'll see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a language modeling task. We cover one type of language modeling task:

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.

![Widget inference representing the masked language modeling task](images/masked_language_modeling_adapter.png)

We will see how to easily load and preprocess the dataset for this task, and how to use the `Trainer` API to fine-tune a model on it.

### History and Credit

This notebook is an adaptation of the following notebooks and scripts for **fine-tuning a (transformer) Masked Language Model (MLM) like BERT (base or large) with any dataset** (we use here the texts of the [Portuguese Squad 1.1 dataset](https://forum.ailab.unb.br/t/datasets-em-portugues/251/4)):
- **from [adapter-transformers](https://github.com/Adapter-Hub/adapter-transformers)** | notebook [01_Adapter_Training.ipynb](https://github.com/Adapter-Hub/adapter-transformers/blob/master/notebooks/01_Adapter_Training.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) (this script was adapted from the script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) of HF)
- **from [transformers](https://github.com/huggingface/transformers) of Hugging Face** | notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) 

To speed up fine-tuning on a single GPU, the library [DeepSpeed](https://www.deepspeed.ai/) could be used with the configuration provided by HF in the notebook [transformers + deepspeed CLI](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb). However, as the library adapter-transformers is not synchronized with the latest version of the HF transformers library, we keep that option for the future.

*Note: the section about Causal Language Modeling (CLM) is not included in this notebook, and all code not needed for Masked Language Modeling (MLM) has been deleted from the original notebook.*

### Major changes from original notebooks and scripts

The notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) and script [run_mlm.py](https://github.com/Adapter-Hub/adapter-transformers/blob/master/examples/language-modeling/run_mlm.py) evaluate the model on the validation loss at the end of each epoch, not on the accuracy metric.

As a metric is a better criterion than the loss for selecting a model, we decided to add the accuracy metric to this notebook for model evaluation.

Thus, we updated the notebook [language_modeling.ipynb](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb) to [language_modeling_adapter.ipynb](https://github.com/piegu/language-models/blob/master/adapters/language_modeling/language_modeling_adapter.ipynb) with the following changes:
- **Accuracy**: model evaluation through eval accuracy
- **EarlyStopping**: training stops when the metric chosen in `metric_for_best_model` (eval accuracy or validation loss) has not improved for `early_stopping_patience` consecutive evaluations (set to 5 in this notebook)
- **MAD-X 2.0**: option to leave the last transformer layer without an adapter (read page 6 of [UNKs Everywhere: Adapting Multilingual Language Models to New Scripts](https://arxiv.org/pdf/2012.15562.pdf))

## 2. Installation

In [1]:
import pathlib
from pathlib import Path

#root path
root = Path.cwd()

In [2]:
import pickle
import pandas as pd
import numpy as np
import random

In [3]:
import sys; print('python:',sys.version)

import torch; print('Pytorch:',torch.__version__)

import transformers; print('adapter-transformers:',transformers.__version__)
import transformers; print('HF transformers:',transformers.__hf_version__)
import tokenizers; print('tokenizers:',tokenizers.__version__)
import datasets; print('datasets:',datasets.__version__)

# import deepspeed; print('deepspeed:',deepspeed.__version__)

# Versions used in the virtual environment of this notebook:

# python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
# [GCC 7.5.0]
# Pytorch: 1.9.0
# adapter-transformers: 2.0.1
# transformers: 4.5.1
# tokenizers: 0.10.3
# datasets: 1.8.0

python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
Pytorch: 1.9.0
adapter-transformers: 2.0.1
HF transformers: 4.5.1
tokenizers: 0.10.3
datasets: 1.8.0


## 3. Model & dataset

In [4]:
# Select a MLM BERT base or large in the dataset language
model_checkpoint = "neuralmind/bert-base-portuguese-cased"
# model_checkpoint = "neuralmind/bert-large-portuguese-cased"

# SQuAD 1.1 in Portuguese
dataset_name = "squad11pt" # SQuAD v1.1 in Portuguese

## 4. Main hyperparameters

In [5]:
task = "mlm"

In [6]:
# training arguments
batch_size = 32
gradient_accumulation_steps = 1

learning_rate = 1e-4
num_train_epochs = 100.
early_stopping_patience = 5

adam_epsilon = 1e-6

fp16 = True
ds = False # DeepSpeed

# best model
load_best_model_at_end = True
# defined unconditionally because the output folder name and TrainingArguments use them
metric_for_best_model = "loss" # could be accuracy, too
# accuracy must be maximized, loss must be minimized
greater_is_better = metric_for_best_model == "accuracy"
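
The way `early_stopping_patience`, `metric_for_best_model` and `greater_is_better` interact can be illustrated with a small pure-Python sketch. This is not the `EarlyStoppingCallback` implementation: the function name `simulate_early_stopping` and the loss values are hypothetical, and the sketch assumes one evaluation per epoch.

```python
def simulate_early_stopping(eval_values, patience, greater_is_better=False):
    """Return (best_epoch, last_epoch_run) for a list of per-epoch metric values."""
    best_value, best_epoch, evals_without_improvement = None, None, 0
    for epoch, value in enumerate(eval_values, start=1):
        if best_value is None:
            improved = True                      # the first evaluation is always the best so far
        elif greater_is_better:
            improved = value > best_value        # e.g. accuracy
        else:
            improved = value < best_value        # e.g. loss
        if improved:
            best_value, best_epoch, evals_without_improvement = value, epoch, 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_epoch, epoch         # training stops here
    return best_epoch, len(eval_values)

# hypothetical validation losses, one per epoch
losses = [2.00, 1.80, 1.70, 1.75, 1.72, 1.71, 1.74, 1.73, 1.90]
best_epoch, last_epoch = simulate_early_stopping(losses, patience=5)
print(best_epoch, last_epoch)
```

With `load_best_model_at_end = True`, the checkpoint kept at the end corresponds to the best epoch, not to the last epoch run.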

In [7]:
# train adapter
train_adapter = True # we want to train an adapter
load_adapter = None # we do not load an existing adapter
load_lang_adapter = None # we do not load an existing lang adapter

# if True, do not put adapter in the last transformer layer
madx2 = True

## 5. Configuration

### GPU

In [8]:
# gpu
n_gpu = 1 # train on just one GPU
gpu = 0 # select the GPU

In [9]:
# Run this notebook in GPU 0
# As we do not launch a python script in this notebook, this cell is not mandatory
import os
os.environ['MASTER_ADDR'] = 'localhost'
if gpu == 0:
    os.environ['MASTER_PORT'] = '9996' # modify if RuntimeError: Address already in use # GPU 0
elif gpu == 1:
    os.environ['MASTER_PORT'] = '9995'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = str(gpu)
os.environ['WORLD_SIZE'] = "1"

### Training arguments of the HF trainer

In [10]:
# setup the training argument
do_train = True 
do_eval = True 

# epochs, bs, GA
evaluation_strategy = "epoch" # no

# fp16
fp16_opt_level = 'O1'
fp16_backend = "auto"
fp16_full_eval = False

# optimizer (AdamW)
weight_decay = 0.01 # 0.0
adam_beta1 = 0.9
adam_beta2 = 0.999

# scheduler
lr_scheduler_type = 'linear'
warmup_ratio = 0.0
warmup_steps = 0

# logs
logging_strategy = "steps"
logging_first_step = True # False
logging_steps = 500     # if strategy = "steps"
eval_steps = logging_steps # logging_steps

# checkpoints
save_strategy = "epoch" # steps
save_steps = 500 # if save_strategy = "steps"
save_total_limit = 1 # None

# no cuda, seed
no_cuda = False
seed = 42

# bar
disable_tqdm = False # True
remove_unused_columns = True
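
With `lr_scheduler_type = 'linear'` and no warmup, the learning rate decreases linearly from `learning_rate` to 0 over the training steps. Here is a minimal sketch of such a schedule, assuming the usual rule of linear warmup followed by linear decay. The function name `linear_schedule_lr` and the step counts are illustrative and not taken from the notebook.

```python
def linear_schedule_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given optimizer step for linear warmup + linear decay."""
    if step < warmup_steps:
        # linear increase from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # linear decrease from base_lr to 0 after warmup
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# illustrative values: 1000 training steps, learning_rate = 1e-4, no warmup
for step in (0, 250, 500, 1000):
    print(step, linear_schedule_lr(step, total_steps=1000, base_lr=1e-4))
```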

In [11]:
# folder for training outputs

outputs = model_checkpoint.replace('/','-') + '_' + dataset_name + '/'  
outputs = outputs + str(task) \
+ '_lr' + str(learning_rate) \
+ '_bs' + str(batch_size) \
+ '_GAS' + str(gradient_accumulation_steps) \
+ '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '_madx2' + str(madx2) \
+ '_ds' + str(ds) \
+ '_fp16' + str(fp16) \
+ '_best' + str(load_best_model_at_end) \
+ '_metric' + str(metric_for_best_model)

# path to outputs
path_to_outputs = root/'models_outputs'/outputs

# subfolder for model outputs
output_dir = path_to_outputs/'output_dir' 
overwrite_output_dir = True # False

# logs
logging_dir = path_to_outputs/'logging_dir'

### Lang adapter config

In [12]:
# lang adapter config
adapter_config = "pfeiffer+inv" # houlsby+inv is possible, too
adapter_non_linearity = 'gelu' # relu is possible, too
adapter_reduction_factor = 2
language = 'pt' # pt = Portuguese
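
To get a feeling for what `adapter_reduction_factor` means, here is a rough parameter count. It assumes one bottleneck adapter per transformer layer (a down-projection and an up-projection with biases), a hidden size of 768 and 12 layers as in a BERT base model. It ignores the invertible adapters and any other adapter component, so it is an order-of-magnitude sketch, not the exact count of the library. The function name `bottleneck_adapter_params` is hypothetical.

```python
def bottleneck_adapter_params(hidden_size, reduction_factor):
    """Weights + biases of one down-projection and one up-projection."""
    bottleneck = hidden_size // reduction_factor
    down = hidden_size * bottleneck + bottleneck
    up = bottleneck * hidden_size + hidden_size
    return down + up

hidden_size = 768      # assumption: BERT base
num_layers = 12        # assumption: BERT base
madx2 = True           # MAD-X 2.0: no adapter in the last layer

adapted_layers = num_layers - 1 if madx2 else num_layers
per_layer = bottleneck_adapter_params(hidden_size, reduction_factor=2)
total = per_layer * adapted_layers
print(f"{per_layer:,} parameters per layer, {total:,} in total")
```

A larger reduction factor gives a smaller bottleneck and therefore a smaller adapter.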

## Preparing the dataset

In [13]:
# if dataset_name == "squad11pt":
    
#     # create dataset folder 
#     path_to_dataset = root/'data'/dataset_name
#     path_to_dataset.mkdir(parents=True, exist_ok=True) 

#     # Get dataset SQUAD in Portuguese
#     %cd {path_to_dataset}
#     !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt

#     # unzip 
#     !tar -xvf squad-pt.tar.gz

#     # Get the train and validation json file in the HF script format 
#     # inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad
    
#     import json 
#     files = ['squad-train-v1.1.json','squad-dev-v1.1.json']

#     for file in files:

#         # Opening JSON file & returns JSON object as a dictionary 
#         f = open(file, encoding="utf-8") 
#         data = json.load(f) 

#         # Iterating through the json list 
#         context_list = list()
#         id_list = list()

#         for row in data['data']: 

#             for paragraph in row['paragraphs']:
#                 context = (paragraph['context']).strip()
#                 context_list.append(context)

#         # Get unique context
#         unique_context_list = list(set(context_list))

#         # Closing file 
#         f.close() 

#         file_name = 'pt_' + str(file).replace('json','txt')
#         with open(file_name, 'wb') as list_file:
#             pickle.dump(unique_context_list, list_file)
         
#     %cd ../..
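
The extraction step in the commented cell above can be sketched on an in-memory example. The toy dictionary below only mimics the `data` / `paragraphs` / `context` nesting of the SQuAD files; the function name `unique_contexts` is hypothetical. Unlike the `list(set(...))` used above, this sketch uses `dict.fromkeys`, which removes duplicates while keeping the order of first appearance.

```python
def unique_contexts(squad_json):
    """Collect the stripped paragraph contexts of a SQuAD-style dict, without duplicates."""
    contexts = []
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            contexts.append(paragraph["context"].strip())
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(contexts))

# toy data with the same nesting as the SQuAD files (contents are made up)
toy = {
    "data": [
        {"paragraphs": [{"context": " First paragraph. "}, {"context": "Second paragraph."}]},
        {"paragraphs": [{"context": "First paragraph."}]},
    ]
}
print(unique_contexts(toy))
```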

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [14]:
# from datasets import load_dataset
# datasets = load_dataset("text", data_files={"train": path_to_train.txt, "validation": path_to_validation.txt})

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

In [15]:
if dataset_name == "squad11pt":
    
    path_to_data = root/'data'/dataset_name
    files = ['pt_squad-train-v1.1.txt','pt_squad-dev-v1.1.txt']
    
    for i,file in enumerate(files):
        path_to_file = path_to_data/file
        with open(path_to_file, "rb") as f:   # Unpickling
            text_list = pickle.load(f)

            with open(file, "w") as output:
                output.write(str(text_list))
        
        df = pd.DataFrame(text_list,columns=['text'])
        if i == 0:
            df_train = df.copy()
        else:
            df_validation = df.copy()
            
    from datasets import Dataset, DatasetDict
    dataset_train = Dataset.from_pandas(df_train)
    dataset_validation = Dataset.from_pandas(df_validation)

    datasets = DatasetDict()
    datasets['train'] = dataset_train
    datasets['validation'] = dataset_validation

To access an actual element, you need to select a split first, then give an index:

In [16]:
datasets["train"][10]

{'text': 'O panteísmo sustenta que Deus é o universo e o universo é Deus, enquanto o panenteísmo sustenta que Deus contém, mas não é idêntico ao universo. É também a visão da Igreja Católica Liberal; Teosofia; algumas visões do hinduísmo, exceto o vaisnavismo, que acredita no panenteísmo; Sikhismo; algumas divisões do neopaganismo e taoísmo, juntamente com muitas denominações e indivíduos variados dentro das denominações. A Cabala, Misticismo judaico, pinta uma visão panteísta / panenteísta de Deus - que tem ampla aceitação no judaísmo hassídico, particularmente de seu fundador The Baal Shem Tov - mas apenas como um complemento à visão judaica de um deus pessoal, não no panteísta original sensação que nega ou limita a persona a Deus. [citação necessário]'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [17]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [18]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"A Guerra Civil Americana pegou os dois lados despreparados. A Confederação esperava vencer levando a Grã-Bretanha e a França a intervir, ou acabando com a disposição do Norte de lutar. Os EUA buscaram uma rápida vitória focada na captura da capital confederada em Richmond, Virgínia. Os confederados de Robert E. Lee defenderam tenazmente sua capital até o fim. A guerra se espalhou pelo continente e até o alto mar. A maior parte do material e do pessoal do sul foram gastos, enquanto o norte prosperou."
1,"O padrão de radiação de uma antena é um gráfico da força relativa do campo das ondas de rádio emitidas pela antena em diferentes ângulos. Geralmente, é representado por um gráfico tridimensional ou gráficos polares das seções transversais horizontal e vertical. O padrão de uma antena isotrópica ideal, que irradia igualmente em todas as direções, pareceria uma esfera. Muitas antenas não direcionais, como monopólos e dipolos, emitem potência igual em todas as direções horizontais, com a queda de potência em ângulos mais altos e mais baixos; isso é chamado de padrão omnidirecional e, quando plotado, parece um toro ou rosquinha."
2,"Generais soviéticos com vasta experiência em combate da Segunda Guerra Mundial foram enviados à Coréia do Norte como o Grupo Consultivo Soviético. Esses generais completaram os planos para o ataque de Maio. Os planos originais pediam o início de uma escaramuça na península de Ongjin, na costa oeste da Coréia. Os norte-coreanos então lançariam um ""contra-ataque"" que capturaria Seul, cercaria e destruiria o exército sul-coreano. A etapa final envolveria a destruição de remanescentes do governo sul-coreano, capturando o restante da Coréia do Sul, incluindo os portos."
3,"Após o anúncio da morte de Nasser, o Egito e o mundo árabe estavam em estado de choque. A procissão fúnebre de Nasser no Cairo, em 1º de outubro, contou com a presença de pelo menos cinco milhões de pessoas. A procissão de 10 quilômetros até o local do enterro começou na antiga sede do RCC, com a passagem de jatos MiG-21. Seu caixão coberto de bandeira estava preso a uma carruagem puxada por seis cavalos e liderada por uma coluna de cavaleiros. Todos os chefes de estado árabes compareceram, com exceção de Rei saudita Faisal. O rei Hussein e Arafat choraram abertamente, e Muammar Gaddafi da Líbia desmaiou de angústia emocional duas vezes. Alguns dignitários não-árabes importantes estavam presentes, incluindo o primeiro-ministro soviético Alexei Kosygin e o primeiro-ministro francês Jacques Chaban-Delmas."
4,"A maior parte da população negra das Bermudas traça parte de seus ancestrais aos nativos americanos, embora a conscientização disso seja amplamente limitada aos ilhéus de St David e a maioria dos que têm esse ancestral não o conhece. Durante o período colonial, centenas de nativos americanos foram enviados para as Bermudas. Os exemplos mais conhecidos foram os povos algonquianos que foram exilados das colônias do sul da Nova Inglaterra e vendidos como escravos no século XVII, principalmente após as guerras de Pequot e do rei Filipe."
5,"O descarrilamento do metrô de Valência ocorreu em 3 de julho de 2006 às 13h. CEST (1100 UTC) entre as estações Jesús e Plaça d'Espanya na linha 1 do sistema de transporte coletivo de Metrovalencia. 43 pessoas foram mortas e mais de dez ficaram gravemente feridas. Não ficou claro imediatamente o que causou o acidente. Tanto o porta-voz do governo valenciano Vicente Rambla como a prefeita Rita Barberá consideraram o acidente um evento ""fortuito"". No entanto, o sindicato CC.OO. acusou as autoridades de ""apressar-se"" para dizer qualquer coisa, mas admitir que a Linha 1 está em um estado de ""deterioração constante"" com uma ""falha na execução da manutenção""."
6,"O City and Guilds College foi fundado em 1876, a partir de uma reunião de 16 empresas de libré da Cidade de Londres para o Avanço da Educação Técnica (CGLI), que visava melhorar o treinamento de artesãos, técnicos, tecnólogos e engenheiros. Os dois principais objetivos eram criar uma instituição central em Londres e conduzir um sistema de exames qualificados em disciplinas técnicas. Diante da contínua incapacidade de encontrar um local substancial, as Empresas acabaram sendo persuadidas pelo Secretário do Departamento de Ciência e Arte, General Sir Don Donlyly (que também era Engenheiro Real) a fundar sua instituição nos oitenta e sete acres (350.000 m²) em South Kensington comprados pelos Commissioners da Exposição de 1851 (por 342.500 libras) para 'fins de arte e ciência' em perpetuidade. As duas últimas faculdades foram incorporadas pela Royal Charter ao Imperial College of Science and Technology e o CGLI Central Technical College foi renomeado como City and Guilds College em 1907, mas não foi incorporado ao Imperial College até 1910."
7,"O clero católico romano e o protestante mais tradicional usam roupas verdes em celebrações litúrgicas durante o tempo comum. Na Igreja Católica Oriental, o verde é a cor do Pentecostes. O verde também é uma das cores do Natal, possivelmente dos tempos pré-cristãos, quando as sempre-vivas eram adoradas por sua capacidade de manter a cor durante o inverno. Os romanos usavam azevinho verde e sempre-verde como decoração para a celebração do solstício de inverno chamada Saturnalia, que acabou evoluindo para uma celebração de Natal. Especialmente na Irlanda e na Escócia, o verde é usado para representar os católicos, enquanto a laranja é usada para representar o protestantismo. Isso é mostrado na bandeira nacional da Irlanda."
8,"O atual ""Precentor"" (chefe de música) é Tim Johnson, e a escola possui oito órgãos e um edifício inteiro para a música (os espaços para apresentações incluem o School Hall, o Farrer Theatre e dois salões dedicados à música, o Parry Hall e o Concert Corredor). Muitos instrumentos são ensinados, incluindo instrumentos obscuros como o didgeridoo. A escola participa de muitas competições nacionais; muitos alunos fazem parte da Orquestra Nacional da Juventude, e a escola oferece bolsas de estudos para músicos dedicados e talentosos. Antigo Precentor da faculdade, Ralph Allwood montou e organizou os Cursos Eton Choral, que acontecem na escola todos os verões."
9,"O Relatório de Direitos Humanos de 2009 do Departamento de Estado dos Estados Unidos observou que os direitos humanos no CAR eram fracos e manifestou preocupação com inúmeros abusos do governo. O Departamento de Estado dos EUA alegou que os principais abusos dos direitos humanos, como execuções extrajudiciais pelas forças de segurança, tortura, espancamentos e estupro de suspeitos e prisioneiros, ocorreram com impunidade. Também alegou condições severas e ameaçadoras à vida em prisões e centros de detenção, prisão arbitrária, prisão preventiva prolongada e negação de um julgamento justo, restrições à liberdade de circulação, corrupção oficial e restrições aos direitos dos trabalhadores."


As we can see, each text is a full paragraph (a SQuAD context taken from a Wikipedia article).

## Masked language modeling

For masked language modeling (MLM), we tokenize the texts and group them into blocks, with one additional step: we randomly mask some tokens (by replacing them with `[MASK]`), and the labels are adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).

In [19]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [20]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

We then apply this tokenization function to all the splits of our dataset with `map` (the tokenizer is simply reloaded from the same checkpoint):

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [22]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [23]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead of dropping if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
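
To see what the grouping does, here is a self-contained toy version of the function, run on a tiny hand-made batch. The name `group_texts_toy`, the explicit `block_size` argument and the token ids are illustrative. The logic mirrors `group_texts` above, with a block size of 4 instead of 128.

```python
def group_texts_toy(examples, block_size):
    # Concatenate all token lists of the batch, key by key.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the remainder so that the length is a multiple of block_size.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])
```

The 9 tokens of the batch give two blocks of 4 tokens, and the last token is dropped.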

Note that we duplicate the inputs for our labels. For masked language modeling, the `DataCollatorForLanguageModeling` used below recomputes the labels for every batch: it keeps the original token id at the positions selected for masking and sets all other positions to `-100`, so that the loss ignores them. (The right-shifting of the labels done inside the 🤗 Transformers models only concerns causal language modeling.)

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, we drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will be processed more slowly). You can also speed up the preprocessing by using multiprocessing (`num_proc`).

We now group the texts together and chunk them into samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [24]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest follows the usual `Trainer` workflow, with two points specific to masked language modeling. First, we use a model suitable for masked LM:

In [25]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Lang adapter

In [26]:
# Setup adapters
if train_adapter:
        
    # new
    if madx2:
        # do not add adapter in the last transformer layers 
        leave_out = [len(model.bert.encoder.layer)-1]
    else:
        leave_out = []
        
    # new
    # task_name = data_args.dataset_name or "mlm"
    task_name = "mlm"
        
    # check if adapter already exists, otherwise add it
    if task_name not in model.config.adapters:
            
#             # resolve the adapter config
#             adapter_config = AdapterConfig.load(
#                 adapter_args.adapter_config,
#                 non_linearity=adapter_args.adapter_non_linearity,
#                 reduction_factor=adapter_args.adapter_reduction_factor,
#             )

        # new
        # resolve adapter config with (optionally) the MAD-X 2.0 option
        if adapter_config == "pfeiffer":
            from transformers.adapters.configuration import PfeifferConfig
            adapter_config = PfeifferConfig(non_linearity=adapter_non_linearity,
                                            reduction_factor=adapter_reduction_factor,
                                            leave_out=leave_out)           
        elif adapter_config == "pfeiffer+inv":
            from transformers.adapters.configuration import PfeifferInvConfig
            adapter_config = PfeifferInvConfig(non_linearity=adapter_non_linearity,
                                               reduction_factor=adapter_reduction_factor,
                                               leave_out=leave_out)          
        elif adapter_config == "houlsby":
            from transformers.adapters.configuration import HoulsbyConfig
            adapter_config = HoulsbyConfig(non_linearity=adapter_non_linearity,
                                           reduction_factor=adapter_reduction_factor,
                                           leave_out=leave_out)
        elif adapter_config == "houlsby+inv":
            from transformers.adapters.configuration import HoulsbyInvConfig
            adapter_config = HoulsbyInvConfig(non_linearity=adapter_non_linearity,
                                              reduction_factor=adapter_reduction_factor,
                                              leave_out=leave_out)              
            
        # load a pre-trained from Hub if specified
        if load_adapter:
            model.load_adapter(
                    load_adapter,
                    config=adapter_config,
                    load_as=task_name,
                    with_head = False
                )
        # otherwise, add a fresh adapter
        else:
            model.add_adapter(task_name, config=adapter_config)
                
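    # optionally load another pre-trained language adapter
    # (this branch needs lang_adapter_config, lang_adapter_non_linearity and
    # lang_adapter_reduction_factor to be defined beforehand; they are not set in this notebook)
    if load_lang_adapter:
        from transformers.adapters.configuration import AdapterConfig
        # resolve the language adapter config
        lang_adapter_config = AdapterConfig.load(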
                lang_adapter_config,
                non_linearity=lang_adapter_non_linearity,
                reduction_factor=lang_adapter_reduction_factor,
                leave_out=leave_out,
            )
        # load the language adapter from Hub
        lang_adapter_name = model.load_adapter(
                load_lang_adapter,
                config=lang_adapter_config,
                load_as=language,
                with_head = False
            )
    else:
        lang_adapter_name = None
    # Freeze all model weights except of those of this adapter
    model.train_adapter([task_name])
    # Set the adapters to be used in every forward pass
    if lang_adapter_name:
        model.set_active_adapters([lang_adapter_name, task_name])
    else:
        model.set_active_adapters([task_name])
else:
    if load_adapter or load_lang_adapter:
        raise ValueError(
                "Adapters can only be loaded in adapters training mode. "
                "Set train_adapter = True to enable adapter training"
            )

In [27]:
model

BertForMaskedLM(
  (bert): BertModel(
    (invertible_adapters): ModuleDict(
      (mlm): NICECouplingBlock(
        (F): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
        (G): Sequential(
          (0): Linear(in_features=384, out_features=192, bias=True)
          (1): Activation_Function_Class()
          (2): Linear(in_features=192, out_features=384, bias=True)
        )
      )
    )
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            

## Training

In [28]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    do_train=do_train,
    do_eval=do_eval,
    evaluation_strategy=evaluation_strategy,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    adam_beta1=adam_beta1,
    adam_beta2=adam_beta2,
    adam_epsilon=adam_epsilon,
    num_train_epochs=num_train_epochs,
    lr_scheduler_type=lr_scheduler_type,
    warmup_ratio=warmup_ratio,
    warmup_steps=warmup_steps,
    logging_dir=logging_dir,         # directory for storing logs
    logging_strategy=evaluation_strategy,
    logging_steps=logging_steps,     # if strategy = "steps"
    save_strategy=evaluation_strategy,          # model checkpoint saving strategy
    save_steps=logging_steps,        # if strategy = "steps"
    save_total_limit=save_total_limit,
    fp16=fp16,
    eval_steps=logging_steps,        # if strategy = "steps"
    load_best_model_at_end=load_best_model_at_end,
    metric_for_best_model=metric_for_best_model,
    greater_is_better=greater_is_better,
    )

if ds:
    training_args.deepspeed = ds_config

Second, we use a special `data_collator`. The `data_collator` is a function responsible for taking the samples and batching them into tensors. The default collator would only batch the samples, but here we also want random masking. We could do it as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure the random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [29]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
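
The masking rule can be sketched in pure Python. The sketch below is not the library code: it assumes the standard BERT recipe, where each token is selected with probability `mlm_probability` and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens_sketch`, the `mask_id` and the `vocab_size` values are hypothetical.

```python
import random

def mask_tokens_sketch(input_ids, mask_id, vocab_size, special_ids=(),
                       mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels); labels are -100 where no prediction is needed."""
    rng = rng or random.Random()
    masked, labels = list(input_ids), []
    for pos, token in enumerate(input_ids):
        # special tokens are never selected; other tokens are selected with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)
            continue
        labels.append(token)                        # the model must predict the original token
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = mask_id                   # 80%: replace with [MASK]
        elif draw < 0.9:
            masked[pos] = rng.randrange(vocab_size) # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

token_ids = list(range(10, 1010))                   # made-up token ids
masked_ids, labels = mask_tokens_sketch(token_ids, mask_id=4, vocab_size=2000,
                                        rng=random.Random(0))
selected = sum(label != -100 for label in labels)
print(f"{selected} tokens selected out of {len(token_ids)}")
```

Because the selection is drawn again at each call, the same sample is masked differently from one epoch to the next.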

Let's define a `compute_metrics` function (accuracy). Even though it is generally better to evaluate a model against a metric, we do not use it to select the best model during training because it can cause a CUDA out-of-memory error. Instead, we use the validation loss (a common procedure when fine-tuning an MLM on a new dataset). At the end of training, we use `compute_metrics` to get the accuracy of our model.

In [30]:
# metric accuracy
from datasets import load_metric
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    indices = [[i for i, x in enumerate(labels[row]) if x != -100] for row in range(len(labels))]

    labels = [labels[row][indices[row]] for row in range(len(labels))]
    temp = list()
    for item in labels:
        temp += item.tolist()
    labels = temp

    predictions = [predictions[row][indices[row]] for row in range(len(predictions))]
    temp = list()
    for item in predictions:
        temp += item.tolist()
    predictions = temp
    
    results = metric.compute(predictions=predictions, references=labels)
    results["eval_accuracy"] = results["accuracy"]
    results.pop("accuracy")

    return results
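
The core of this metric is that only the positions whose label differs from `-100` (the masked tokens) are counted. The same idea in a dependency-free sketch, with a hypothetical function name `masked_accuracy` and made-up predictions:

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Accuracy over the positions whose label is not ignore_index."""
    correct = total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue                 # non-masked token: not evaluated
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0

# two sequences of three tokens; -100 marks the non-masked positions
toy_predictions = [[5, 7, 9], [1, 2, 3]]
toy_labels = [[-100, 7, 8], [1, -100, -100]]
print(masked_accuracy(toy_predictions, toy_labels))
```

Here three positions are evaluated and two of them are predicted correctly.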

Then we just have to pass everything to `Trainer` and begin training:

In [31]:
from transformers.trainer_callback import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"], # .shard(index=1, num_shards=90), to be used to reduce train to 1/90
    eval_dataset=lm_datasets["validation"], #.shard(index=1, num_shards=90), to be used to reduce validation to 1/90
    tokenizer=tokenizer,
    data_collator=data_collator,
#     compute_metrics=compute_metrics,
    do_save_full_model=not train_adapter, 
    do_save_adapters=train_adapter,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)],
    )    

In [None]:
trainer.args._n_gpu = n_gpu # train on one GPU
trainer.train()

In [None]:
# add the metric accuracy
trainer.compute_metrics=compute_metrics

# calculation of the performance on the validation set
eval_results = trainer.evaluate()

In [None]:
import math
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
print(f"Accuracy: {eval_results['eval_accuracy']:.2f}")
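
The perplexity printed above is the exponential of the mean cross-entropy loss (here computed on the masked tokens only). A short worked example, with a hypothetical helper name `perplexity`:

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean cross-entropy), with the loss in nats."""
    return math.exp(mean_cross_entropy)

# a model that gives probability 1/10 to every correct token has a loss of ln(10)
print(f"Perplexity: {perplexity(math.log(10)):.2f}")
# a perfect model has a loss of 0
print(f"Perplexity: {perplexity(0.0):.2f}")
```

A perplexity of 10 can be read as the model hesitating, on average, between 10 equally likely tokens for each masked position.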

In [None]:
# save adapter + head
adapters_folder = 'adapters-' + task_name
path_to_save_adapter = path_to_outputs/adapters_folder
trainer.model.save_adapter(str(path_to_save_adapter), adapter_name=task_name, with_head=True)

!ls -lh {path_to_save_adapter}

Now, you can push the saved adapter + head to the [AdapterHub](https://adapterhub.ml/) (follow instructions at [Contributing to Adapter Hub](https://docs.adapterhub.ml/contributing.html)).

## TensorBoard

In [None]:
#!pip install tensorboard

In [None]:
import os
PATH = os.getenv('PATH')
# replace xxxx by your username on your server (ex: paulo)
# replace yyyy by the name of the virtual environment of this notebook (ex: adapter-transformers)
%env PATH=/mnt/home/xxxx/anaconda3/envs/yyyy/bin:$PATH

In [None]:
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir {logging_dir} --bind_all

# END