<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Language Modeling
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Encoder pretraining using Masked Language Modeling task
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Dataset](#data) <br>
2. [ALBERT-small training](#albert) <br>
3. [Inference](#inference) <br>



#### References

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Huggingface [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- Albert [original training protocol](https://github.com/google-research/albert)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset,  
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)
from transformers import AlbertConfig, AutoConfig, DataCollatorForLanguageModeling

# DL
import torch
from gensim.models import Word2Vec
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModel,
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [3]:
transformers.__version__

'4.22.2'

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [6]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials ICTRP')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [7]:
sys.path.insert(0, path_to_src)

#### Constants

In [8]:
dataset_name = 'clinical-trials-ictrp'
final_dataset_name = 'clinical-trials-ictrp-tokenized-blocks'
base_model_name = "albert-base-v2"
final_model_name = "albert-small-ictrp"

<a id="data"></a>

# 1. Dataset

[Table of content](#TOC)

We generate a collection of instances of the `datasets.Dataset` class. 

Note that these are different from the fairly generic `torch.utils.data.Dataset` class. 

## 1.1 Load Clinical Trials corpus

[Table of content](#TOC)

In [9]:
with open(os.path.join(path_to_data, '{}.txt'.format(dataset_name)), 'r', encoding = 'utf-8') as f:
    texts = [t.strip() for t in f.readlines()]

In [10]:
dataset = Dataset.from_dict({'text': texts}, features = Features({'text': Value(dtype = 'string')}))

In [11]:
len(dataset)

1081670

In [12]:
dataset[:3]

{'text': ["Capable of giving signed informed consent. Has received, been intolerant to, or been ineligible for all treatment options proven, to confer clinical benefit. Measurable disease per Response Evaluation Criteria in Solid Tumors (RECIST) 1.1. Eastern Cooperative Oncology Group (ECOG) Performance status (PS) of 0 or 1. Adequate organ function. Male individuals and female individuals of childbearing potential who engage in, heterosexual intercourse must agree to use methods of contraception. Female participants are eligible if they are not pregnant, not breastfeeding or not a, Woman of childbearing potential (WOCBP). Inclusion criterion for the dose-escalation: Individuals with histologically or, cytologically confirmed, advanced or metastatic solid tumors. Inclusion criterion for disease-specific combination expansion: Individuals with, histologically or cytologically confirmed Triple-negative breast cancer (TNBC), Non-small cell lung cancer (NSCLC), Head and neck squamous cell 

## 1.2 Load Clinical-Albert-small tokenizer

[Table of content](#TOC)

In [9]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

## 1.3 Tokenize corpus

[Table of content](#TOC)

In [16]:
# We pass `return_special_tokens_mask = True` because DataCollatorForLanguageModeling (see below)
# is more efficient when it receives the `special_tokens_mask`.
def tokenize_text(examples, tokenizer):
    # Remove empty lines
    examples['text'] = [
        t for t in examples['text'] if len(t) > 0 and not t.isspace()
    ]
    return tokenizer(examples["text"], return_special_tokens_mask = True)

In [17]:
tokenized_dataset = dataset.map(
    lambda examples: tokenize_text(examples, tokenizer), 
    batched = True, 
    remove_columns = ["text"],
)

100%|██████████████████████████████████████████████████████████████████████████████| 1082/1082 [04:52<00:00,  3.70ba/s]


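Unlike the raw text corpus, the tokenized data depends on the tokenizer, and is therefore _model-specific_.

_Note_: the argument `remove_columns = ["text"]` is mandatory here. Since `tokenize_text` drops empty lines, a batch can return fewer rows than it received, and keeping the original `text` column would leave the columns of the dataset with mismatched lengths.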

In [18]:
len(tokenized_dataset[0]['input_ids'])

740

In [19]:
tokenized_dataset[0]

{'input_ids': [2,
  857,
  8,
  1377,
  343,
  90,
  78,
  7,
  72,
  166,
  6,
  172,
  2225,
  13,
  6,
  10,
  172,
  1311,
  20,
  159,
  56,
  3610,
  1085,
  6,
  13,
  8934,
  103,
  1432,
  7,
  483,
  42,
  196,
  515,
  473,
  96,
  21,
  881,
  586,
  5,
  12,
  2191,
  11,
  5,
  18,
  7,
  18,
  7,
  826,
  734,
  702,
  337,
  5,
  12,
  682,
  11,
  335,
  203,
  5,
  12,
  84,
  15,
  11,
  8,
  5,
  92,
  10,
  5,
  18,
  7,
  261,
  423,
  176,
  7,
  189,
  621,
  17,
  137,
  621,
  8,
  242,
  139,
  54,
  2721,
  21,
  6,
  1784,
  1257,
  69,
  309,
  13,
  82,
  402,
  8,
  188,
  7,
  137,
  209,
  59,
  222,
  5,
  27,
  39,
  348,
  59,
  60,
  123,
  6,
  60,
  482,
  10,
  60,
  5,
  14,
  6,
  738,
  8,
  242,
  139,
  5,
  12,
  1961,
  11,
  7,
  215,
  966,
  20,
  9,
  118,
  16,
  10937,
  38,
  621,
  19,
  531,
  10,
  6,
  1129,
  213,
  6,
  524,
  10,
  399,
  881,
  586,
  7,
  215,
  966,
  20,
  42,
  16,
  1049,
  737,
  3269,
  38,
  621,
  

## 1.4 Form blocks of constant length

[Table of content](#TOC)


In [20]:
def group_texts(examples, block_size):
    # Concatenate all texts, column by column (the raw 'text' column, if present, is skipped).
    keys = [k for k in examples.keys() if k != 'text']
    concatenated_examples = {k: list(chain(*examples[k])) for k in keys}
    total_length = len(concatenated_examples[keys[0]])
    
    # Drop the small remainder so that all blocks have the same length.
    # Padding could be used instead if the model supports it; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Labels start as a copy of the inputs; the data collator later derives the masked inputs and MLM labels.
    result["labels"] = result["input_ids"].copy()
    return result
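
To make the grouping concrete, here is a minimal, self-contained sketch on toy token ids. The helper `group_into_blocks` is a hypothetical stand-in that mirrors the logic of `group_texts` for a single column; the ids and the block size of 4 are made up for illustration.

```python
from itertools import chain

def group_into_blocks(sequences, block_size):
    """Concatenate token sequences and cut the result into fixed-size blocks."""
    flat = list(chain(*sequences))
    total_length = len(flat)
    # Drop the trailing remainder so that every block has exactly block_size tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, total_length, block_size)]

# Three toy "documents" of unequal length (10 tokens in total).
toy_input_ids = [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32]]
blocks = group_into_blocks(toy_input_ids, block_size=4)
print(blocks)  # [[2, 11, 12, 3], [2, 21, 3, 2]]
```

Note that blocks may straddle document boundaries: the second block ends with the first token of the third document, and the last two tokens are dropped. This is the usual trade-off of this packing strategy.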

In [21]:
block_size = 512

In [22]:
# run only once
# mlm_dataset = tokenized_dataset.map(lambda examples: group_texts(examples, block_size), batched = True)
# mlm_dataset.save_to_disk(os.path.join(path_to_data, final_dataset_name))

100%|██████████████████████████████████████████████████████████████████████████████| 1082/1082 [09:32<00:00,  1.89ba/s]


In [10]:
mlm_dataset = load_from_disk(os.path.join(path_to_data, final_dataset_name))

In [11]:
len(mlm_dataset)

595255

In [12]:
print(mlm_dataset[0])

{'input_ids': [2, 857, 8, 1377, 343, 90, 78, 7, 72, 166, 6, 172, 2225, 13, 6, 10, 172, 1311, 20, 159, 56, 3610, 1085, 6, 13, 8934, 103, 1432, 7, 483, 42, 196, 515, 473, 96, 21, 881, 586, 5, 12, 2191, 11, 5, 18, 7, 18, 7, 826, 734, 702, 337, 5, 12, 682, 11, 335, 203, 5, 12, 84, 15, 11, 8, 5, 92, 10, 5, 18, 7, 261, 423, 176, 7, 189, 621, 17, 137, 621, 8, 242, 139, 54, 2721, 21, 6, 1784, 1257, 69, 309, 13, 82, 402, 8, 188, 7, 137, 209, 59, 222, 5, 27, 39, 348, 59, 60, 123, 6, 60, 482, 10, 60, 5, 14, 6, 738, 8, 242, 139, 5, 12, 1961, 11, 7, 215, 966, 20, 9, 118, 16, 10937, 38, 621, 19, 531, 10, 6, 1129, 213, 6, 524, 10, 399, 881, 586, 7, 215, 966, 20, 42, 16, 1049, 737, 3269, 38, 621, 19, 6, 531, 10, 1129, 213, 3825, 16, 1761, 249, 114, 5, 12, 11975, 11, 6, 122, 16, 1998, 177, 380, 114, 5, 12, 4487, 11, 6, 764, 17, 900, 601, 177, 224, 5, 12, 12998, 11, 6, 10, 6, 1335, 7, 215, 966, 20, 9, 2168, 16, 66, 30, 14, 3269, 38, 621, 19, 531, 6, 10, 1129, 213, 1226, 7, 691, 17, 112, 5, 39, 22, 14, 1

In [13]:
print(tokenizer.decode(mlm_dataset[0]["input_ids"]), tokenizer.decode(mlm_dataset[0]["labels"]))

[CLS] capable of giving signed informed consent. has received, been intolerant to, or been ineligible for all treatment options proven, to confer clinical benefit. measurable disease per response evaluation criteria in solid tumors (recist) 1.1. eastern cooperative oncology group (ecog) performance status (ps) of 0 or 1. adequate organ function. male individuals and female individuals of childbearing potential who engage in, heterosexual intercourse must agree to use methods of contraception. female participants are eligible if they are not pregnant, not breastfeeding or not a, woman of childbearing potential (wocbp). inclusion criterion for the dose-escalation: individuals with histologically or, cytologically confirmed, advanced or metastatic solid tumors. inclusion criterion for disease-specific combination expansion: individuals with, histologically or cytologically confirmed triple-negative breast cancer (tnbc), non-small cell lung cancer (nsclc), head and neck squamous cell carci

<a id="albert"></a>

# 2. ALBERT-small training

[Table of content](#TOC)

#### Different training strategies

1. Fully randomly initialized model
2. Model with pre-trained token embeddings for both encoder and LM head, with weight tying
3. Model with frozen pre-trained token embeddings for both encoder and LM head, with weight tying
4. Model with frozen pre-trained token embeddings for the encoder, and an untied, fully learnable LM head (recommended lr = 5e-4)

## 2.1 Build Clinical-Albert-small model

[Table of content](#TOC)

In [14]:
# original Albert config
config = AutoConfig.from_pretrained(base_model_name)
config

AlbertConfig {
  "_name_or_path": "albert-base-v2",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

In [57]:
# a smaller Albert config
config = AlbertConfig(
    attention_probs_dropout_prob = 0,
    pad_token_id = 0,
    bos_token_id = 2,
    eos_token_id = 3,
    classifier_dropout_prob = 0.1,
    down_scale_factor = 1,
    embedding_size = 128,
    gap_size = 0,
    hidden_act = 'gelu_new',
    hidden_dropout_prob = 0,
    hidden_size = 512, # 768,
    initializer_range = 0.02,
    inner_group_num = 1,
    intermediate_size = 2048, # 3072,
    layer_norm_eps = 1e-12,
    max_position_embeddings = 512,
    model_type = 'albert',
    net_structure_type = 0,
    num_attention_heads = 8, # 12
    num_hidden_groups = 1,
    num_hidden_layers = 8, # 12
    num_memory_blocks = 0,
    position_embedding_type = 'absolute',
    transformers_version = '4.22.2',
    type_vocab_size = 2,
    vocab_size = 15000, # 30000,
    tie_word_embeddings = False, # True,
)
model = AutoModelForMaskedLM.from_config(config)

In [58]:
model.num_parameters()

7205400
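
The reported figure can be cross-checked by hand. The sketch below is a back-of-the-envelope count, assuming the usual ALBERT layout: factorized embeddings, a single transformer layer shared across all 8 passes, no pooler in the masked-LM model, and an untied decoder whose bias is shared with the head bias. These layout assumptions are not stated in the notebook, but the total matches the `7205400` reported above.

```python
# Hyperparameters taken from the small config above.
vocab_size, embedding_size, hidden_size = 15000, 128, 512
intermediate_size, max_positions, type_vocab = 2048, 512, 2

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * n

embeddings = (
    vocab_size * embedding_size        # token embeddings
    + max_positions * embedding_size   # position embeddings
    + type_vocab * embedding_size      # token type embeddings
    + layer_norm(embedding_size)
)
projection = linear(embedding_size, hidden_size)  # embedding -> hidden mapping

# One transformer layer, counted once because ALBERT shares it across depth.
shared_layer = (
    4 * linear(hidden_size, hidden_size)     # query, key, value, attention output
    + layer_norm(hidden_size)
    + linear(hidden_size, intermediate_size)
    + linear(intermediate_size, hidden_size)
    + layer_norm(hidden_size)
)

# Masked-LM head with an untied decoder (tie_word_embeddings = False).
mlm_head = (
    linear(hidden_size, embedding_size)
    + layer_norm(embedding_size)
    + linear(embedding_size, vocab_size)     # decoder weight + bias
)

total = embeddings + projection + shared_layer + mlm_head
print(total)  # 7205400
```

Because the transformer layer is shared, changing `num_hidden_layers` would leave this count unchanged under the same assumptions.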

In [59]:
model = model.to(device)

## 2.2 Load pre-trained token embedding matrix

[Table of content](#TOC)

We initialize the word embeddings with a pre-trained table.<br>
As a by-product, word embedding weights are no longer shared between the encoder and the LM head, in line with more recent models such as [T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1).

In [60]:
print(model.albert.embeddings.word_embeddings._parameters['weight'][1][:10])
print(model.predictions.decoder._parameters['weight'][1][:10])

tensor([ 0.0043,  0.0418, -0.0339,  0.0004, -0.0023, -0.0130, -0.0010,  0.0021,
        -0.0221, -0.0111], device='cuda:0', grad_fn=<SliceBackward0>)
tensor([-0.0141, -0.0102, -0.0209, -0.0094, -0.0127,  0.0317,  0.0097,  0.0114,
         0.0061, -0.0220], device='cuda:0', grad_fn=<SliceBackward0>)


In [61]:
wv = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'w2v', 'sgram')).wv

In [62]:
# reindex rows in embedding table
base_id2w = {v: k for k, v in tokenizer.get_vocab().items()}
reindexing = [wv.key_to_index[base_id2w[i]] for i in range(len(base_id2w))]

In [63]:
reindexing[:5]

[14995, 14997, 14998, 14999, 14996]

In [64]:
# for i, reind_i in enumerate(reindexing[:5]):
#     print(reind_i)
#     print(wv.index_to_key[reind_i])
#     print(base_id2w[i])

In [65]:
token_embeddings = wv.vectors[reindexing] # wv.get_normed_vectors()[reindexing]
token_embeddings.shape

(15000, 128)
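
The reindexing step can be illustrated on a toy vocabulary. This is a minimal sketch with made-up tokens and 2-dimensional vectors; `tokenizer_vocab`, `w2v_key_to_index`, and `w2v_vectors` are hypothetical stand-ins for `tokenizer.get_vocab()`, `wv.key_to_index`, and `wv.vectors`.

```python
# Token -> id in the tokenizer (the row order the model expects).
tokenizer_vocab = {"<pad>": 0, "[CLS]": 1, "trial": 2, "patient": 3}

# Token -> row in the Word2Vec table (a different order than the tokenizer's).
w2v_key_to_index = {"patient": 0, "trial": 1, "<pad>": 2, "[CLS]": 3}
w2v_vectors = [
    [0.9, 0.1],   # patient
    [0.2, 0.8],   # trial
    [0.0, 0.0],   # <pad>
    [0.1, 0.1],   # [CLS]
]

# For each tokenizer id, look up the Word2Vec row of the same token.
id_to_token = {i: tok for tok, i in tokenizer_vocab.items()}
reindexing = [w2v_key_to_index[id_to_token[i]] for i in range(len(id_to_token))]

# Row i of the reordered table is now the vector of the token with tokenizer id i.
token_embeddings = [w2v_vectors[row] for row in reindexing]

print(reindexing)                                   # [2, 3, 1, 0]
print(token_embeddings[tokenizer_vocab["trial"]])   # [0.2, 0.8]
```

The lookup raises a `KeyError` if a tokenizer token is missing from the Word2Vec vocabulary, so both vocabularies need to cover the same tokens.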

In [66]:
model.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.0,
 0.2151859700679779,
 0.22248247265815735,
 0.2173531949520111,
 0.213483527302742,
 0.21391838788986206,
 0.22306038439273834,
 0.22632883489131927,
 0.24100326001644135,
 0.19945749640464783,
 0.21178671717643738,
 0.22909106314182281,
 0.22435908019542694,
 0.22845502197742462,
 0.21465665102005005,
 0.21138404309749603,
 0.2195754200220108,
 0.2280760258436203,
 0.23290695250034332,
 0.22167114913463593,
 0.20815923810005188,
 0.25412580370903015,
 0.20920330286026,
 0.20746605098247528,
 0.21184080839157104,
 0.22953960299491882,
 0.24783912301063538,
 0.2605670392513275,
 0.25160589814186096,
 0.24019977450370789,
 0.2148076742887497,
 0.22413262724876404,
 0.2092655599117279,
 0.20790372788906097,
 0.23224221169948578,
 0.20780456066131592,
 0.2235240638256073,
 0.21616925299167633,
 0.242834210395813,
 0.23521959781646729,
 0.21442723274230957,
 0.23593056201934814,
 0.2229703962802887,
 0.22761991620063782,
 0.22058455646038055,
 0.2191784828901291,
 0.2490020990371704,


In [67]:
np.linalg.norm(token_embeddings, axis = -1).tolist()

[0.05029847100377083,
 0.04985176399350166,
 0.04812698811292648,
 0.05405382812023163,
 0.05249904468655586,
 1.7227306365966797,
 1.8393648862838745,
 1.9135942459106445,
 2.096674919128418,
 1.9929710626602173,
 1.8200165033340454,
 1.8425917625427246,
 2.1484005451202393,
 2.278388261795044,
 1.9871857166290283,
 2.1129376888275146,
 2.4063568115234375,
 1.9516831636428833,
 2.530545473098755,
 2.0679008960723877,
 2.5778214931488037,
 2.2181642055511475,
 2.1268270015716553,
 2.284655809402466,
 2.182380199432373,
 2.623713254928589,
 2.6132137775421143,
 2.0911061763763428,
 2.2124714851379395,
 1.9270639419555664,
 2.1363606452941895,
 2.244401216506958,
 2.4619646072387695,
 2.2495038509368896,
 2.8109073638916016,
 2.169473886489868,
 2.06711483001709,
 2.4892654418945312,
 2.5276334285736084,
 3.243286609649658,
 3.1745095252990723,
 2.8597164154052734,
 2.6179184913635254,
 2.607361078262329,
 2.1849875450134277,
 2.2061665058135986,
 1.8428137302398682,
 2.382831573486328,


In [68]:
# it is tempting to try to resize vectors so that they have norm at least one,
# but it is actually very harmful, so don't do this

# resizing = np.maximum(1, (1/np.linalg.norm(token_embeddings, axis = -1))).reshape(-1, 1).repeat(128, axis = 1)
# token_embeddings = token_embeddings * resizing
# np.linalg.norm(token_embeddings, axis = -1).tolist()

In [70]:
# remark: if weights were tied, the model ends up in the following state:
# - weights are now untied
# - encoder weights are overwritten by the new weights
# - decoder weights are left as they were initially (i.e. not affected by loading pretrained weights)
model.albert.embeddings.word_embeddings = model.albert.embeddings.word_embeddings.from_pretrained(
    torch.tensor(token_embeddings), 
    padding_idx = tokenizer.pad_token_id, # id of the padding token, not the token *type* id used for padding
    freeze = True, # default is True
)

In [71]:
# if there was a weight tying with LM head, then loading pretrained word embeddings broke it
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

# only encoder weights were affected, decoder weights are left as is
print(model.albert.embeddings.word_embeddings._parameters['weight'][1][:10])
print(model.predictions.decoder._parameters['weight'][1][:10])

7205400 5285400
tensor([-0.0031,  0.0003,  0.0060,  0.0042,  0.0035, -0.0001, -0.0077, -0.0075,
         0.0052, -0.0036])
tensor([-0.0141, -0.0102, -0.0209, -0.0094, -0.0127,  0.0317,  0.0097,  0.0114,
         0.0061, -0.0220], device='cuda:0', grad_fn=<SliceBackward0>)


In [72]:
# # if we want to freeze "manually" the word embedding layer
# for param in model.albert.embeddings.word_embeddings.parameters():
#     param.requires_grad = False
    
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

# decoder word embeddings are frozen only when encoder is frozen and weights are tied (except bias vector)
for param in model.predictions.decoder.parameters():
    print(param.requires_grad)

7205400 5285400
True
True
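
The gap between the two counts printed above is exactly the size of the frozen token embedding table, which a quick check confirms (values taken from the config and from the output above):

```python
vocab_size, embedding_size = 15000, 128
total_params = 7205400

# Parameters of the frozen token embedding table.
frozen_embedding_params = vocab_size * embedding_size
trainable_params = total_params - frozen_embedding_params

print(frozen_embedding_params, trainable_params)  # 1920000 5285400
```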


## 2.3 Model training

[Table of content](#TOC)

`albert-base-v2` training parameters, as provided in https://github.com/google-research/albert/blob/master/run_pretraining.py : 
- max_predictions_per_seq = `20`
- train_batch_size = `4096`
- optimizer = `"lamb"`
- learning_rate = `0.00176`
- poly_power = `1.0`
- num_train_steps = `125000`
- num_warmup_steps = `3125`
- start_warmup_step = `0`
- iterations_per_loop = `1000`

The original optimizer is `lamb`, which was designed for very large batch sizes (see the [Lamb paper](https://arxiv.org/pdf/1904.00962.pdf)). Here we use instead the default [AdamW](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.AdamW) optimizer (see the [AdamW paper](https://arxiv.org/pdf/1711.05101.pdf)) with [linear learning rate decay](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup), as specified in the [Trainer class documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.optimizers).
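
For intuition, the linear schedule with warmup can be written in a few lines. This is a sketch of the shape of the schedule rather than the library code; the helper `linear_schedule_lr` is hypothetical, and the step counts assume the values used in this section (595255 blocks, batch size 16, 3 epochs, 1000 warmup steps, peak learning rate 5e-4) on a single device.

```python
import math

def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = math.ceil(595255 / 16)
total_steps = 3 * steps_per_epoch

for step in (0, 500, 1000, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, peak_lr=5e-4, warmup_steps=1000, total_steps=total_steps)
    print(step, f"{lr:.2e}")
```

The learning rate rises linearly to its peak during the first 1000 steps and then decreases linearly until it reaches zero at the last step.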

In [73]:
model = model.train()

In [74]:
batch_size = 16 # <= 16 for block_size = 512 and GPU RAM = 8GB

args = TrainingArguments(
    os.path.join(path_to_save, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 5e-4,
    num_train_epochs = 3,
    lr_scheduler_type = 'linear', # 'constant_with_warmup',
    warmup_steps = 1000,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [75]:
trainer = Trainer(
    model,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

Some remarks:

- The `data_collator` is the object used to batch elements of the training & evaluation datasets. Here it also applies the random masking, selecting 15% of the tokens as prediction targets (`mlm_probability = 0.15`).
- When a `tokenizer` is passed to the `Trainer` (as in section 2.4), it is used to automatically pad the inputs to the maximum length when batching, and it is saved along with the model, which makes it easier to rerun an interrupted training or reuse the trained model.
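
The masking performed by the collator can be sketched in plain Python. This is a simplified illustration of the standard BERT-style recipe (select about 15% of the non-special tokens; replace 80% of them with the mask token, 10% with a random token, and leave 10% unchanged), not the library implementation. The function `mask_tokens`, the toy ids, and `mask_token_id = 4` are assumptions made for the example.

```python
import random

def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, seed=0):
    """Return (masked_input_ids, labels) for one sequence.

    Labels are -100 (ignored by the loss) everywhere except at selected positions,
    where they hold the original token id.
    """
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for pos, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        if is_special or rng.random() >= mlm_probability:
            continue                      # position not selected for prediction
        labels[pos] = token
        draw = rng.random()
        if draw < 0.8:
            masked[pos] = mask_token_id   # 80%: replace with the mask token
        elif draw < 0.9:
            masked[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels

# Toy sequence: [CLS] id 2, thirty ordinary tokens, [SEP] id 3.
input_ids = [2] + list(range(100, 130)) + [3]
special_tokens_mask = [1] + [0] * 30 + [1]
masked, labels = mask_tokens(input_ids, special_tokens_mask, mask_token_id=4, vocab_size=15000)
print(masked)
print(labels)
```

Special tokens are never selected, which is why the `special_tokens_mask` computed at tokenization time is useful to the collator.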

In [76]:
torch.cuda.empty_cache()

In [None]:
# pretrained token embedding, frozen, untied, lr = 5e-4
# pretrained token embedding, tied, lr = 5e-4
# pretrained token embedding, frozen, tied, lr = 5e-4
trainer.train()

In [None]:
model = model.to('cpu')

In [None]:
model.save_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

## 2.4 Model training with unfrozen token embedding table

In [14]:
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))
model = model.to(device).train()

In [15]:
# reloading the saved model has unfrozen the word embeddings
# print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

# make all parameters trainable explicitly if needed
# for param in model.parameters():
#     param.requires_grad = True
    
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

7205400 7205400


In [16]:
batch_size = 16 # <= 16 for block_size = 512 and GPU RAM = 8GB

args = TrainingArguments(
    os.path.join(path_to_save, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 5e-5,
    num_train_epochs = 1,
    lr_scheduler_type = 'linear',
    warmup_steps = 0,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)

In [17]:
trainer = Trainer(
    model,
    args,
    tokenizer = tokenizer,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

In [18]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 595255
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 37204
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,1.0816
200,1.0564
300,1.0631
400,1.06
500,1.0724
600,1.0724
700,1.0514
800,1.0651
900,1.0804
1000,1.0724




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=37204, training_loss=1.044597110419155, metrics={'train_runtime': 12584.6073, 'train_samples_per_second': 47.3, 'train_steps_per_second': 2.956, 'total_flos': 9544697118842880.0, 'train_loss': 1.044597110419155, 'epoch': 1.0})
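
The figures in this log can be sanity-checked with a little arithmetic. The snippet below recomputes the number of optimization steps and the throughput from the values printed above, and converts the mean training loss into a perplexity over masked tokens (assuming the loss is the average cross-entropy in nats, which is the usual convention).

```python
import math

num_examples = 595255
batch_size = 16
train_runtime = 12584.6073      # seconds
training_loss = 1.044597110419155

# The last, incomplete batch still counts as one step.
steps = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / train_runtime
steps_per_second = steps / train_runtime
masked_token_perplexity = math.exp(training_loss)

print(steps)
print(round(samples_per_second, 1), round(steps_per_second, 3))
print(round(masked_token_perplexity, 2))
```

Since the loss is only computed on masked positions, this perplexity is not comparable to the perplexity of a left-to-right language model.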

In [19]:
model = model.to('cpu')

In [20]:
trainer.save_model(os.path.join(path_to_save, final_model_name, 'model'))

Saving model checkpoint to C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp\model
Configuration saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp\model\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp\model\pytorch_model.bin
tokenizer config file saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp\model\tokenizer_config.json
Special tokens file saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp\model\special_tokens_map.json


<a id="inference"></a>

# 3. Inference

[Table of content](#TOC)

In [12]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

In [21]:
mlm = pipeline(
    task = 'fill-mask', 
    model = model, 
    tokenizer = tokenizer,
    framework = 'pt',
)

In [22]:
sent = 'Polyneuropathy of other causes, including but not limited to hereditary demyelinating neuropathies, neuropathies secondary to infection or systemic disease, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor neuropathy, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
sent = f'Polyneuropathy of other causes, including but not limited to  {mlm.tokenizer.mask_token} demyelinating neuropathies, {mlm.tokenizer.mask_token} secondary to infection or systemic {mlm.tokenizer.mask_token}, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor {mlm.tokenizer.mask_token}, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
mlm(sent, top_k = 5)

[[{'score': 0.7487446665763855,
   'token': 6,
   'token_str': ',',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to, demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as distal cidp).[SEP]'},
  {'score': 0.05170542001724243,
   'token': 38,
   'token_str': ':',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to: demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as dist
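
With several masks in the input, the pipeline returns one list of candidates per mask. Below is a minimal sketch of how to pull out the best candidate for each mask, using a hand-written stand-in for the pipeline output: the first entry reuses the two candidates visible above (scores rounded), while the second entry is a made-up placeholder.

```python
# Stand-in for the nested list returned by `mlm(sent, top_k = 5)`.
# The first mask reuses scores shown above (rounded); the second is a placeholder.
predictions = [
    [
        {"score": 0.7487, "token": 6, "token_str": ","},
        {"score": 0.0517, "token": 38, "token_str": ":"},
    ],
    [
        {"score": 0.6, "token": -1, "token_str": "<placeholder-a>"},
        {"score": 0.3, "token": -1, "token_str": "<placeholder-b>"},
    ],
]

def best_per_mask(predictions):
    """Return the highest-scoring (token_str, score) pair for each mask position."""
    return [
        max(((c["token_str"], c["score"]) for c in candidates), key=lambda pair: pair[1])
        for candidates in predictions
    ]

for mask_index, (token_str, score) in enumerate(best_per_mask(predictions)):
    print(mask_index, repr(token_str), score)
```

Each mask is scored independently while the other masks stay in place, as the `[MASK]` tokens remaining in the `sequence` fields above show.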

[Table of content](#TOC)