<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Language Modeling
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Encoder pretraining using Masked Language Modeling task
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table Of Content

1. [Dataset](#data) <br>
2. [ALBERT-small training](#albert) <br>
3. [Inference](#inference) <br>



#### Reference

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Huggingface [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- Albert [original training protocol](https://github.com/google-research/albert)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset, 
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)
from transformers import AlbertConfig, AutoConfig, DataCollatorForLanguageModeling

# DL
import torch
from gensim.models import Word2Vec
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [3]:
transformers.__version__

'4.22.2'

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [6]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials CTTI')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [7]:
sys.path.insert(0, path_to_src)

#### Constants

In [8]:
dataset_name = 'clinical-trials-ctti'
final_dataset_name = 'clinical-trials-ctti-tokenized-blocks'
base_model_name = "albert-base-v2"
final_model_name = "albert-small-clinical-trials"

<a id="data"></a>

# 1. Dataset

[Table of content](#TOC)

We generate a collection of instances of the `datasets.Dataset` class. 

Note that these are different from the fairly generic `torch.utils.data.Dataset` class. 

## 1.1 Load Clinical Trials corpus

[Table of content](#TOC)

In [9]:
with open(os.path.join(path_to_data, '{}.txt'.format(dataset_name)), 'r', encoding = 'utf-8') as f:
    texts = [t.strip() for t in f.readlines()]

In [10]:
dataset = Dataset.from_dict({'text': texts}, features = Features({'text': Value(dtype = 'string')}))

In [11]:
len(dataset)

430108

In [12]:
dataset[:3]

{'text': ['This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease would, 

## 1.2 Load Clinical-Albert-small tokenizer

[Table of content](#TOC)

In [19]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

## 1.3 Tokenize corpus

[Table of content](#TOC)

In [14]:
def tokenize_text(examples, tokenizer, block_size):
    # `block_size` is not used here: blocks are formed later by `group_texts`.
    # Remove empty lines
    examples['text'] = [
        t for t in examples['text'] if len(t) > 0 and not t.isspace()
    ]
    # We request the `special_tokens_mask` because DataCollatorForLanguageModeling
    # (see below) is more efficient when it receives it.
    return tokenizer(examples["text"], return_special_tokens_mask = True)

In [15]:
tokenized_dataset = dataset.map(
    lambda examples: tokenize_text(examples, tokenizer, block_size = 512), 
    batched = True, 
    remove_columns = ["text"],
)

100%|████████████████████████████████████████████████████████████████████████████████| 431/431 [03:24<00:00,  2.11ba/s]


In contrast to the raw text, the tokenized data depends on the tokenizer, and is therefore _model-specific_.

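_Note_: the argument `remove_columns = ["text"]` drops the raw text once it has been tokenized. Because `tokenize_text` filters out empty lines, a batch may come back with fewer rows than it received, and all columns kept in the output must have the same number of rows.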

In [16]:
len(tokenized_dataset[0]['input_ids'])

280

In [17]:
tokenized_dataset[0]

{'input_ids': [2,
  38,
  28,
  22,
  122,
  7,
  5,
  233,
  9,
  2023,
  1289,
  9167,
  92,
  20,
  5,
  19,
  922,
  7723,
  5,
  125,
  35,
  18,
  6,
  5,
  11,
  80,
  245,
  318,
  6,
  13,
  4364,
  5,
  11,
  580,
  15,
  7,
  106,
  9,
  6067,
  318,
  172,
  479,
  13,
  844,
  1636,
  4276,
  5496,
  5,
  19,
  31,
  11,
  57,
  18,
  8,
  38,
  285,
  25,
  5,
  470,
  13,
  161,
  182,
  223,
  12,
  174,
  144,
  9,
  7,
  1486,
  3565,
  2371,
  6,
  9167,
  92,
  20,
  6,
  41,
  7,
  7988,
  14,
  4757,
  14,
  11,
  16,
  4427,
  3299,
  15,
  29,
  23,
  1636,
  4276,
  5496,
  8,
  7,
  843,
  98,
  25,
  3625,
  16,
  9,
  108,
  3045,
  12,
  22,
  685,
  5,
  11,
  478,
  14,
  658,
  6,
  204,
  14,
  783,
  1151,
  330,
  8,
  7,
  640,
  9,
  152,
  5,
  21,
  25,
  13,
  546,
  7,
  5,
  233,
  9,
  9167,
  92,
  20,
  5,
  66,
  10,
  8,
  204,
  13,
  580,
  8602,
  4650,
  4583,
  889,
  5,
  19,
  1217,
  57,
  18,
  240,
  6,
  43,
  179,
  43,
  13,
 

## 1.4 Form blocks of constant length

[Table of content](#TOC)


In [18]:
def group_texts(examples, block_size):
    # Concatenate all texts, column by column.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples['input_ids'])
    
    # Drop the small remainder so that every block has exactly block_size tokens.
    # Padding could be used instead of dropping; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size.
    result = {
        k: [t[i : i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [19]:
mlm_dataset = tokenized_dataset.map(lambda examples: group_texts(examples, block_size = 512), batched = True)
mlm_dataset.save_to_disk(os.path.join(path_to_data, final_dataset_name))

100%|████████████████████████████████████████████████████████████████████████████████| 430/430 [06:33<00:00,  1.09ba/s]
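
To make the grouping step concrete, here is a minimal, self-contained sketch that applies the same concatenate-then-split logic to a toy batch. The token ids and `block_size = 4` are made up for illustration, and the name `group_into_blocks` is hypothetical: it only mirrors `group_texts` above.

```python
from itertools import chain

def group_into_blocks(examples, block_size):
    """Concatenate every column, then cut it into blocks of `block_size` items."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so that every block is full.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    blocks = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]
    return blocks

# Toy batch: three tokenized texts of lengths 3, 5 and 2 (10 tokens in total).
toy = {
    "input_ids": [[2, 11, 3], [2, 12, 13, 14, 3], [2, 3]],
    "special_tokens_mask": [[1, 0, 1], [1, 0, 0, 0, 1], [1, 1]],
}
blocks = group_into_blocks(toy, block_size=4)
print(blocks["input_ids"])  # [[2, 11, 3, 2], [12, 13, 14, 3]]
```

Blocks may straddle text boundaries (the first block ends with the first token of the second text), and the last two tokens are dropped. This is also why the number of rows after grouping no longer matches the number of input texts.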


In [9]:
mlm_dataset = load_from_disk(os.path.join(path_to_data, final_dataset_name))

In [10]:
len(mlm_dataset)

430458

In [11]:
print(mlm_dataset[0])

{'input_ids': [2, 38, 28, 22, 122, 7, 5, 233, 9, 2023, 1289, 9167, 92, 20, 5, 19, 922, 7723, 5, 125, 35, 18, 6, 5, 11, 80, 245, 318, 6, 13, 4364, 5, 11, 580, 15, 7, 106, 9, 6067, 318, 172, 479, 13, 844, 1636, 4276, 5496, 5, 19, 31, 11, 57, 18, 8, 38, 285, 25, 5, 470, 13, 161, 182, 223, 12, 174, 144, 9, 7, 1486, 3565, 2371, 6, 9167, 92, 20, 6, 41, 7, 7988, 14, 4757, 14, 11, 16, 4427, 3299, 15, 29, 23, 1636, 4276, 5496, 8, 7, 843, 98, 25, 3625, 16, 9, 108, 3045, 12, 22, 685, 5, 11, 478, 14, 658, 6, 204, 14, 783, 1151, 330, 8, 7, 640, 9, 152, 5, 21, 25, 13, 546, 7, 5, 233, 9, 9167, 92, 20, 5, 66, 10, 8, 204, 13, 580, 8602, 4650, 4583, 889, 5, 19, 1217, 57, 18, 240, 6, 43, 179, 43, 13, 1737, 13, 161, 7, 106, 14, 5132, 30, 9, 9167, 92, 20, 144, 8, 7, 640, 9, 152, 5, 21, 21, 25, 13, 119, 7, 302, 14, 424, 144, 9, 9167, 92, 20, 62, 50, 25, 6, 101, 7615, 9, 1394, 57, 1289, 48, 9167, 92, 20, 4364, 5, 11, 580, 15, 7, 1704, 9, 6067, 5, 753, 13, 5, 10, 163, 39, 4495, 7, 5, 57, 39, 11, 3299, 1161, 1

In [20]:
print(tokenizer.decode(mlm_dataset[0]["input_ids"]), tokenizer.decode(mlm_dataset[0]["labels"]))

[CLS] this study will test the ability of extended release nifedipine (procardia xl), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (cah). this protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. the multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. the goal of phase i is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (acth) levels, as well as to begin to assess the dose-dependency of nifedipine effects. the goal of phase ii is to evaluate the long-term effects of nifedipine; that is, can attenuation of acth release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the hpa axis? such a decrease would, in tu

<a id="albert"></a>

# 2. ALBERT-small training

[Table of content](#TOC)

#### Tested combinations

- 1.4M-parameter model: converges fast (1 epoch) to a confusion score ≈ 2.2. Issue: fine-tuning for NER on Chia is hard; training remains stuck at a high training error and/or evaluation produces errors.
- 3.5M-parameter model: gets stuck at a confusion score ≈ 5.9. Training arguments: block_size = 512, batch size = 16, learning rate = 1e-4, gradient accumulation steps = 4, warmup steps = 500, number of layers = 8.

## 2.1 Build Clinical-Albert-small model

[Table of content](#TOC)

In [21]:
# original Albert config
config = AutoConfig.from_pretrained(base_model_name)
config

AlbertConfig {
  "_name_or_path": "albert-base-v2",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 12,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

In [22]:
# a smaller Albert config (original albert-base-v2 values are kept in trailing comments)
config = AlbertConfig(
    attention_probs_dropout_prob = 0,
    pad_token_id = 0,
    bos_token_id = 2,
    eos_token_id = 3,
    classifier_dropout_prob = 0.1,
    down_scale_factor = 1,
    embedding_size = 128,
    gap_size = 0,
    hidden_act = 'gelu_new',
    hidden_dropout_prob = 0,
    hidden_size = 512, # 768,
    initializer_range = 0.02,
    inner_group_num = 1,
    intermediate_size = 2048, # 3072,
    layer_norm_eps = 1e-12,
    max_position_embeddings = 512,
    model_type = 'albert',
    net_structure_type = 0,
    num_attention_heads = 8, # 12
    num_hidden_groups = 1,
    num_hidden_layers = 8, # 12
    num_memory_blocks = 0,
    position_embedding_type = 'absolute',
    transformers_version = '4.22.2',
    type_vocab_size = 2,
    vocab_size = 10000, # 30000,
)
model = AutoModelForMaskedLM.from_config(config)

In [23]:
model.num_parameters()

4640400
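
The parameter count above can be reproduced by hand. The sketch below is a worked count under my reading of the Huggingface ALBERT masked-LM architecture, not an official formula: it assumes a single shared transformer layer (`num_hidden_groups = 1`, `inner_group_num = 1`), no pooling layer, and an output decoder whose weights are tied to the token embeddings. The function name is hypothetical.

```python
def albert_mlm_parameter_count(vocab_size, embedding_size, hidden_size,
                               intermediate_size, max_positions=512, type_vocab_size=2):
    V, E, H, I = vocab_size, embedding_size, hidden_size, intermediate_size
    # Token, position and token-type embeddings, plus their LayerNorm.
    embeddings = V * E + max_positions * E + type_vocab_size * E + 2 * E
    # Linear projection from embedding size to hidden size.
    projection = E * H + H
    # Query, key, value and output projections, plus LayerNorm.
    attention = 4 * (H * H + H) + 2 * H
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = (H * I + I) + (I * H + H) + 2 * H
    # Counted once: all layers share the same weights.
    shared_layer = attention + feed_forward
    # MLM head: dense H -> E, LayerNorm, output bias (decoder weights are tied).
    mlm_head = (H * E + E) + 2 * E + V
    return embeddings + projection + shared_layer + mlm_head

print(albert_mlm_parameter_count(10000, 128, 512, 2048))  # 4640400
```

Under these assumptions `num_hidden_layers` does not appear in the count: because of cross-layer weight sharing, changing the depth changes compute but not the number of parameters.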

In [24]:
model = model.to(device)

## 2.2 Load pre-trained token embedding matrix (optional)

[Table of content](#TOC)

In [25]:
wv = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'w2v', 'sgram')).wv

In [26]:
# reindex rows in embedding table
base_id2w = {v: k for k, v in tokenizer.get_vocab().items()}
reindexing = [wv.key_to_index[base_id2w[i]] for i in range(len(base_id2w))]

In [27]:
reindexing[:5]

[9997, 9996, 9995, 9999, 9998]

In [28]:
# for i, reind_i in enumerate(reindexing[:5]):
#     print(reind_i)
#     print(wv.index_to_key[reind_i])
#     print(base_id2w[i])

In [29]:
token_embeddings = wv.vectors[reindexing] # wv.get_normed_vectors()[reindexing]
token_embeddings.shape

(10000, 128)
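
The reindexing step is needed because Word2Vec and the tokenizer assign different integer ids to the same tokens. Here is a minimal sketch of the idea with plain Python lists; the three-token vocabularies and the vectors are hypothetical toy values, not taken from the actual models.

```python
# Hypothetical toy vocabularies: the tokenizer and Word2Vec order tokens differently.
tokenizer_vocab = {"<pad>": 0, "trial": 1, "dose": 2}    # token -> tokenizer id
w2v_key_to_index = {"trial": 0, "dose": 1, "<pad>": 2}   # token -> Word2Vec row
w2v_vectors = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]       # one row per Word2Vec index

# For each tokenizer id, find the Word2Vec row that holds the same token.
id_to_token = {i: tok for tok, i in tokenizer_vocab.items()}
reindexing = [w2v_key_to_index[id_to_token[i]] for i in range(len(id_to_token))]

# Row i of the reordered table is now the vector of the token with tokenizer id i.
reordered = [w2v_vectors[row] for row in reindexing]

print(reindexing)  # [2, 0, 1]
print(reordered)   # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
```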

In [30]:
model.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.0,
 0.2158610224723816,
 0.21945318579673767,
 0.19634011387825012,
 0.2107628732919693,
 0.24488210678100586,
 0.22859203815460205,
 0.23756606876850128,
 0.23459665477275848,
 0.26320382952690125,
 0.2537899613380432,
 0.2184302806854248,
 0.225273996591568,
 0.22849233448505402,
 0.24241963028907776,
 0.2524573504924774,
 0.21078689396381378,
 0.22890456020832062,
 0.23088496923446655,
 0.22887878119945526,
 0.2318890243768692,
 0.2187117636203766,
 0.2266083061695099,
 0.24758616089820862,
 0.21598289906978607,
 0.21874119341373444,
 0.22888371348381042,
 0.23187004029750824,
 0.2402448058128357,
 0.22504548728466034,
 0.21476301550865173,
 0.22808954119682312,
 0.2430512011051178,
 0.21919871866703033,
 0.22482645511627197,
 0.219651460647583,
 0.2158142328262329,
 0.23668523132801056,
 0.22299842536449432,
 0.2047591209411621,
 0.20640283823013306,
 0.23148047924041748,
 0.23868654668331146,
 0.20136794447898865,
 0.2118317037820816,
 0.21921397745609283,
 0.2153385579586029,


In [31]:
np.linalg.norm(token_embeddings, axis = -1).tolist()

[0.051318228244781494,
 0.05024795979261398,
 1.2570383548736572,
 0.052062246948480606,
 0.04846731945872307,
 1.4936128854751587,
 1.6605130434036255,
 1.7028515338897705,
 1.4862797260284424,
 1.7892918586730957,
 1.794647216796875,
 1.6289222240447998,
 1.631468415260315,
 2.0154943466186523,
 2.1520259380340576,
 1.8849847316741943,
 2.0168864727020264,
 1.7805209159851074,
 1.6925245523452759,
 2.0719501972198486,
 1.9361293315887451,
 1.8657211065292358,
 2.301326036453247,
 1.967976689338684,
 2.19937801361084,
 2.3560495376586914,
 1.908614993095398,
 2.1233270168304443,
 2.317613124847412,
 2.2105886936187744,
 2.2597129344940186,
 1.949370265007019,
 1.9564261436462402,
 2.2539544105529785,
 2.0055251121520996,
 2.1048340797424316,
 1.923978328704834,
 1.9290603399276733,
 2.2513070106506348,
 1.9028393030166626,
 2.147373914718628,
 2.165884494781494,
 2.0926897525787354,
 2.213029623031616,
 2.0426077842712402,
 2.1423630714416504,
 2.433199405670166,
 2.362342119216919,
 

In [32]:
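# Note: `from_pretrained` freezes the embedding weights by default; pass `freeze = False` to keep training them.
model.albert.embeddings.word_embeddings = model.albert.embeddings.word_embeddings.from_pretrained(torch.tensor(token_embeddings), padding_idx = tokenizer.pad_token_id)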

## 2.3 Model training

[Table of content](#TOC)

`albert-base-v2` training parameters, as provided in https://github.com/google-research/albert/blob/master/run_pretraining.py:
- max_predictions_per_seq = `20`
- train_batch_size = `4096`
- optimizer = `"lamb"`
- learning_rate = `0.00176`
- poly_power = `1.0`
- num_train_steps = `125000`
- num_warmup_steps = `3125`
- start_warmup_step = `0`
- iterations_per_loop = `1000`

The original optimizer is `lamb`, which was designed for very large batch sizes (see the [Lamb paper](https://arxiv.org/pdf/1904.00962.pdf)). Here we instead use the default [AdamW](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.AdamW) optimizer (see the [AdamW paper](https://arxiv.org/pdf/1711.05101.pdf)) with [linear learning rate decay](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup), as specified in the [Trainer class documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.optimizers).

In [33]:
batch_size = 12 # <= 15 for GPU RAM = 8GB

In [34]:
model = model.train()

In [35]:
args = TrainingArguments(
    os.path.join(path_to_save, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 5e-4,
    num_train_epochs = 2, # 3
    warmup_steps = 1500,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)
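
With these arguments the `Trainer` falls back to its default linear schedule: the learning rate rises linearly during warmup, then decays linearly to zero. The sketch below re-implements that shape in plain Python for illustration; `linear_schedule_lr` is a hypothetical helper, and the step count assumes a single device with 430458 training blocks, batch size 12 and 2 epochs.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

num_examples, batch_size, num_epochs = 430458, 12, 2
# One optimization step per batch; the last, incomplete batch still counts.
total_steps = math.ceil(num_examples / batch_size) * num_epochs
print(total_steps)  # 71744

for step in (0, 750, 1500, 36000, total_steps):
    print(step, linear_schedule_lr(step, 5e-4, 1500, total_steps))
```

The peak value `5e-4` is reached at step 1500, about 2% of the way through training, and the rate reaches zero at the final step.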

In [36]:
trainer = Trainer(
    model,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

Some remarks:

- The `data_collator` is the object used to batch elements of the training & evaluation datasets. Here it also applies the random masking, with `mlm_probability = 0.15`.
- The `tokenizer` is passed to the data collator, which uses it to pad the inputs when batching and to look up the mask token. Since it is not passed to the `Trainer`, it is not saved along with the model and must be saved or reloaded separately.
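
To illustrate what the masking collator does to each block, here is a per-token sketch of the usual masked language modeling recipe: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Unselected positions get the label `-100`, which the loss ignores. This is a plain-Python approximation, not the library code: `mask_tokens`, the token ids and `mask_token_id = 4` are hypothetical.

```python
import random

def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) following the 80/10/10 masking recipe."""
    inputs, labels = list(input_ids), []
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        if is_special or rng.random() >= mlm_probability:
            labels.append(-100)        # position ignored by the loss
            continue
        labels.append(token)           # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            inputs[i] = mask_token_id  # 80%: replace with the mask token
        elif draw < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

rng = random.Random(0)
ids = [2] + list(range(100, 120)) + [3]   # toy block with special tokens at both ends
special = [1] + [0] * 20 + [1]
masked, labels = mask_tokens(ids, special, mask_token_id=4, vocab_size=10000, rng=rng)
print(masked)
print(labels)
```

Because the masking is redrawn every time a batch is built, the same block is masked differently from one epoch to the next.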

In [37]:
torch.cuda.empty_cache()

In [38]:
# lr = 5e-4, bs = 12
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 430458
  Num Epochs = 2
  Instantaneous batch size per device = 12
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 1
  Total optimization steps = 71744
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,8.9753
200,8.3946
300,7.5239
400,6.6595
500,6.238
600,6.1867
700,6.1586
800,6.1325
900,6.0819
1000,6.0472




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=71744, training_loss=2.6739787279123926, metrics={'train_runtime': 18202.4886, 'train_samples_per_second': 47.297, 'train_steps_per_second': 3.941, 'total_flos': 1.2098621094690816e+16, 'train_loss': 2.6739787279123926, 'epoch': 2.0})
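
The throughput figures in this output can be checked from the other reported values. The sketch below assumes that samples per second is the total number of samples processed (examples times epochs) divided by the wall-clock runtime; all numbers are copied from the run above.

```python
num_examples, num_epochs = 430458, 2
global_step, train_runtime = 71744, 18202.4886  # values reported above

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime
hours = train_runtime / 3600

print(round(samples_per_second, 3))  # 47.297
print(round(steps_per_second, 3))    # 3.941
print(round(hours, 2))               # 5.06
```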

In [41]:
# lr = 5e-4, bs = 16
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 471374
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 88383
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,8.2933
200,7.7022
300,6.8749
400,6.151
500,5.9347
600,5.9097
700,5.8714
800,5.8398
900,5.8018
1000,5.7355




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=88383, training_loss=2.3243570542808394, metrics={'train_runtime': 21557.1365, 'train_samples_per_second': 65.599, 'train_steps_per_second': 4.1, 'total_flos': 1.0942092842876928e+16, 'train_loss': 2.3243570542808394, 'epoch': 3.0})

In [39]:
model = model.to('cpu')

In [40]:
model.save_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

Configuration saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\albert-small-clinical-trials\model\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\albert-small-clinical-trials\model\pytorch_model.bin


<a id="inference"></a>

# 3. Inference

[Table of content](#TOC)

In [32]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-small\model\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\clinical-trials-albert-small\\model",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.05,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "n

In [41]:
mlm = pipeline(
    task = 'fill-mask', 
    model = model, 
    tokenizer = tokenizer,
    framework = 'pt',
)

In [42]:
sent = 'Polyneuropathy of other causes, including but not limited to hereditary demyelinating neuropathies, neuropathies secondary to infection or systemic disease, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor neuropathy, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
sent = f'Polyneuropathy of other causes, including but not limited to {mlm.tokenizer.mask_token} demyelinating neuropathies,  {mlm.tokenizer.mask_token} secondary to infection or systemic {mlm.tokenizer.mask_token}, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor {mlm.tokenizer.mask_token}, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
mlm(sent, top_k = 5)

[[{'score': 0.16992007195949554,
   'token': 99,
   'token_str': 'other',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to other demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as distal cidp).[SEP]'},
  {'score': 0.07117640227079391,
   'token': 508,
   'token_str': 'systemic',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to systemic demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neu
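
When the input contains several mask tokens, the `fill-mask` pipeline returns one list of candidates per mask (a list of lists), whereas a single mask yields a flat list. The helper below is a small sketch for reading such results; `best_fill_per_mask` is a hypothetical name, and the sample candidates are the first two entries of the output above with the `sequence` field omitted.

```python
def best_fill_per_mask(results):
    """Return the highest-scoring `token_str` for each masked position."""
    # A single mask yields a flat list of dicts; wrap it to get one list per mask.
    if results and isinstance(results[0], dict):
        results = [results]
    return [max(candidates, key=lambda c: c["score"])["token_str"] for candidates in results]

# First two candidates for the first mask, copied from the output above.
first_mask = [
    {"score": 0.16992007195949554, "token": 99, "token_str": "other"},
    {"score": 0.07117640227079391, "token": 508, "token_str": "systemic"},
]
print(best_fill_per_mask([first_mask]))  # ['other']
print(best_fill_per_mask(first_mask))    # ['other']
```

Each mask is scored independently, so the `sequence` field of a candidate fills only its own mask and leaves the others as `[MASK]`, as visible in the output above.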

[Table of content](#TOC)