<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Language Modeling
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Encoder pretraining using custom Language Modeling task
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Dataset](#data) <br>
2. [ALBERT-small training](#albert) <br>
3. [Inference](#inference) <br>



#### Reference

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Huggingface [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- Albert [original training protocol](https://github.com/google-research/albert)

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset, 
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)

# DL
import torch
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [4]:
transformers.__version__

'4.22.2'

In [5]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [6]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [7]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials CTTI')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [8]:
sys.path.insert(0, path_to_src)

#### Constants

In [9]:
dataset_name = 'clinical-trials-ctti'
base_model_name = "albert-small-clinical-trials"
final_model_name = "albert-small-clinical-trials-dlm"

<a id="data"></a>

# 1. Dataset

[Table of content](#TOC)

We generate a collection of instances of the `datasets.Dataset` class. 

Note that these are different from the fairly generic `torch.utils.data.Dataset` class. 

## 1.1 Load Clinical Trials corpus

[Table of content](#TOC)

In [10]:
df_trials = pd.read_csv(os.path.join(path_to_data, '{}.tsv'.format(dataset_name)), sep = "\t")
df_trials = df_trials.fillna('')


df_trials.head(10)

| Unnamed: 0 | Id | Summary | Description | IE_criteria | Condition | Purpose | Intervention |
|---|---|---|---|---|---|---|---|
| 0 | NCT0000xxxx/NCT00000102.xml | This study will test the ability of extended r... | This protocol is designed to assess both acute... | diagnosed with Congenital Adrenal Hyperplasia ... | Congenital Adrenal Hyperplasia | Treatment | Nifedipine |
| 1 | NCT0000xxxx/NCT00000104.xml | Inner city children are at an increased risk f... | | | Lead Poisoning | | ERP measures of attention and memory |
| 2 | NCT0000xxxx/NCT00000105.xml | The purpose of this study is to learn how the ... | Patients will receive each vaccine once only c... | Patients must have a diagnosis of cancer of an... | Cancer | | Intracel KLH Vaccine |
| 3 | NCT0000xxxx/NCT00000106.xml | Recently a non-toxic system for whole body hyp... | | | Rheumatic Diseases | Treatment | Whole body hyperthermia unit |
| 4 | NCT0000xxxx/NCT00000107.xml | Adults with cyanotic congenital heart disease ... | | Resting blood pressure below 140/90 | Heart Defects, Congenital | | |
| 5 | NCT0000xxxx/NCT00000108.xml | The purpose of this research is to find out wh... | | Postmenopausal and preferably on hormone repla... | Cardiovascular Diseases | Prevention | Exercise |
| 6 | NCT0000xxxx/NCT00000110.xml | The purpose of this pilot investigation is to ... | | Healthy volunteers (developmental phase)\nHeal... | Obesity | Treatment | magnetic resonance spectroscopy |
| 7 | NCT0000xxxx/NCT00000111.xml | The purpose of this study is to see if we can ... | | Lack sufficient attached keratinized tissue at... | Mouth Diseases | Treatment | Oral mucosal graft |
| 8 | NCT0000xxxx/NCT00000112.xml | The prevalence of obesity in children is reach... | | Obesity: BM +/- 95% for age general good health | Obesity | | |
| 9 | NCT0000xxxx/NCT00000113.xml | To evaluate whether progressive addition lense... | Myopia (nearsightedness) is an important publi... | | Myopia | Treatment | Progressive Addition Lenses |


In [11]:
# df_trials = df_trials.iloc[:1000,:]

In [12]:
df_trials.shape

(430108, 7)

In [13]:
texts = df_trials[['Summary', 'Description', 'IE_criteria']].values.tolist()
texts = [t.strip() for ts in texts for t in ts if len(t.strip())>=50]
# texts = df_trials.IE_criteria.tolist()
# texts = [t for t in texts if len(t.strip())>=50]
len(texts)

1041969

In [14]:
dataset = Dataset.from_dict({'text': texts}, features = Features({'text': Value(dtype = 'string')}))

In [15]:
dataset[:3]

{'text': ['This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH).',
  'This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease wo

## 1.2 Build Clinical-Albert-small tokenizer

[Table of content](#TOC)


In [16]:
def batch_iterator(dataset, batch_size = 512):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]['text']

In [17]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, base_model_name, 'tokenizer'))

## 1.3 Tokenize corpus

[Table of content](#TOC)


In [18]:
# We pass `return_special_tokens_mask = True` because the data collator (see section 2.2)
# is more efficient when it receives the `special_tokens_mask` instead of recomputing it.
def tokenize_text(examples, tokenizer):
    # Remove empty lines (without modifying the input batch in place)
    texts = [
        t for t in examples['text'] if len(t) > 0 and not t.isspace()
    ]
    return tokenizer(texts, return_special_tokens_mask = True)

In [19]:
# tokenized_dataset = dataset.map(lambda examples: tokenize_text(examples, tokenizer), batched = True, remove_columns = ["text"])

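Unlike the raw text, the tokenized data depends on the tokenizer and is therefore _model-specific_.

_Note_: the argument `remove_columns = ["text"]` is mandatory. A batched `map` may return a different number of rows than it receives (here, empty lines are dropped), so the original `text` column has to be removed for all columns of the dataset to keep the same length.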
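To see why the original column has to go, here is a minimal sketch in plain Python, with no `datasets` call involved. The function `toy_tokenize` is a hypothetical stand-in for `tokenize_text`: it drops blank lines in the same way, and its "token ids" are made up for illustration.

```python
def toy_tokenize(batch):
    """Hypothetical stand-in for tokenize_text: drops blank lines, then 'tokenizes'."""
    kept = [t for t in batch["text"] if len(t) > 0 and not t.isspace()]
    # Fake token ids: one integer (the word length) per whitespace-separated word
    return {"input_ids": [[len(w) for w in t.split()] for t in kept]}


batch = {"text": ["first trial summary", "   ", "second summary"]}
output = toy_tokenize(batch)

# Keeping the old column next to the new one gives columns of unequal length
merged = {**batch, **output}
column_lengths = {name: len(rows) for name, rows in merged.items()}
print(column_lengths)  # {'text': 3, 'input_ids': 2}
```

Dropping `text` leaves only columns produced by the mapped function, which all have the same number of rows.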

## 1.4 Form blocks of constant length

[Table of content](#TOC)


In [20]:
def group_texts(examples, block_size):
    # Concatenate all texts, skipping the raw 'text' column if it is still present.
    keys = [k for k in examples.keys() if k != 'text']
    concatenated_examples = {k: list(chain(*examples[k])) for k in keys}
    total_length = len(concatenated_examples[keys[0]])
    
    # We drop the small remainder. We could pad instead of dropping if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size.
    result = {
        k: [t[i : i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
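A small worked example may help to see what the grouping does. The sketch below re-implements the same concatenate-then-split logic under the hypothetical name `group_into_blocks`, on made-up token ids in which `2` and `3` play the roles of `[CLS]` and `[SEP]`, with a block size of 4 instead of 512.

```python
from itertools import chain


def group_into_blocks(examples, block_size):
    """Concatenate every column, then cut it into blocks of `block_size` items."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Three tokenized "texts" of lengths 3, 4 and 3
toy = {"input_ids": [[2, 10, 3], [2, 11, 12, 3], [2, 13, 3]]}
blocks = group_into_blocks(toy, block_size=4)
print(blocks["input_ids"])  # [[2, 10, 3, 2], [11, 12, 3, 2]]
```

The 10 concatenated tokens yield two full blocks and the last two tokens are dropped. Blocks may start or end in the middle of a text, which is why `[SEP][CLS]` appears inside the decoded block shown further down.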

In [21]:
# mlm_dataset = tokenized_dataset.map(lambda examples: group_texts(examples, block_size = 512), batched = True)
# mlm_dataset.save_to_disk(os.path.join(path_to_data, 'tmp', dataset_name))

In [22]:
lm_dataset = load_from_disk(os.path.join(path_to_data, 'tmp', dataset_name))

In [23]:
# texts[0], texts[1]

In [24]:
print(tokenizer.decode(lm_dataset[0]["input_ids"]), tokenizer.decode(lm_dataset[0]["labels"]))

[CLS] this study will test the ability of extended release nifedipine (procardia xl), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (cah).[SEP][CLS] this protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. the multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. the goal of phase i is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (acth) levels, as well as to begin to assess the dose-dependency of nifedipine effects. the goal of phase ii is to evaluate the long-term effects of nifedipine; that is, can attenuation of acth release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the hpa axis? such a decrease wo

In [25]:
print(lm_dataset[0])

{'input_ids': [2, 34, 30, 16, 129, 6, 621, 7, 2773, 1772, 5, 31, 17, 37, 39, 17, 1903, 202, 2829, 811, 17, 9, 5, 82, 42, 110, 5, 9, 85, 278, 358, 10, 13, 201, 879, 5, 9, 725, 14, 6, 118, 7, 4009, 358, 200, 515, 13, 1085, 2315, 3955, 587, 2153, 17, 9, 227, 9, 48, 67, 3, 2, 34, 281, 24, 586, 13, 175, 194, 255, 12, 186, 156, 7, 6, 2122, 3839, 3097, 10, 5, 31, 17, 37, 39, 17, 1903, 10, 40, 6, 983, 155, 53, 9, 1636, 22, 36, 1288, 17, 15, 438, 22, 9, 23, 114, 1628, 3673, 14, 29, 19, 2315, 3955, 587, 2153, 17, 9, 11, 6, 1927, 121, 24, 944, 2668, 7, 117, 154, 8, 12, 16, 2672, 5, 9, 1760, 228, 1121, 1931, 467, 11, 6, 631, 7, 154, 5, 17, 24, 13, 675, 6, 621, 7, 5, 31, 17, 37, 39, 17, 1903, 1935, 228, 13, 725, 5, 9, 23, 114, 31, 20, 3050, 75, 20, 3350, 1179, 277, 32, 155, 28, 276, 10, 38, 206, 38, 13, 2434, 13, 175, 6, 118, 22, 2194, 1194, 7, 5, 31, 17, 37, 39, 17, 1903, 156, 11, 6, 631, 7, 154, 5, 17, 17, 24, 13, 127, 6, 754, 156, 7, 5, 31, 17, 37, 39, 17, 1903, 71, 46, 24, 10, 100, 41, 752, 61,

In [26]:
len(lm_dataset)

421077

<a id="albert"></a>

# 2. ALBERT-small training

[Table of content](#TOC)

#### Tested combinations

- 1.4M parameter model: converges fast (1 epoch) towards a confusion score ~= 2.2. Issue: fine-tuning for NER on Chia is hard; training gets stuck at a high training error and/or produces evaluation errors.
- 3.5M parameter model: gets stuck at a confusion score ~= 5.9. Training args: block_size = 512, bs = 16, lr = 1e-4, grad_acc_step = 4, warmup_step = 500, num_layer = 8.

## 2.1 Build Clinical-Albert-small model

[Table of content](#TOC)

In [27]:
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, base_model_name, 'model'))

In [28]:
model.num_parameters()

2584584

In [29]:
model = model.to(device)

## 2.2 Data collator for denoising language model

[Table of content](#TOC)

In [30]:
import random
import math
import warnings
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union


from transformers import PreTrainedTokenizerBase
from transformers.data.data_collator import DataCollatorMixin, _torch_collate_batch

In [31]:
@dataclass
class CustomDataCollatorForLanguageModeling(DataCollatorMixin):
    """
    Data collator used for denoising language modeling. Inputs are dynamically padded to the maximum length of a
    batch if they are not all of the same length.
    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        task_proportions (`tuple`, *optional*, defaults to `(12, 1.5, 1.5, 85)`):
            Relative weights of the four token groups [mask | random noise | keep to learn | keep to ignore].
            They are normalized to sum to 1. Labels are set to -100 for the "keep to ignore" group and for
            special tokens, and to the original token id for the three other groups.
        task_loss_weighting_coefs (`tuple`, *optional*, defaults to `(1., 1., 1.)`):
            Loss weights for the [mask | random noise | keep to learn] groups. Currently unused.
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
        return_tensors (`str`):
            The type of Tensor to return. Only "pt" is supported by this collator.
    <Tip>
    For best performance, this data collator should be used with a dataset having items that are dictionaries or
    BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a
    [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
    </Tip>"""

    tokenizer: PreTrainedTokenizerBase
    task_proportions: tuple = (12, 1.5, 1.5, 85)    # mask | random noise | keep to learn | keep to ignore
    task_loss_weighting_coefs: tuple = (1., 1., 1.) # mask | random noise | keep to learn
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"


    def __post_init__(self):
        import torch
        
        # Normalize the entries of task_proportions so that they sum to 1
        self.task_proportions = tuple(abs(v) for v in self.task_proportions)
        if sum(self.task_proportions) > 0:
            tot = sum(self.task_proportions)
            self.task_proportions = torch.tensor(tuple(v/tot for v in self.task_proportions))
            print(self.task_proportions)
        else:
            raise ValueError('The entries of "task_proportions" must have a positive sum')

        
    def tf_call(self, *args, **kwargs):
        raise NotImplementedError("This data collator is Pytorch-only")


    def numpy_call(self, *args, **kwargs):
        raise NotImplementedError("This data collator is Pytorch-only")
    

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
        # Handle dict or lists with proper padding and conversion to tensor.
        if isinstance(examples[0], Mapping):
            batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
        else:
            batch = {
                "input_ids": _torch_collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)
            }
        special_tokens_mask = batch.pop("special_tokens_mask", None)
        batch["input_ids"], batch["labels"] = self.torch_edit_tokens(
            inputs = batch["input_ids"],
            special_tokens_mask = special_tokens_mask,
        )
        return batch


    def torch_edit_tokens(self, inputs: Any, special_tokens_mask: Optional[Any] = None) -> Tuple[Any, Any]:
        """
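        Prepare noisy input tokens and labels for denoising language modeling: each non-special token is
        randomly assigned to one of the groups [mask | random noise | keep to learn | keep to ignore].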
        """
        # init target labels
        labels = inputs.clone()
        
        # get special tokens mask
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()
            
        # We split tokens into [mask | random noise | keep to learn | keep to ignore] subgroups
        # through random sampling according to proportions given in self.task_proportions
        distribution_matrix = torch.multinomial(self.task_proportions, math.prod(labels.shape), replacement = True).reshape(labels.shape)
        distribution_matrix.masked_fill_(special_tokens_mask, value = 3)
        
        mask_matrix = (distribution_matrix == 0).bool()
        rand_matrix = (distribution_matrix == 1).bool()
        keep_matrix = (distribution_matrix == 2).bool()
        ignr_matrix = (distribution_matrix == 3).bool()

        # we mask part of tokens
        inputs[mask_matrix] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # we replace part of tokens by random ones
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[rand_matrix] = random_words[rand_matrix]
        
        # we keep part of tokens as-is
        # inputs[keep_matrix] = inputs[keep_matrix]
        
        # We only compute the loss on masked / randomized / learned tokens:
        # ignored tokens (including special tokens) get the label -100
        labels[ignr_matrix] = -100 
        return inputs, labels
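The four-way split performed by `torch_edit_tokens` can be illustrated without `torch`. The sketch below is a plain-Python approximation under stated assumptions: `MASK_ID` and `VOCAB_SIZE` are hypothetical values, `assign_tasks` and `edit_tokens` are illustrative names, and sampling uses `random.choices` instead of `torch.multinomial`.

```python
import random

MASK_ID = 4          # hypothetical id of the [MASK] token
VOCAB_SIZE = 100     # hypothetical vocabulary size
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def assign_tasks(n_tokens, proportions, special_positions=(), rng=random):
    """One task id per token: 0 = mask, 1 = random noise, 2 = keep to learn, 3 = keep to ignore."""
    tasks = rng.choices(range(4), weights=proportions, k=n_tokens)
    # Special tokens are always ignored, whatever was drawn for them
    for pos in special_positions:
        tasks[pos] = 3
    return tasks


def edit_tokens(token_ids, tasks, rng=random):
    """Return (noisy inputs, labels) for one sequence."""
    inputs, labels = list(token_ids), list(token_ids)
    for i, task in enumerate(tasks):
        if task == 0:
            inputs[i] = MASK_ID                    # replaced by [MASK], original id is the label
        elif task == 1:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # replaced by a random id, original id is the label
        elif task == 3:
            labels[i] = IGNORE_INDEX               # left as-is and excluded from the loss
        # task == 2: left as-is, original id is the label
    return inputs, labels


rng = random.Random(0)
token_ids = [2] + [rng.randrange(5, VOCAB_SIZE) for _ in range(9_998)] + [3]
tasks = assign_tasks(
    len(token_ids), (12, 1.5, 1.5, 85), special_positions=(0, len(token_ids) - 1), rng=rng
)
inputs, labels = edit_tokens(token_ids, tasks, rng)

# Observed share of each group, close to 0.12 / 0.015 / 0.015 / 0.85
shares = [tasks.count(t) / len(tasks) for t in range(4)]
print([round(s, 3) for s in shares])
```

With the default proportions, roughly 15% of the tokens contribute to the loss, which is comparable to the usual masked language modeling setup.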

## 2.3 Model training

[Table of content](#TOC)

`Albert-base-v2` training parameters as provided in https://github.com/google-research/albert/blob/master/run_pretraining.py : 
- max_predictions_per_seq = `20`
- train_batch_size = `4096`
- optimizer = `"lamb"`
- learning_rate = `0.00176`
- poly_power = `1.0`
- num_train_steps = `125000`
- num_warmup_steps = `3125`
- start_warmup_step = `0`
- iterations_per_loop = `1000`

The original optimizer is `lamb`, which was designed for very large batch sizes (see the [Lamb paper](https://arxiv.org/pdf/1904.00962.pdf)). Here we use instead the default [AdamW](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.AdamW) optimizer (see the [AdamW paper](https://arxiv.org/pdf/1711.05101.pdf)) with [linear learning rate decay](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup), as specified in the [Trainer class documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.optimizers).
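The warmup-then-linear-decay schedule can be sketched in a few lines. This is a plain-Python approximation of what `get_linear_schedule_with_warmup` computes, not the library code itself. The name `linear_warmup_decay` is hypothetical, and the values used (`5e-5`, `1500` warmup steps, `26318` total steps) are taken from the training arguments and logs of this notebook.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimization step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup: the rate grows linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: the rate shrinks linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 750, 1500, 13909, 26318):
    lr = linear_warmup_decay(step, base_lr=5e-5, warmup_steps=1500, total_steps=26318)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

With these settings, the first 1000 logged steps of each run all fall inside the warmup phase, so the learning rate has not yet reached its nominal value when the runs are interrupted.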

In [35]:
batch_size = 16

In [36]:
model = model.train()

In [40]:
args = TrainingArguments(
    os.path.join(path_to_save, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 5e-5,
    num_train_epochs = 1,
    warmup_steps = 1500,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)

In [84]:
trainer = Trainer(
    model,
    args,
    data_collator = CustomDataCollatorForLanguageModeling(
        tokenizer = tokenizer, 
        task_proportions = (2, 2, 2, 4),
    ),
    train_dataset = lm_dataset,
)

tensor([0.2000, 0.2000, 0.2000, 0.4000])
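The tensor printed above is the normalized version of `task_proportions`. As a quick check, here is the same normalization in plain Python; the helper name `normalize_proportions` is hypothetical and it returns a list rather than a `torch` tensor.

```python
def normalize_proportions(proportions):
    """Take absolute values and rescale so that the entries sum to 1."""
    weights = [abs(v) for v in proportions]
    total = sum(weights)
    if total <= 0:
        raise ValueError("the entries must have a positive sum")
    return [w / total for w in weights]


print(normalize_proportions((2, 2, 2, 4)))         # [0.2, 0.2, 0.2, 0.4]
print(normalize_proportions((12, 1.5, 1.5, 85)))   # [0.12, 0.015, 0.015, 0.85]
```

With `(2, 2, 2, 4)`, 60% of the non-special tokens contribute to the loss, against 15% with the default `(12, 1.5, 1.5, 85)`.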


Some remarks:

- The `data_collator` is the object used to batch elements of the training dataset (no evaluation dataset is used here).
- The `tokenizer` is given to the data collator, which uses it to pad the inputs to the maximum length of each batch and to look up the mask token. Passing it to the `Trainer` as well would have it saved along with the model, which makes it easier to rerun an interrupted training or reuse the fine-tuned model.

In [85]:
torch.cuda.empty_cache()

In [86]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 26318


Step,Training Loss
100,3.1432
200,3.0738
300,3.0598
400,3.0486
500,3.0304
600,3.0269
700,3.0279
800,3.0134
900,3.0097
1000,2.9954


KeyboardInterrupt: 

In [29]:
# lr = 5e-4, bs = 16
trainer.train()

***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 26318
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,8.3432
200,7.8738
300,7.2166
400,6.6463
500,6.4647
600,6.4271
700,6.3882
800,6.3372
900,6.3181
1000,6.2698




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=26318, training_loss=5.433946623266971, metrics={'train_runtime': 8344.1643, 'train_samples_per_second': 50.464, 'train_steps_per_second': 3.154, 'total_flos': 2430308656078848.0, 'train_loss': 5.433946623266971, 'epoch': 1.0})

In [31]:
# lr = 1e-3, bs = 8
trainer.train()

***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 52635


Step,Training Loss
100,8.2826
200,7.5547
300,6.7052
400,6.4918
500,6.4361
600,6.3876
700,6.3413
800,6.3087
900,6.3002
1000,6.2396


KeyboardInterrupt: 

In [29]:
# lr = 1e-3, bs = 16
trainer.train()

***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 26318


Step,Training Loss
100,8.2642
200,7.4874
300,6.6608
400,6.4428
500,6.4064
600,6.3602
700,6.3041
800,6.2411
900,6.2244
1000,6.1822


KeyboardInterrupt: 

In [25]:
# lr = 2e-3, bs = 16
trainer.train()

***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 26318


Step,Training Loss
100,8.1335
200,6.9558
300,6.4553
400,6.3898
500,6.3327
600,6.2727
700,6.2289
800,6.1799
900,6.17
1000,6.133


KeyboardInterrupt: 

In [37]:
# lr = 1e-4, bs = 16
trainer.train()

***** Running training *****
  Num examples = 421077
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 26318


Step,Training Loss
20,8.5361
40,8.5102
60,8.4631
80,8.4007
100,8.3532
120,8.318
140,8.2864
160,8.2532
180,8.2259
200,8.186


KeyboardInterrupt: 

In [35]:
trainer.train()

***** Running training *****
  Num examples = 423456
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 8
  Total optimization steps = 3308


Step,Training Loss
50,8.4792
100,8.3082
150,8.1946
200,8.0713
250,7.9173
300,7.7324
350,7.5348
400,7.3335
450,7.1371
500,6.9451


KeyboardInterrupt: 
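The "Total optimization steps" reported in the logs above can be reproduced with a little arithmetic. The sketch below assumes a single device, a dataloader that keeps the last partial batch, and a floor division for gradient accumulation, which is how the `Trainer` appears to count steps in this version. The function name `total_optimization_steps` is hypothetical.

```python
def total_optimization_steps(num_examples, per_device_batch_size, grad_accum_steps=1, num_epochs=1):
    """Number of optimizer updates for a single-device run."""
    # Number of batches per epoch, the last (possibly smaller) batch included
    batches_per_epoch = (num_examples + per_device_batch_size - 1) // per_device_batch_size
    # One optimizer update every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


print(total_optimization_steps(421077, 16))                      # 26318
print(total_optimization_steps(421077, 8))                       # 52635
print(total_optimization_steps(423456, 16, grad_accum_steps=8))  # 3308
```

Halving the batch size doubles the number of updates per epoch, while gradient accumulation over 8 batches divides it by 8 for an effective batch size of 128.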

In [24]:
model = model.to('cpu')

In [25]:
model.save_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

Configuration saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-small\model\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-small\model\pytorch_model.bin


<a id="inference"></a>

# 3. Inference

[Table of content](#TOC)

In [32]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-small\model\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\clinical-trials-albert-small\\model",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.05,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "n

In [39]:
mlm = pipeline(
    task = 'fill-mask', 
    model = model, 
    tokenizer = tokenizer,
    framework = 'pt',
)

In [40]:
sent = 'Polyneuropathy of other causes, including but not limited to hereditary demyelinating neuropathies, neuropathies secondary to infection or systemic disease, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor neuropathy, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
sent = f'Polyneuropathy of other causes, including but not limited to {mlm.tokenizer.mask_token} demyelinating neuropathies,  {mlm.tokenizer.mask_token} secondary to infection or systemic {mlm.tokenizer.mask_token}, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor {mlm.tokenizer.mask_token}, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
mlm(sent, top_k = 5)

[[{'score': 0.2646515965461731,
   'token': 6,
   'token_str': 'the',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to the demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as distal cidp).[SEP]'},
  {'score': 0.07012559473514557,
   'token': 105,
   'token_str': 'other',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to other demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (als
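When the input contains several mask tokens, the `fill-mask` pipeline returns one list of candidates per masked position, as in the nested output above; with a single mask it returns a flat list of candidates. The sketch below shows one way to read such a result. The helper name `best_candidates` is hypothetical, and the sample data is an excerpt of the output above (first masked position, top 2 of its 5 candidates, `sequence` field omitted).

```python
def best_candidates(predictions):
    """Return the highest-scoring token string for each masked position."""
    # A single-mask result is a flat list of dicts: wrap it to get the nested form
    if predictions and isinstance(predictions[0], dict):
        predictions = [predictions]
    return [max(cands, key=lambda c: c["score"])["token_str"] for cands in predictions]


# Excerpt of the pipeline output shown above
predictions = [
    [
        {"score": 0.2646515965461731, "token": 6, "token_str": "the"},
        {"score": 0.07012559473514557, "token": 105, "token_str": "other"},
    ]
]
print(best_candidates(predictions))  # ['the']
```

Each masked position is predicted independently of the others: in the `sequence` field of a candidate, the remaining positions still show `[MASK]`.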

[Table of content](#TOC)