<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Language Modeling
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Encoder pretraining using Masked Language Modeling task
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Dataset](#data) <br>
2. [ALBERT-small training](#albert) <br>
3. [Inference](#inference) <br>



#### References

- Hugging Face full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Hugging Face full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Hugging Face [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Hugging Face [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Hugging Face [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- ALBERT [original training protocol](https://github.com/google-research/albert)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset,  
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)
from transformers import AlbertConfig, AutoConfig, DataCollatorForLanguageModeling

# DL
import torch
from gensim.models import Word2Vec
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModel,
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [3]:
transformers.__version__

'4.22.2'

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [6]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials ICTRP')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [7]:
sys.path.insert(0, path_to_src)

#### Constants

In [8]:
dataset_name = 'clinical-trials-ictrp'
final_dataset_name = 'clinical-trials-ictrp-tokenized-blocks'
base_model_name = "albert-base-v2"
final_model_name = "albert-small-ictrp-debug"

<a id="data"></a>

# 1. Dataset

[Table of contents](#TOC)

We build the training data as instances of the Hugging Face `datasets.Dataset` class, which is distinct from the generic `torch.utils.data.Dataset` class.

## 1.1 Load Clinical Trials corpus

[Table of contents](#TOC)

In [9]:
with open(os.path.join(path_to_data, '{}.txt'.format(dataset_name)), 'r', encoding = 'utf-8') as f:
    texts = [t.strip() for t in f.readlines()[:1500]]

In [10]:
dataset = Dataset.from_dict({'text': texts}, features = Features({'text': Value(dtype = 'string')}))

In [11]:
len(dataset)

1500

In [12]:
dataset[:3]

{'text': ["Capable of giving signed informed consent. Has received, been intolerant to, or been ineligible for all treatment options proven, to confer clinical benefit. Measurable disease per Response Evaluation Criteria in Solid Tumors (RECIST) 1.1. Eastern Cooperative Oncology Group (ECOG) Performance status (PS) of 0 or 1. Adequate organ function. Male individuals and female individuals of childbearing potential who engage in, heterosexual intercourse must agree to use methods of contraception. Female participants are eligible if they are not pregnant, not breastfeeding or not a, Woman of childbearing potential (WOCBP). Inclusion criterion for the dose-escalation: Individuals with histologically or, cytologically confirmed, advanced or metastatic solid tumors. Inclusion criterion for disease-specific combination expansion: Individuals with, histologically or cytologically confirmed Triple-negative breast cancer (TNBC), Non-small cell lung cancer (NSCLC), Head and neck squamous cell 

## 1.2 Load Clinical-Albert-small tokenizer

[Table of contents](#TOC)

In [13]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

## 1.3 Tokenize corpus

[Table of contents](#TOC)

In [14]:
# Return the `special_tokens_mask` as well: DataCollatorForLanguageModeling (used for training below)
# is more efficient when it receives it.
def tokenize_text(examples, tokenizer):
    # Remove empty lines
    examples['text'] = [
        t for t in examples['text'] if len(t) > 0 and not t.isspace()
    ]
    return tokenizer(examples["text"], return_special_tokens_mask = True)

In [15]:
tokenized_dataset = dataset.map(
    lambda examples: tokenize_text(examples, tokenizer), 
    batched = True, 
    remove_columns = ["text"],
)

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.77ba/s]


Unlike the raw text corpus, the tokenized data depends on the tokenizer and is therefore _model-specific_.

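_Note_: the argument `remove_columns = ["text"]` is required here. `tokenize_text` drops empty lines, so a batch can come back with fewer rows than it went in with, and all columns of the resulting dataset must have the same number of rows.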

In [16]:
len(tokenized_dataset[0]['input_ids'])

740

In [17]:
tokenized_dataset[0]

{'input_ids': [2,
  857,
  8,
  1377,
  343,
  90,
  78,
  7,
  72,
  166,
  6,
  172,
  2225,
  13,
  6,
  10,
  172,
  1311,
  20,
  159,
  56,
  3610,
  1085,
  6,
  13,
  8934,
  103,
  1432,
  7,
  483,
  42,
  196,
  515,
  473,
  96,
  21,
  881,
  586,
  5,
  12,
  2191,
  11,
  5,
  18,
  7,
  18,
  7,
  826,
  734,
  702,
  337,
  5,
  12,
  682,
  11,
  335,
  203,
  5,
  12,
  84,
  15,
  11,
  8,
  5,
  92,
  10,
  5,
  18,
  7,
  261,
  423,
  176,
  7,
  189,
  621,
  17,
  137,
  621,
  8,
  242,
  139,
  54,
  2721,
  21,
  6,
  1784,
  1257,
  69,
  309,
  13,
  82,
  402,
  8,
  188,
  7,
  137,
  209,
  59,
  222,
  5,
  27,
  39,
  348,
  59,
  60,
  123,
  6,
  60,
  482,
  10,
  60,
  5,
  14,
  6,
  738,
  8,
  242,
  139,
  5,
  12,
  1961,
  11,
  7,
  215,
  966,
  20,
  9,
  118,
  16,
  10937,
  38,
  621,
  19,
  531,
  10,
  6,
  1129,
  213,
  6,
  524,
  10,
  399,
  881,
  586,
  7,
  215,
  966,
  20,
  42,
  16,
  1049,
  737,
  3269,
  38,
  621,
  

## 1.4 Form blocks of constant length

[Table of contents](#TOC)


In [18]:
def group_texts(examples, block_size):
    # Concatenate all sequences of each column (the raw 'text' column, if still present, is skipped).
    keys = [k for k in examples.keys() if k != 'text']
    concatenated_examples = {k: list(chain(*examples[k])) for k in keys}
    total_length = len(concatenated_examples[keys[0]])
    
    # Drop the small remainder so that every block has exactly block_size tokens.
    # Padding the last block instead would also work if the model supports it; customize as needed.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
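
To make the blocking concrete, here is a minimal, library-free sketch of the same logic on a toy batch, with a hypothetical `block_size` of 4 instead of 512. The helper name `group_into_blocks` and the token ids are illustrative only.

```python
from itertools import chain

def group_into_blocks(examples, block_size):
    """Concatenate every column, then cut it into blocks of block_size items."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    blocks = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]
    return blocks

# Three toy "documents" of lengths 4, 6 and 3 (2 and 3 stand in for [CLS] and [SEP]).
toy_batch = {
    "input_ids": [[2, 11, 12, 3], [2, 21, 22, 23, 24, 3], [2, 31, 3]],
    "special_tokens_mask": [[1, 0, 0, 1], [1, 0, 0, 0, 0, 1], [1, 0, 1]],
}
toy_blocks = group_into_blocks(toy_batch, block_size=4)
print(toy_blocks["input_ids"])
# [[2, 11, 12, 3], [2, 21, 22, 23], [24, 3, 2, 31]]
```

Two things are visible on the toy data: blocks can straddle document boundaries, and the 13th token is dropped because it does not fill a block. Since `map` is called with `batched = True`, the function runs on each batch separately (1000 rows by default), so a remainder is dropped once per batch.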

In [19]:
block_size = 512

In [20]:
# run only once
mlm_dataset = tokenized_dataset.map(lambda examples: group_texts(examples, block_size), batched = True)

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.12ba/s]


In [22]:
len(mlm_dataset)

1189

In [23]:
print(mlm_dataset[0])

{'input_ids': [2, 857, 8, 1377, 343, 90, 78, 7, 72, 166, 6, 172, 2225, 13, 6, 10, 172, 1311, 20, 159, 56, 3610, 1085, 6, 13, 8934, 103, 1432, 7, 483, 42, 196, 515, 473, 96, 21, 881, 586, 5, 12, 2191, 11, 5, 18, 7, 18, 7, 826, 734, 702, 337, 5, 12, 682, 11, 335, 203, 5, 12, 84, 15, 11, 8, 5, 92, 10, 5, 18, 7, 261, 423, 176, 7, 189, 621, 17, 137, 621, 8, 242, 139, 54, 2721, 21, 6, 1784, 1257, 69, 309, 13, 82, 402, 8, 188, 7, 137, 209, 59, 222, 5, 27, 39, 348, 59, 60, 123, 6, 60, 482, 10, 60, 5, 14, 6, 738, 8, 242, 139, 5, 12, 1961, 11, 7, 215, 966, 20, 9, 118, 16, 10937, 38, 621, 19, 531, 10, 6, 1129, 213, 6, 524, 10, 399, 881, 586, 7, 215, 966, 20, 42, 16, 1049, 737, 3269, 38, 621, 19, 6, 531, 10, 1129, 213, 3825, 16, 1761, 249, 114, 5, 12, 11975, 11, 6, 122, 16, 1998, 177, 380, 114, 5, 12, 4487, 11, 6, 764, 17, 900, 601, 177, 224, 5, 12, 12998, 11, 6, 10, 6, 1335, 7, 215, 966, 20, 9, 2168, 16, 66, 30, 14, 3269, 38, 621, 19, 531, 6, 10, 1129, 213, 1226, 7, 691, 17, 112, 5, 39, 22, 14, 1

In [24]:
print(tokenizer.decode(mlm_dataset[0]["input_ids"]), tokenizer.decode(mlm_dataset[0]["labels"]))

[CLS] capable of giving signed informed consent. has received, been intolerant to, or been ineligible for all treatment options proven, to confer clinical benefit. measurable disease per response evaluation criteria in solid tumors (recist) 1.1. eastern cooperative oncology group (ecog) performance status (ps) of 0 or 1. adequate organ function. male individuals and female individuals of childbearing potential who engage in, heterosexual intercourse must agree to use methods of contraception. female participants are eligible if they are not pregnant, not breastfeeding or not a, woman of childbearing potential (wocbp). inclusion criterion for the dose-escalation: individuals with histologically or, cytologically confirmed, advanced or metastatic solid tumors. inclusion criterion for disease-specific combination expansion: individuals with, histologically or cytologically confirmed triple-negative breast cancer (tnbc), non-small cell lung cancer (nsclc), head and neck squamous cell carci

<a id="albert"></a>

# 2. ALBERT-small training

[Table of contents](#TOC)


In [25]:
batch_size = 16

args = TrainingArguments(
    os.path.join(path_to_save, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 1e-4,
    num_train_epochs = 5,
    warmup_steps = 0,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)
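
As a sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes the 1189 blocks built in section 1.4, a single device, and that the last incomplete batch is kept (the `Trainer` default, `dataloader_drop_last = False`). The helper name is hypothetical.

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size, num_epochs,
                             gradient_accumulation_steps=1, num_devices=1):
    # Batches seen in one pass over the data, keeping the last partial batch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update every `gradient_accumulation_steps` batches.
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=1189, per_device_batch_size=16, num_epochs=5)
print(steps)  # 375
```

With `logging_steps = 100`, this gives three logged loss values per run (steps 100, 200 and 300).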

# Test 1

[Table of contents](#TOC)

In [24]:
# a smaller ALBERT config (base-size values are kept in the trailing comments)
config = AlbertConfig(
    attention_probs_dropout_prob = 0,
    pad_token_id = 0,
    bos_token_id = 2,
    eos_token_id = 3,
    classifier_dropout_prob = 0.1,
    down_scale_factor = 1,
    embedding_size = 128,
    gap_size = 0,
    hidden_act = 'gelu_new',
    hidden_dropout_prob = 0,
    hidden_size = 512, # 768,
    initializer_range = 0.02,
    inner_group_num = 1,
    intermediate_size = 2048, # 3072,
    layer_norm_eps = 1e-12,
    max_position_embeddings = 512,
    model_type = 'albert',
    net_structure_type = 0,
    num_attention_heads = 8, # 12
    num_hidden_groups = 1,
    num_hidden_layers = 8, # 12
    num_memory_blocks = 0,
    position_embedding_type = 'absolute',
    transformers_version = '4.22.2',
    type_vocab_size = 2,
    vocab_size = 15000, # 30000,
    tie_word_embeddings = True,
    
)
model = AutoModelForMaskedLM.from_config(config)

In [25]:
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

5285400 5285400


In [30]:
print(model.albert.embeddings.word_embeddings._parameters['weight'][1][:10])

print(model.predictions.decoder._parameters['weight'][1][:10])

tensor([-0.0016,  0.0184,  0.0004, -0.0014,  0.0032,  0.0114,  0.0409, -0.0202,
         0.0127, -0.0184], grad_fn=<SliceBackward0>)
tensor([-0.0016,  0.0184,  0.0004, -0.0014,  0.0032,  0.0114,  0.0409, -0.0202,
         0.0127, -0.0184], grad_fn=<SliceBackward0>)


In [135]:
for param in model.albert.embeddings.parameters():
    param.requires_grad = False
    
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

5285400 3299352
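
Both counts can be reproduced by hand from the configuration. The breakdown below is a sketch that assumes the usual `AlbertForMaskedLM` layout: factorized embeddings, a single layer shared across all 8 hidden layers, no pooler, and an MLM head whose decoder weight is tied to the word embeddings. Variable names are illustrative.

```python
vocab_size, embedding_size = 15000, 128
hidden_size, intermediate_size = 512, 2048
max_positions, type_vocab_size = 512, 2

def linear(n_in, n_out):
    # weight matrix + bias vector
    return n_in * n_out + n_out

def layer_norm(n):
    # scale + shift
    return 2 * n

embeddings = (
    vocab_size * embedding_size         # word embeddings
    + max_positions * embedding_size    # position embeddings
    + type_vocab_size * embedding_size  # token type embeddings
    + layer_norm(embedding_size)
)
mapping_in = linear(embedding_size, hidden_size)
shared_layer = (
    4 * linear(hidden_size, hidden_size)      # query, key, value, attention output
    + layer_norm(hidden_size)                 # attention LayerNorm
    + linear(hidden_size, intermediate_size)  # feed-forward, first projection
    + linear(intermediate_size, hidden_size)  # feed-forward, second projection
    + layer_norm(hidden_size)                 # full-layer LayerNorm
)
mlm_head = (
    linear(hidden_size, embedding_size)  # dense projection back to embedding size
    + layer_norm(embedding_size)
    + vocab_size                         # output bias; the decoder weight is tied, so not counted again
)

total = embeddings + mapping_in + shared_layer + mlm_head
print(total, total - embeddings)  # 5285400 3299352
```

Because the layer is shared, `num_hidden_layers` does not appear in the count: stacking more layers adds compute but no parameters.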


In [143]:
for param in model.predictions.decoder.parameters():
    print(param)
    print(param.requires_grad)
    
print(model.num_parameters(), sum(p.numel() for p in model.parameters() if p.requires_grad))

Parameter containing:
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-1.8205e-02, -2.9258e-02, -1.0633e-02,  ..., -1.6441e-02,
          2.4761e-02, -2.2937e-02],
        [ 1.4343e-02,  9.3004e-03,  1.5389e-02,  ..., -4.4360e-03,
         -1.3656e-02,  1.2855e-02],
        ...,
        [-2.6421e-02,  1.0300e-04, -1.3071e-02,  ..., -1.7174e-02,
          3.8985e-03,  1.7478e-02],
        [-1.0843e-02,  1.5037e-02,  3.3565e-02,  ...,  1.8532e-05,
          1.3063e-02,  9.7028e-03],
        [-2.0104e-03,  7.3787e-03, -2.8310e-02,  ...,  5.6448e-04,
         -2.5573e-03,  1.7190e-02]])
False
Parameter containing:
tensor([0., 0., 0.,  ..., 0., 0., 0.], requires_grad=True)
True
5285400 3299352


In [133]:
model

AlbertForMaskedLM(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(15000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=512, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=512, out_features=512, bias=True)
                (key): Linear(in_features=512, out_features=512, bias=True)
                (value): Linear(in_features=512, out_features=512, bias=True)
  

In [126]:
help(AutoModelForMaskedLM.from_pretrained)

Help on method from_pretrained in module transformers.models.auto.auto_factory:

from_pretrained(*model_args, **kwargs) method of builtins.type instance
    Instantiate one of the model classes of the library (with a masked language modeling head) from a pretrained model.
    
    The model class to instantiate is selected based on the `model_type` property of the config object (either
    passed as an argument or loaded from `pretrained_model_name_or_path` if possible), or when it's missing, by
    falling back to using pattern matching on `pretrained_model_name_or_path`:
    
        - **albert** -- [`AlbertForMaskedLM`] (ALBERT model)
        - **bart** -- [`BartForConditionalGeneration`] (BART model)
        - **bert** -- [`BertForMaskedLM`] (BERT model)
        - **big_bird** -- [`BigBirdForMaskedLM`] (BigBird model)
        - **camembert** -- [`CamembertForMaskedLM`] (CamemBERT model)
        - **convbert** -- [`ConvBertForMaskedLM`] (ConvBERT model)
        - **data2vec-text** -

In [19]:
model_test_1 = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))
model_test_1 = model_test_1.to(device)#.train()

print(model_test_1.num_parameters(), sum(p.numel() for p in model_test_1.parameters() if p.requires_grad))

7205400 7205400


In [111]:
help(model_test_1.albert.embeddings.word_embeddings.from_pretrained)

Help on Embedding in module torch.nn.modules.sparse object:

class Embedding(torch.nn.modules.module.Module)
 |  Embedding(num_embeddings: int, embedding_dim: int, padding_idx: Union[int, NoneType] = None, max_norm: Union[float, NoneType] = None, norm_type: float = 2.0, scale_grad_by_freq: bool = False, sparse: bool = False, _weight: Union[torch.Tensor, NoneType] = None, device=None, dtype=None) -> None
 |  
 |  A simple lookup table that stores embeddings of a fixed dictionary and size.
 |  
 |  This module is often used to store word embeddings and retrieve them using indices.
 |  The input to the module is a list of indices, and the output is the corresponding
 |  word embeddings.
 |  
 |  Args:
 |      num_embeddings (int): size of the dictionary of embeddings
 |      embedding_dim (int): the size of each embedding vector
 |      padding_idx (int, optional): If specified, the entries at :attr:`padding_idx` do not contribute to the gradient;
 |                                   ther

In [104]:
model_test_1

AlbertForMaskedLM(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(15000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=512, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=512, out_features=512, bias=True)
                (key): Linear(in_features=512, out_features=512, bias=True)
                (value): Linear(in_features=512, out_features=512, bias=True)
  

In [23]:
print(model_test_1.albert.embeddings.word_embeddings._parameters['weight'][0][:10])

print(model_test_1.predictions.decoder._parameters['weight'][0][:10])

tensor([ 0.0045,  0.0063, -0.0051,  0.0001, -0.0014,  0.0069, -0.0052, -0.0015,
        -0.0007, -0.0018], device='cuda:0', grad_fn=<SliceBackward0>)
tensor([-0.2928, -0.2450, -0.2708, -0.2755, -0.3077, -0.2549, -0.1111, -0.2621,
        -0.0125,  0.0154], device='cuda:0', grad_fn=<SliceBackward0>)


In [21]:
model_test_1.albert.embeddings.word_embeddings._parameters['weight'].shape, model_test_1.predictions.decoder._parameters['weight'].shape

(torch.Size([15000, 128]), torch.Size([15000, 128]))
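
The two printed rows differ, so in this loaded checkpoint the decoder weight is not tied to the word embeddings, which is consistent with the extra 7205400 - 5285400 = 1920000 = 15000 x 128 parameters. Rather than comparing printed slices, tying can be checked directly by testing whether both modules hold the same underlying tensor. The toy modules below are stand-ins, not the ALBERT model, and `shares_storage` is a hypothetical helper.

```python
import torch
from torch import nn

def shares_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    """True when both tensors start at the same memory address."""
    return a.data_ptr() == b.data_ptr()

embedding = nn.Embedding(10, 4)

# Tied decoder: reuses the embedding matrix as its weight.
tied_decoder = nn.Linear(4, 10, bias=False)
tied_decoder.weight = embedding.weight

# Untied decoder: same shape, independent parameters.
untied_decoder = nn.Linear(4, 10, bias=False)

print(shares_storage(embedding.weight, tied_decoder.weight))    # True
print(shares_storage(embedding.weight, untied_decoder.weight))  # False
```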

In [106]:
model_test_1.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.05029847100377083,
 0.04985176399350166,
 0.04812698811292648,
 0.05405382812023163,
 0.05249904468655586,
 1.7227307558059692,
 1.839365005493164,
 1.9135942459106445,
 2.096674919128418,
 1.9929710626602173,
 1.8200165033340454,
 1.8425917625427246,
 2.1484005451202393,
 2.278388261795044,
 1.9871857166290283,
 2.1129376888275146,
 2.4063568115234375,
 1.9516832828521729,
 2.530545473098755,
 2.0679008960723877,
 2.5778214931488037,
 2.2181642055511475,
 2.1268270015716553,
 2.284655809402466,
 2.182380199432373,
 2.623713254928589,
 2.613213539123535,
 2.0911061763763428,
 2.2124714851379395,
 1.9270639419555664,
 2.1363606452941895,
 2.244401216506958,
 2.4619646072387695,
 2.2495040893554688,
 2.8109073638916016,
 2.169473886489868,
 2.06711483001709,
 2.4892654418945312,
 2.5276334285736084,
 3.243286609649658,
 3.1745095252990723,
 2.8597164154052734,
 2.6179184913635254,
 2.607361078262329,
 2.1849875450134277,
 2.2061665058135986,
 1.842813491821289,
 2.382831573486328,
 2.

In [107]:
wv = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'w2v', 'sgram')).wv

# reindex the rows of the word2vec table so that row i holds the vector of tokenizer id i
base_id2w = {v: k for k, v in tokenizer.get_vocab().items()}
reindexing = [wv.key_to_index[base_id2w[i]] for i in range(len(base_id2w))]

token_embeddings = wv.vectors[reindexing] # wv.get_normed_vectors()[reindexing]

# padding_idx must be the id of the padding token (not the padding token *type* id)
model_test_1.albert.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(
    torch.FloatTensor(token_embeddings), 
    padding_idx = tokenizer.pad_token_id,
)

print(model_test_1.num_parameters(), sum(p.numel() for p in model_test_1.parameters() if p.requires_grad))

7205400 5285400
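
The trainable count drops by 1920000 = 15000 x 128, the size of the word embedding table. That is consistent with `torch.nn.Embedding.from_pretrained` freezing the weights by default (`freeze = True`). A minimal sketch, with a random toy table standing in for the word2vec vectors:

```python
import torch
from torch import nn

toy_vectors = torch.randn(10, 4)  # stand-in for the reindexed word2vec table

# Default behaviour: the table is loaded and excluded from training.
frozen = nn.Embedding.from_pretrained(toy_vectors, padding_idx=0)

# Same table, but kept trainable.
trainable = nn.Embedding.from_pretrained(toy_vectors, freeze=False, padding_idx=0)

print(frozen.weight.requires_grad, trainable.weight.requires_grad)  # False True
```

Passing `freeze = False` would keep the pretrained vectors as an initialization while still letting training update them.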


In [108]:
model_test_1

AlbertForMaskedLM(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(15000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=512, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=512, out_features=512, bias=True)
                (key): Linear(in_features=512, out_features=512, bias=True)
                (value): Linear(in_features=512, out_features=512, bias=True)
  

In [98]:
model_test_1.albert.embeddings.word_embeddings._parameters['weight'][0][:10]

tensor([ 0.0045,  0.0063, -0.0051,  0.0001, -0.0014,  0.0069, -0.0052, -0.0015,
        -0.0007, -0.0018])

In [86]:
model_test_1.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.05029847100377083,
 0.04985176399350166,
 0.04812698811292648,
 0.05405383184552193,
 0.05249904468655586,
 1.7227306365966797,
 1.839365005493164,
 1.913594126701355,
 2.096674919128418,
 1.9929710626602173,
 1.8200165033340454,
 1.8425917625427246,
 2.1484007835388184,
 2.278388261795044,
 1.9871857166290283,
 2.1129376888275146,
 2.4063568115234375,
 1.9516832828521729,
 2.530545473098755,
 2.0679008960723877,
 2.5778214931488037,
 2.2181644439697266,
 2.1268270015716553,
 2.284655809402466,
 2.182379961013794,
 2.6237130165100098,
 2.6132137775421143,
 2.0911061763763428,
 2.2124714851379395,
 1.927064061164856,
 2.1363604068756104,
 2.244401216506958,
 2.4619646072387695,
 2.2495038509368896,
 2.8109073638916016,
 2.169473886489868,
 2.06711483001709,
 2.4892654418945312,
 2.5276334285736084,
 3.243286609649658,
 3.1745095252990723,
 2.8597164154052734,
 2.6179184913635254,
 2.607361078262329,
 2.1849875450134277,
 2.2061665058135986,
 1.8428136110305786,
 2.382831573486328,
 2

## 2.3 Model training

[Table of contents](#TOC)



In [87]:
trainer = Trainer(
    model_test_1,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Step,Training Loss
100,8.6834
200,3.6522
300,2.9058




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=4.609824584960937, metrics={'train_runtime': 173.6627, 'train_samples_per_second': 34.233, 'train_steps_per_second': 2.159, 'total_flos': 95325909688320.0, 'train_loss': 4.609824584960937, 'epoch': 5.0})
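
For reference, the masking applied by `DataCollatorForLanguageModeling` with `mlm_probability = 0.15` follows the BERT recipe: each non-special token is selected with probability 0.15; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Labels are set to -100 at unselected positions so that the loss ignores them. The following is a plain-Python sketch of that recipe, not the library's implementation; the function name and the mask token id (4) are assumptions.

```python
import random

def mask_tokens(input_ids, special_tokens_mask, mask_token_id, vocab_size,
                mlm_probability=0.15, rng=None):
    rng = rng or random.Random()
    masked_ids = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = position ignored by the loss
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens are never selected; others are selected with mlm_probability.
        if is_special or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        draw = rng.random()
        if draw < 0.8:
            masked_ids[i] = mask_token_id              # 80%: mask token
        elif draw < 0.9:
            masked_ids[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked_ids, labels

rng = random.Random(0)
input_ids = [2] + list(range(10, 30)) + [3]  # 2 and 3 stand in for [CLS] and [SEP]
special_tokens_mask = [1] + [0] * 20 + [1]
masked_ids, labels = mask_tokens(
    input_ids, special_tokens_mask, mask_token_id=4, vocab_size=15000, rng=rng
)
print(masked_ids)
print(labels)
```

Because the selection is redrawn each time a batch is collated, the same block is masked differently from one epoch to the next.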

In [88]:
model_test_1 = model_test_1.to('cpu')

In [89]:
model_test_1.save_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-1'))

Configuration saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-1\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-1\pytorch_model.bin


In [90]:
model_test_1_dev = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-1'))
model_test_1_dev = model_test_1_dev.to(device)

loading configuration file C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-1\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\saves\\MLM\\albert-small-ictrp-debug\\model-test-1",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "num_hidden_groups": 1,
  "num_hidden_layers": 8,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",

In [91]:
model_test_1_dev.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.05029847100377083,
 0.04985176399350166,
 0.04812698811292648,
 0.05405382812023163,
 0.05249904468655586,
 1.7227307558059692,
 1.839365005493164,
 1.9135942459106445,
 2.096674919128418,
 1.9929710626602173,
 1.8200165033340454,
 1.8425917625427246,
 2.1484005451202393,
 2.278388261795044,
 1.9871857166290283,
 2.1129376888275146,
 2.4063568115234375,
 1.9516832828521729,
 2.530545473098755,
 2.0679008960723877,
 2.5778214931488037,
 2.2181642055511475,
 2.1268270015716553,
 2.284655809402466,
 2.182380199432373,
 2.623713254928589,
 2.613213539123535,
 2.0911061763763428,
 2.2124714851379395,
 1.9270639419555664,
 2.1363606452941895,
 2.244401216506958,
 2.4619646072387695,
 2.2495040893554688,
 2.8109073638916016,
 2.169473886489868,
 2.06711483001709,
 2.4892654418945312,
 2.5276334285736084,
 3.243286609649658,
 3.1745095252990723,
 2.8597164154052734,
 2.6179184913635254,
 2.607361078262329,
 2.1849875450134277,
 2.2061665058135986,
 1.842813491821289,
 2.382831573486328,
 2.

In [92]:
trainer = Trainer(
    model_test_1_dev,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Step,Training Loss
100,2.6736
200,2.2381


KeyboardInterrupt: 

In [None]:
model_test_1_dev.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

# Test 2

In [81]:
model_test_2 = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))
model_test_2 = model_test_2.to(device)#.train()

loading configuration file C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\saves\\MLM\\albert-small-ictrp-debug\\model",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "num_hidden_groups": 1,
  "num_hidden_layers": 8,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transforme

In [82]:
model_test_2.albert.embeddings.word_embeddings._parameters['weight'].norm(dim = -1).tolist()

[0.05029847100377083,
 0.04985176399350166,
 0.04812698811292648,
 0.05405382812023163,
 0.05249904468655586,
 1.7227307558059692,
 1.839365005493164,
 1.9135942459106445,
 2.096674919128418,
 1.9929710626602173,
 1.8200165033340454,
 1.8425917625427246,
 2.1484005451202393,
 2.278388261795044,
 1.9871857166290283,
 2.1129376888275146,
 2.4063568115234375,
 1.9516832828521729,
 2.530545473098755,
 2.0679008960723877,
 2.5778214931488037,
 2.2181642055511475,
 2.1268270015716553,
 2.284655809402466,
 2.182380199432373,
 2.623713254928589,
 2.613213539123535,
 2.0911061763763428,
 2.2124714851379395,
 1.9270639419555664,
 2.1363606452941895,
 2.244401216506958,
 2.4619646072387695,
 2.2495040893554688,
 2.8109073638916016,
 2.169473886489868,
 2.06711483001709,
 2.4892654418945312,
 2.5276334285736084,
 3.243286609649658,
 3.1745095252990723,
 2.8597164154052734,
 2.6179184913635254,
 2.607361078262329,
 2.1849875450134277,
 2.2061665058135986,
 1.842813491821289,
 2.382831573486328,
 2.

In [51]:
trainer = Trainer(
    model_test_2,
    args,
    tokenizer = tokenizer,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Step,Training Loss
100,8.689
200,3.6654
300,2.9168




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=4.620098510742188, metrics={'train_runtime': 178.0556, 'train_samples_per_second': 33.388, 'train_steps_per_second': 2.106, 'total_flos': 60260872888320.0, 'train_loss': 4.620098510742188, 'epoch': 5.0})

In [54]:
model_test_2 = model_test_2.to('cpu')

model_test_2.save_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-2'))

Configuration saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-2\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-2\pytorch_model.bin


In [55]:
model_test_2_dev = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-2'))
model_test_2_dev = model_test_2_dev.to(device)

loading configuration file C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-2\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\saves\\MLM\\albert-small-ictrp-debug\\model-test-2",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "num_hidden_groups": 1,
  "num_hidden_layers": 8,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",

In [57]:
trainer = Trainer(
    model_test_2_dev,
    args,
    tokenizer = tokenizer,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


| Step | Training Loss |
|------|---------------|
| 100  | 2.5146        |
| 200  | 2.1424        |
| 300  | 2.062         |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=2.201224405924479, metrics={'train_runtime': 180.6695, 'train_samples_per_second': 32.905, 'train_steps_per_second': 2.076, 'total_flos': 60260872888320.0, 'train_loss': 2.201224405924479, 'epoch': 5.0})
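
To compare the two runs above on a more interpretable scale, the mean training loss can be converted into a perplexity-like score by exponentiation. This is a rough sketch: it assumes the reported `training_loss` is a mean cross-entropy in nats over masked tokens. It is computed on training batches, so it is not a held-out evaluation metric. The helper name `loss_to_perplexity` is hypothetical.

```python
import math

def loss_to_perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy (in nats) into a perplexity-like score."""
    return math.exp(mean_cross_entropy)

# training_loss values copied from the two TrainOutput lines above
first_run = loss_to_perplexity(4.620098510742188)
second_run = loss_to_perplexity(2.201224405924479)

print(f"first run:  {first_run:.2f}")
print(f"second run: {second_run:.2f}")
```

A lower value means the model spreads less probability mass over wrong candidates for each masked token. The second run starts from the weights saved after the first one, which explains its lower average.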

# Test 3

In [116]:
model_test_3 = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

# pretrained Word2Vec keyed vectors
wv = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'w2v', 'sgram')).wv

# reindex rows in embedding table, so that row i holds the vector of the token with tokenizer id i
base_id2w = {v: k for k, v in tokenizer.get_vocab().items()}
reindexing = [wv.key_to_index[base_id2w[i]] for i in range(len(base_id2w))]

token_embeddings = wv.vectors[reindexing] # wv.get_normed_vectors()[reindexing]
token_embeddings.shape

# load word embeddings in encoder, but untie it from MLM head as a byproduct !
# note : from_pretrained freezes the weights by default (freeze = True)
model_test_3.albert.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(
    torch.FloatTensor(token_embeddings), 
    padding_idx = tokenizer.pad_token_id, # id of the padding token, not its token type id
)

# total number of parameters vs number of trainable parameters
print(model_test_3.num_parameters(), sum(p.numel() for p in model_test_3.parameters() if p.requires_grad))

A = model_test_3.albert.embeddings.word_embeddings.weight[0][:10]


# re-tie word embedding with MLM head, and unfreeze it from training, by saving and reloading the model
model_test_3.save_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-3'))
model_test_3 = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-3'))
model_test_3 = model_test_3.to(device)#.train()

# check that the loaded embeddings survived the save / reload round trip
B = model_test_3.albert.embeddings.word_embeddings.weight[0][:10]
print(A)
print(B)


print(model_test_3.num_parameters(), sum(p.numel() for p in model_test_3.parameters() if p.requires_grad))

loading configuration file C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\saves\\MLM\\albert-small-ictrp-debug\\model",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "num_hidden_groups": 1,
  "num_hidden_layers": 8,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transforme

7205400 5285400
tensor([ 0.0045,  0.0063, -0.0051,  0.0001, -0.0014,  0.0069, -0.0052, -0.0015,
        -0.0007, -0.0018])
tensor([ 0.0045,  0.0063, -0.0051,  0.0001, -0.0014,  0.0069, -0.0052, -0.0015,
        -0.0007, -0.0018], device='cuda:0', grad_fn=<SliceBackward0>)
5285400 5285400
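
The reindexing step in the cell above aligns the Word2Vec rows with the tokenizer ids. The sketch below reproduces that logic on toy data, using plain Python lists rather than the actual `tokenizer` and `wv` objects. All values and the helper name `reorder_rows` are hypothetical and only serve as illustration. It also adds an explicit check for tokens missing from the Word2Vec vocabulary, which the original cell would surface as a bare `KeyError`.

```python
# Toy stand-ins (hypothetical values) for tokenizer.get_vocab(),
# wv.key_to_index and wv.vectors
tokenizer_vocab = {"<pad>": 0, "trial": 1, "dose": 2}
w2v_key_to_index = {"dose": 0, "<pad>": 1, "trial": 2}
w2v_vectors = [[0.2, 0.2], [0.0, 0.0], [0.1, 0.1]]

def reorder_rows(vocab, key_to_index, vectors):
    """Return vectors reordered so that row i belongs to the token with id i."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    missing = [tok for tok in vocab if tok not in key_to_index]
    if missing:
        raise KeyError(f"tokens without a pretrained vector: {missing}")
    # Assumes tokenizer ids are contiguous and start at 0
    return [vectors[key_to_index[id_to_token[i]]] for i in range(len(id_to_token))]

aligned = reorder_rows(tokenizer_vocab, w2v_key_to_index, w2v_vectors)
print(aligned)  # [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
```

Without this reordering, each token id would pick up the vector of whichever word happens to sit at the same row in the Word2Vec table.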


In [117]:
trainer = Trainer(
    model_test_3,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


| Step | Training Loss |
|------|---------------|
| 100  | 8.689         |
| 200  | 3.6654        |
| 300  | 2.9168        |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=4.620098510742188, metrics={'train_runtime': 178.6394, 'train_samples_per_second': 33.279, 'train_steps_per_second': 2.099, 'total_flos': 60260872888320.0, 'train_loss': 4.620098510742188, 'epoch': 5.0})

In [118]:
model_test_3 = model_test_3.to('cpu')

model_test_3.save_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-3-dev'))

Configuration saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-3-dev\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-3-dev\pytorch_model.bin


In [119]:
model_test_3_dev = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model-test-3-dev'))
model_test_3_dev = model_test_3_dev.to(device)

loading configuration file C:\Users\jb\Desktop\NLP\perso - Transformers for NLP\saves\MLM\albert-small-ictrp-debug\model-test-3-dev\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\perso - Transformers for NLP\\saves\\MLM\\albert-small-ictrp-debug\\model-test-3-dev",
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 2048,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 8,
  "num_hidden_groups": 1,
  "num_hidden_layers": 8,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "f

In [120]:
trainer = Trainer(
    model_test_3_dev,
    args,
    tokenizer = tokenizer,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1189
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 375


| Step | Training Loss |
|------|---------------|
| 100  | 2.5146        |
| 200  | 2.1424        |
| 300  | 2.062         |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=2.201224405924479, metrics={'train_runtime': 177.7452, 'train_samples_per_second': 33.447, 'train_steps_per_second': 2.11, 'total_flos': 60260872888320.0, 'train_loss': 2.201224405924479, 'epoch': 5.0})

[Table of content](#TOC)