<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Named Entity Recognition
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Encoder pretraining using simple Discriminator protocol
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 


  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Dataset](#data) <br>
2. [Albert finetuning](#albert) <br>
3. [Inference](#inference) <br>


#### Content

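This notebook implements the idea developed in [Frustratingly Simple Pretraining Alternatives to Masked Language Modeling, 2021](https://arxiv.org/pdf/2109.01819.pdf).<br>
The encoder is trained as a token-level discriminator that labels each token as shuffled, randomly replaced, or original. It is a simplified version of the ELECTRA pretraining objective in which corrupted tokens come from shuffling and uniform random sampling rather than from a generator network.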
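
To make the objective concrete, here is a minimal, framework-free sketch of the corruption step on a toy sequence. The function `corrupt`, its arguments and the toy vocabulary are hypothetical names introduced for illustration only. The sketch also permutes the "Shuffle" tokens among themselves, which is a simplifying assumption and not necessarily what the collator defined in section 2.2 does.

```python
import random

LABELS = ["Shuffle", "Random", "Original"]

def corrupt(tokens, vocab, weights=(1, 1, 8), seed=0):
    """Toy version of the shuffle / random / original token detection task."""
    rng = random.Random(seed)
    # Draw one label per token according to the given relative weights.
    labels = rng.choices(range(len(LABELS)), weights=weights, k=len(tokens))
    corrupted = list(tokens)

    # Tokens labelled "Shuffle" are permuted among themselves (assumption of this sketch).
    shuffle_pos = [i for i, lab in enumerate(labels) if lab == 0]
    shuffled = [tokens[i] for i in shuffle_pos]
    rng.shuffle(shuffled)
    for i, tok in zip(shuffle_pos, shuffled):
        corrupted[i] = tok

    # Tokens labelled "Random" are replaced by a token drawn uniformly from the vocabulary.
    for i, lab in enumerate(labels):
        if lab == 1:
            corrupted[i] = rng.choice(vocab)

    # Tokens labelled "Original" are left untouched.
    return corrupted, [LABELS[lab] for lab in labels]

words = "history of malignancy within 5 years before screening".split()
toy_vocab = ["asthma", "dose", "visit", "baseline", "infection"]
corrupted, labels = corrupt(words, toy_vocab)
for original, new, label in zip(words, corrupted, labels):
    print(f"{original:12s} -> {new:12s} {label}")
```

The model then only has to predict the label column from the corrupted sequence, which is a 3-class token classification problem.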



#### Reference

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb) on token classification
- Huggingface [course](https://huggingface.co/course/chapter7/2?fw=tf) on token classification
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py) on token classification

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import random
import copy
import string

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset, 
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)
from transformers import DataCollatorForTokenClassification

# DL
import torch
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForTokenClassification, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [3]:
transformers.__version__

'4.22.2'

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [6]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials CTTI')
path_to_save_mlm = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_save_ner = os.path.join(path_to_repo, 'saves', 'NER')
path_to_src  = os.path.join(path_to_repo, 'src')

In [7]:
sys.path.insert(0, path_to_src)

In [8]:
from nlptools.ner.preprocessing import tokenize_and_align_categories, create_labels
from nlptools.ner.metrics import compute_metrics
from nlptools.ner.postprocessing import parse_trf_ner_output, remove_entity_overlaps, correct_entity_boundaries
from nlptools.ner.visualization import render_ner_as_html

#### Constants

In [9]:
dataset_name = 'clinical-trials-ctti'
base_model_name = "albert-small-clinical-trials"
final_model_name = "albert-small-clinical-trials-randomshuffle"

<a id="data"></a>

# 1. Clinical Trial corpus dataset

[Table of content](#TOC)


In [10]:
lm_dataset = load_from_disk(os.path.join(path_to_data, 'tmp', dataset_name))

In [11]:
lm_dataset[0]

{'input_ids': [2,
  34,
  30,
  16,
  129,
  6,
  621,
  7,
  2773,
  1772,
  5,
  31,
  17,
  37,
  39,
  17,
  1903,
  202,
  2829,
  811,
  17,
  9,
  5,
  82,
  42,
  110,
  5,
  9,
  85,
  278,
  358,
  10,
  13,
  201,
  879,
  5,
  9,
  725,
  14,
  6,
  118,
  7,
  4009,
  358,
  200,
  515,
  13,
  1085,
  2315,
  3955,
  587,
  2153,
  17,
  9,
  227,
  9,
  48,
  67,
  3,
  2,
  34,
  281,
  24,
  586,
  13,
  175,
  194,
  255,
  12,
  186,
  156,
  7,
  6,
  2122,
  3839,
  3097,
  10,
  5,
  31,
  17,
  37,
  39,
  17,
  1903,
  10,
  40,
  6,
  983,
  155,
  53,
  9,
  1636,
  22,
  36,
  1288,
  17,
  15,
  438,
  22,
  9,
  23,
  114,
  1628,
  3673,
  14,
  29,
  19,
  2315,
  3955,
  587,
  2153,
  17,
  9,
  11,
  6,
  1927,
  121,
  24,
  944,
  2668,
  7,
  117,
  154,
  8,
  12,
  16,
  2672,
  5,
  9,
  1760,
  228,
  1121,
  1931,
  467,
  11,
  6,
  631,
  7,
  154,
  5,
  17,
  24,
  13,
  675,
  6,
  621,
  7,
  5,
  31,
  17,
  37,
  39,
  17,
  1903,
  193

<a id="albert"></a>

# 2. ALBERT finetuning

[Table of content](#TOC)

## 2.1 ALBERT model for Token Classification

[Table of content](#TOC)

In [12]:
class_labels = ['Shuffle', 'Random', 'Original']
class_labels = ClassLabel(names = class_labels)

label2id = class_labels._str2int
id2label = {i: l for l, i in label2id.items()}
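
The cell above reads the private attribute `_str2int` of `ClassLabel`. As a hedged alternative, the same two mappings can be built with plain dictionaries, which avoids depending on a private attribute. The variable name `class_names` is introduced here for illustration.

```python
# Class names in the same order as in the cell above.
class_names = ["Shuffle", "Random", "Original"]

# Map each class name to its integer id, and back.
label2id = {name: idx for idx, name in enumerate(class_names)}
id2label = {idx: name for name, idx in label2id.items()}

print(label2id)
print(id2label)
```

The resulting ids agree with the `id2label` entry of the saved model configuration shown in section 3.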

In [13]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save_mlm, base_model_name, 'tokenizer'))

In [14]:
model = AutoModelForTokenClassification.from_pretrained(
    os.path.join(path_to_save_mlm, base_model_name, 'model'),
    num_labels = len(class_labels.names),
    label2id = label2id,
    id2label = id2label,
).to(device)

Some weights of the model checkpoint at C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\albert-small-clinical-trials\model were not used when initializing AlbertForTokenClassification: ['predictions.bias', 'predictions.LayerNorm.bias', 'predictions.dense.bias', 'predictions.dense.weight', 'predictions.LayerNorm.weight', 'predictions.decoder.bias', 'predictions.decoder.weight']
- This IS expected if you are initializing AlbertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForTokenClassification were not initialized from the model checkpoint at C:\Users\jb\Desk

In [15]:
model.num_parameters()

2531203

## 2.2 Data collator for NER-based encoder pretraining

[Table of content](#TOC)

In [16]:
import random
import math
import warnings
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union


from transformers import PreTrainedTokenizerBase
from transformers.data.data_collator import PaddingStrategy, DataCollatorMixin, _torch_collate_batch

In [17]:
# see https://github.com/huggingface/transformers/blob/v4.23.1/src/transformers/data/data_collator.py#L264
# see https://github.com/gucci-j/light-transformer-emnlp2021/blob/dfa2fbfaf9293c39e40843c6905fac0f5c992d81/src/model/data_collator.py#L89

@dataclass
class CustomDataCollatorForTokenClassification(DataCollatorMixin):
    """
    Data collator used for the shuffle / random / original token detection task. Inputs are dynamically padded
    to the maximum length of a batch if they are not all of the same length.
    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        task_proportions (`tuple`, *optional*, defaults to `(1, 1, 8)`):
            Relative weights of the (shuffle, random, keep) token groups. They are normalized to sum to 1.
        padding (`bool`, `str` or [`PaddingStrategy`], *optional*, defaults to `True`):
            Padding strategy passed to `tokenizer.pad`.
        max_length (`int`, *optional*):
            Maximum length passed to `tokenizer.pad`.
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
        label_pad_token_id (`int`, *optional*, defaults to -100):
            Label value that the loss ignores.
        return_tensors (`str`):
            The type of Tensor to return. Only "pt" is supported.
    """

    tokenizer: PreTrainedTokenizerBase
    task_proportions: tuple = (1, 1, 8)    # shuffle | random | keep
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"


    def __post_init__(self):
        # Normalize task_proportions so that its entries sum to 1
        self.task_proportions = tuple(abs(v) for v in self.task_proportions)
        tot = sum(self.task_proportions)
        if tot > 0:
            self.task_proportions = torch.tensor(tuple(v/tot for v in self.task_proportions))
            print(self.task_proportions)
        else:
            raise ValueError('"task_proportions" entries must have a positive sum')

        
    def tf_call(self, *args, **kwargs):
        raise NotImplementedError("This data collator is Pytorch-only")


    def numpy_call(self, *args, **kwargs):
        raise NotImplementedError("This data collator is Pytorch-only")
    

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            examples, 
            padding = self.padding,
            max_length = self.max_length,
            return_tensors = self.return_tensors, 
            pad_to_multiple_of = self.pad_to_multiple_of,
        )
        batch["input_ids"], batch["labels"] = self.create_labels(batch["input_ids"])
        return batch


    def create_labels(self, inputs):
        """
        Prepare inputs and labels for custom NER task for encoder pretraining.
        """
        
        # reconstruct the special_tokens_mask which is lost by the call of "tokenizer.pad"
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens = True) 
            for val in inputs.tolist()
        ]
        special_tokens_mask = torch.tensor(special_tokens_mask, dtype = torch.bool)
        
        # We partition tokens into [shuffle | randomize | keep] subgroups
        # through random sampling according to proportions given in self.task_proportions
        labels = torch.multinomial(self.task_proportions, math.prod(inputs.shape), replacement = True).reshape(inputs.shape)
        
        # we force special tokens to be kept, i.e. no shuffling or random replacement
        labels[special_tokens_mask] = 2

        # we shuffle part of tokens: each selected position receives the token found at a
        # permuted position of the same row (a single column permutation is shared by all rows)
        perm_matrix = (labels == 0).bool()
        perm_tokens = inputs[:, torch.randperm(inputs.size()[1])]
        inputs[perm_matrix] = perm_tokens[perm_matrix]

        # we replace part of tokens by random ones
        rand_matrix = (labels == 1).bool()
        rand_tokens = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[rand_matrix] = rand_tokens[rand_matrix]
        
        # we keep part of tokens as-is
        # keep_matrix = (labels == 2).bool()
        # inputs[keep_matrix] = inputs[keep_matrix]
        
        # Exclude special tokens from the loss computation
        labels[special_tokens_mask] = self.label_pad_token_id
        
        return (inputs, labels)

## 2.3 Finetuning

[Table of content](#TOC)

In [18]:
batch_size = 16

In [19]:
model = model.train()

In [20]:
args = TrainingArguments(
    os.path.join(path_to_save_ner, '_checkpoints'),
    evaluation_strategy = "no",
    learning_rate = 5e-4,
    num_train_epochs = 2,
    warmup_steps = 2000,
    gradient_accumulation_steps = 1,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    seed = 42,
    data_seed = 23,
)

In [21]:
trainer = Trainer(
    model,
    args,
    data_collator = CustomDataCollatorForTokenClassification(tokenizer, task_proportions = (1, 1, 8)),
    train_dataset = lm_dataset,
)

tensor([0.1000, 0.1000, 0.8000])


In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForTokenClassification.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 421077
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 52636
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
100,0.6629
200,0.4085
300,0.3701
400,0.3523
500,0.341
600,0.3334
700,0.3272
800,0.3214
900,0.3192
1000,0.3158
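
As a rough reference point for the losses logged above, one can compute the cross-entropy of a classifier that ignores its input and always outputs the label proportions (0.1, 0.1, 0.8). This is a sketch under the assumption that the loss is the mean cross-entropy in nats over non-special tokens and that the labels follow the sampling proportions exactly.

```python
import math

# Label proportions printed when the data collator is created.
proportions = (0.1, 0.1, 0.8)

# Cross-entropy (in nats) of predicting the class priors for every token.
prior_baseline = -sum(p * math.log(p) for p in proportions)
print(f"prior-only baseline loss: {prior_baseline:.3f}")  # about 0.639
```

The value logged at step 100 (0.6629) is close to this baseline, and the later values fall well below it, which suggests the model is using the token context rather than only the class frequencies.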




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=52636, training_loss=0.18639850055694002, metrics={'train_runtime': 17393.4407, 'train_samples_per_second': 48.418, 'train_steps_per_second': 3.026, 'total_flos': 4722515482503168.0, 'train_loss': 0.18639850055694002, 'epoch': 2.0})
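
The figures in the training log can be cross-checked with a few lines of arithmetic. This sketch assumes a single device and no gradient accumulation, as the log indicates (total train batch size of 16). All numbers are copied from the outputs above.

```python
import math

num_examples = 421077      # "Num examples" in the training log
batch_size = 16            # per_device_train_batch_size
num_epochs = 2             # num_train_epochs
warmup_steps = 2000        # warmup_steps
train_runtime = 17393.4407 # seconds, from TrainOutput

# The last batch of each epoch is incomplete, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps)       # 26318 52636
print(f"{warmup_fraction:.1%}")           # share of steps spent in warmup
print(f"{samples_per_second:.3f}")        # compare with train_samples_per_second
```

The total matches the 52636 optimization steps reported by the trainer, and the warmup covers roughly 3.8% of training.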

In [23]:
model = model.to('cpu')

In [24]:
tokenizer.save_pretrained(os.path.join(path_to_save_ner, final_model_name, 'tokenizer'))
model.save_pretrained(os.path.join(path_to_save_ner, final_model_name, 'model'))

tokenizer config file saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\NER\albert-small-clinical-trials-randomshuffle\tokenizer\tokenizer_config.json
Special tokens file saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\NER\albert-small-clinical-trials-randomshuffle\tokenizer\special_tokens_map.json
Configuration saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\NER\albert-small-clinical-trials-randomshuffle\model\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\NER\albert-small-clinical-trials-randomshuffle\model\pytorch_model.bin


<a id="inference"></a>

# 3. Inference

[Table of content](#TOC)

In [30]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save_ner, final_model_name, 'tokenizer'))
model = AutoModelForTokenClassification.from_pretrained(os.path.join(path_to_save_ner, final_model_name, 'model'))

loading file spiece.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\NER\albert-small-clinical-trials-randomshuffle\model\config.json
Model config AlbertConfig {
  "_name_or_path": "C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\NER\\albert-small-clinical-trials-randomshuffle\\model",
  "architectures": [
    "AlbertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.05,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 384,
  "id2label": {
    "0": "Shuffle",
    "1": "Random",
    "2": "Original"
  },
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 1536,
  "label2id": {
    "Origin

In [31]:
ner = pipeline(
    task = 'ner', 
    model = model, 
    tokenizer = tokenizer,
    framework = 'pt',
    aggregation_strategy = 'simple',
)
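
With `aggregation_strategy = 'simple'`, the pipeline merges adjacent tokens that received the same label into one span. The sketch below illustrates that grouping on hand-written token predictions. The function `group_adjacent` is a hypothetical helper, not the library's implementation, and it assumes labels without `B-`/`I-` prefixes, as is the case here.

```python
from itertools import groupby

def group_adjacent(token_preds):
    """Merge consecutive token predictions that share the same label."""
    groups = []
    for label, run in groupby(token_preds, key=lambda p: p["entity"]):
        run = list(run)
        groups.append({
            "entity_group": label,
            "start": run[0]["start"],   # character offset of the first token
            "end": run[-1]["end"],      # character offset after the last token
            "score": sum(p["score"] for p in run) / len(run),  # mean token score
        })
    return groups

# Hand-written token-level predictions, for illustration only.
toy_preds = [
    {"entity": "Original", "score": 0.9, "start": 0, "end": 7},
    {"entity": "Original", "score": 0.7, "start": 8, "end": 10},
    {"entity": "Random", "score": 0.6, "start": 11, "end": 17},
    {"entity": "Original", "score": 0.8, "start": 18, "end": 24},
]
for group in group_adjacent(toy_preds):
    print(group)
```

The spans produced this way are what `parse_trf_ner_output` receives in the cells below.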

In [32]:
sent = 'Polyneuropathy of other causes, including but not limited to hereditary demyelinating neuropathies, neuropathies secondary to infection or systemic disease, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor neuropathy, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'

df_ents = parse_trf_ner_output(ner(sent))
df_ents = correct_entity_boundaries(sent, df_ents)
df_ents = remove_entity_overlaps(df_ents)
HTML(render_ner_as_html(sent, df_ents))

In [39]:
sent = 'Adults and adolescent patients with a physician diagnosis of asthma ' +\
    'for ≥12 months, based on the Global Initiative for Asthma (GINA) 2014 Guidelines ' +\
    'and the followingcriteria: A) Existing treatment with medium to high dose ICS ' +\
    '(≥250 mcg of fluticasone propionate twice daily or equipotent ICS daily dosage ' +\
    'to a maximum of 2000 mcg/day of fluticasone propionate or equivalent) in combination ' +\
    'with a second controller (eg, LABA, LTRA) for at least 3 months with a stable dose ≥1 ' +\
    'month prior to Visit 1.'

df_ents = parse_trf_ner_output(ner(sent))
df_ents = correct_entity_boundaries(sent, df_ents)
df_ents = remove_entity_overlaps(df_ents)
HTML(render_ner_as_html(sent, df_ents))

In [40]:
sent = 'Biologics treatment: Cell-depleting agents, eg. Rituximab Drug within 6 months before baseline or until lymphocyte count returns to normal, Other biologics: within 5 half-lives or 16 weeks prior baseline.'

df_ents = parse_trf_ner_output(ner(sent))
df_ents = correct_entity_boundaries(sent, df_ents)
df_ents = remove_entity_overlaps(df_ents)
HTML(render_ner_as_html(sent, df_ents))

In [41]:
sents = '''Active hepatitis or patients with positive HBsAg, or patients with positive HBcAb plus positive HBV DNA, or positive HCV antibody (confirmed with presence of HCV RNA if needed) at screening.

History of alcohol or drug abuse within 2 years of the screening visit.

History of HIV infection or positive HIV serology at screening.

History of malignancy within 5 years before screening, except completely treated cervix carcinoma, completely treated and resolved non-metastatic squamous or basal cell carcinoma of the skin.

Infection requires systemic antibiotics, antivirals, antiparasitics, antiprotozoals, antifungals treatment within 2 weeks before baseline, superficial skin infections within 1 week before baseline visit.

Initiation AD treatment with prescription moisturizers or moisturizers containing  additives such as ceramide, hyaluronic acid, urea, or filaggrin degradation products during the screening period.

Known or suspected history of immunosuppression, including history of invasive opportunistic infections, despite infection resolution, or unusually frequent, recurrent, or prolonged infections.
'''
df_list = []
html = ''
for sent in sents.split('\n'):
    if sent.strip():
        ents = parse_trf_ner_output(ner(sent))
        ents = correct_entity_boundaries(sent, ents)
        ents = remove_entity_overlaps(ents)
        df_list.append(ents)
        html += render_ner_as_html(sent, ents)

df_ents = pd.concat(df_list, ignore_index = True)
HTML(html)

[Table of content](#TOC)