<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Sequence Labelling - MLM
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Finetuning on Clinical Trials (Albert-base)
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table Of Content

1. [Dataset](#data) <br>
2. [ALBERT finetuning](#albert) <br>
3. [Inference](#inference) <br>



#### Reference

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Huggingface [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- Albert [original training protocol](https://github.com/google-research/albert)

In [11]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [12]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset, 
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
)
from transformers import AlbertConfig, AutoConfig, DataCollatorForLanguageModeling

# DL
import torch
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML

#### Transformers settings

In [13]:
transformers.__version__

'4.22.2'

In [14]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [15]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [16]:
path_to_repo = os.path.dirname(os.path.dirname(os.getcwd()))
path_to_data = os.path.join(path_to_repo, 'datasets')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [17]:
sys.path.insert(0, path_to_src)

#### Constants

In [18]:
dataset_name = 'clinical-trials-ctti'
base_model_name = "albert-base-v2"
final_model_name = "albert-base-clinical-trials"

<a id="data"></a>

# 1. Dataset

[Table of content](#TOC)

We generate a collection of instances of the `datasets.Dataset` class. 

Note that these are different from the fairly generic `torch.utils.data.Dataset` class. 

## 1.1 Load Clinical Trials corpus

[Table of content](#TOC)

In [19]:
df_trials = pd.read_csv(os.path.join(path_to_data, '{}.tsv'.format(dataset_name)), sep = "\t")
df_trials = df_trials.fillna('')


df_trials.head(10)

| | Id | Summary | Description | IE_criteria | Condition | Purpose | Intervention |
|---|---|---|---|---|---|---|---|
| 0 | NCT0000xxxx/NCT00000102.xml | This study will test the ability of extended r... | This protocol is designed to assess both acute... | diagnosed with Congenital Adrenal Hyperplasia ... | Congenital Adrenal Hyperplasia | Treatment | Nifedipine |
| 1 | NCT0000xxxx/NCT00000104.xml | Inner city children are at an increased risk f... | | | Lead Poisoning | | ERP measures of attention and memory |
| 2 | NCT0000xxxx/NCT00000105.xml | The purpose of this study is to learn how the ... | Patients will receive each vaccine once only c... | Patients must have a diagnosis of cancer of an... | Cancer | | Intracel KLH Vaccine |
| 3 | NCT0000xxxx/NCT00000106.xml | Recently a non-toxic system for whole body hyp... | | | Rheumatic Diseases | Treatment | Whole body hyperthermia unit |
| 4 | NCT0000xxxx/NCT00000107.xml | Adults with cyanotic congenital heart disease ... | | Resting blood pressure below 140/90 | Heart Defects, Congenital | | |
| 5 | NCT0000xxxx/NCT00000108.xml | The purpose of this research is to find out wh... | | Postmenopausal and preferably on hormone repla... | Cardiovascular Diseases | Prevention | Exercise |
| 6 | NCT0000xxxx/NCT00000110.xml | The purpose of this pilot investigation is to ... | | Healthy volunteers (developmental phase)\nHeal... | Obesity | Treatment | magnetic resonance spectroscopy |
| 7 | NCT0000xxxx/NCT00000111.xml | The purpose of this study is to see if we can ... | | Lack sufficient attached keratinized tissue at... | Mouth Diseases | Treatment | Oral mucosal graft |
| 8 | NCT0000xxxx/NCT00000112.xml | The prevalence of obesity in children is reach... | | Obesity: BM +/- 95% for age general good health | Obesity | | |
| 9 | NCT0000xxxx/NCT00000113.xml | To evaluate whether progressive addition lense... | Myopia (nearsightedness) is an important publi... | | Myopia | Treatment | Progressive Addition Lenses |


In [20]:
df_trials.shape

(430108, 7)

In [21]:
# texts = df_trials[['Summary', 'Description', 'IE_criteria']].values.tolist()
# texts = [t for ts in texts for t in ts if len(t.strip())>=50]
texts = df_trials.IE_criteria.tolist()
texts = [t for t in texts if len(t.strip())>=50]
len(texts)

325793

In [22]:
dataset = Dataset.from_dict(
    {'text': texts}, 
    features = Features({'text': Value(dtype = 'string')}),
)

In [23]:
dataset[:3]

{'text': ['diagnosed with Congenital Adrenal Hyperplasia (CAH)\nnormal ECG during baseline evaluation\nhistory of liver disease, or elevated liver function tests\nhistory of cardiovascular disease',
  'Patients must have a diagnosis of cancer of any histologic type.\nPatients must have a Karnofsky performance status great or equal to 70%.\nPatients must have an expected survival for at least four months.\nNormal healthy volunteers to serve as control for this study.\nThe occurrence of any type of neurologic symptoms to tetanus vaccine in th past.\nPatients with a history of seafood allergy are excluded from receiving KLH.',
  'Postmenopausal and preferably on hormone replacement therapy\nIn good general health\nHave a body mass index (BMI, weight in kg/height in m2) of between 25 and 40\nExercise less than 20 min/day two days a week']}

## 1.2 Build Albert-base tokenizer

[Table of content](#TOC)


In [25]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, base_model_name, 'tokenizer'))

In [26]:
texts[0], tokenizer.decode(tokenizer(dataset[0]["text"])['input_ids'])

('diagnosed with Congenital Adrenal Hyperplasia (CAH)\nnormal ECG during baseline evaluation\nhistory of liver disease, or elevated liver function tests\nhistory of cardiovascular disease',
 '[CLS] diagnosed with congenital adrenal hyperplasia (cah) normal ecg during baseline evaluation history of liver disease, or elevated liver function tests history of cardiovascular disease[SEP]')

In [27]:
tokenizer.save_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

('C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-base-clinical-trials\\tokenizer\\tokenizer_config.json',
 'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-base-clinical-trials\\tokenizer\\special_tokens_map.json',
 'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-base-clinical-trials\\tokenizer\\tokenizer.json')

## 1.3 Tokenize corpus

[Table of content](#TOC)


In [None]:
# We set `return_special_tokens_mask = True` because DataCollatorForLanguageModeling (see below) is more efficient when it receives the `special_tokens_mask`.

def tokenize_text(examples, tokenizer):
    # Remove empty lines
    examples['text'] = [
        t for t in examples['text'] if len(t) > 0 and not t.isspace()
    ]
    return tokenizer(examples["text"], return_special_tokens_mask = True)

In [None]:
tokenized_dataset = dataset.map(lambda examples: tokenize_text(examples, tokenizer), batched = True, remove_columns = ["text"])

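Unlike the raw text dataset, the tokenized dataset depends on the tokenizer and is therefore _model-specific_.

_Note_: the argument `remove_columns = ["text"]` is required here. Because `tokenize_text` drops empty lines, a batch can come back with fewer rows than it went in with, and keeping the original `text` column would leave the dataset columns with mismatched lengths.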
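The row-count mismatch behind this note can be reproduced without any library. The sketch below is only an illustration: `toy_tokenize` is a hypothetical stand-in that applies the same empty-line filter as `tokenize_text`, but uses a whitespace splitter with an on-the-fly vocabulary instead of the ALBERT tokenizer.

```python
def toy_tokenize(batch):
    """Stand-in for tokenize_text: same empty-line filter, toy whitespace tokenizer."""
    # Same filter as in tokenize_text: drop empty and whitespace-only entries.
    kept = [t for t in batch["text"] if len(t) > 0 and not t.isspace()]
    # Hypothetical vocabulary built on the fly (NOT the ALBERT vocabulary).
    vocab = {}
    input_ids = [
        [vocab.setdefault(word, len(vocab)) for word in t.lower().split()]
        for t in kept
    ]
    return {"input_ids": input_ids}


batch = {"text": ["history of liver disease", "", "   ", "normal ECG"]}
out = toy_tokenize(batch)

# The input batch has 4 rows, the output only 2: the original "text" column
# can no longer be aligned row by row with the new "input_ids" column.
print(len(batch["text"]), len(out["input_ids"]))
```

With the original column removed, the output batch defines the new rows on its own, so the two lengths no longer need to agree.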

In [None]:
print(tokenized_dataset[0])

## 1.4 Form blocks of constant length

[Table of content](#TOC)


In [None]:
# block_size = tokenizer.model_max_length
block_size = 512

In [None]:
def group_texts(examples, block_size):
    # Concatenate every tokenized field (input_ids, attention_mask, ...) across the batch.
    keys = [k for k in examples.keys() if k != 'text']
    concatenated_examples = {k: list(chain(*examples[k])) for k in keys}
    total_length = len(concatenated_examples[keys[0]])
    
    # Drop the small remainder at the end of the batch so that every block has exactly
    # block_size tokens. Padding the last block instead would be an alternative.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    
    # Split into chunks of block_size tokens.
    result = {
        k: [t[i : i+block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Labels start as a copy of the inputs; the MLM data collator overwrites them when masking.
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
mlm_dataset = tokenized_dataset.map(
    lambda examples: group_texts(examples, block_size),
    batched = True,
)
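To make the effect of the grouping step concrete, here is a minimal, library-free sketch of the same concatenate-then-chunk logic, run on a toy batch with `block_size = 4`. The function name `group_into_blocks` and the token ids are made up for the illustration.

```python
from itertools import chain


def group_into_blocks(examples, block_size):
    """Concatenate all sequences of a batch, then cut them into equal-sized blocks."""
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Keep only a whole number of blocks; the trailing remainder is dropped.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


# Three toy "documents" of 4, 3 and 6 tokens (13 tokens in total).
toy_batch = {"input_ids": [[2, 10, 11, 3], [2, 20, 3], [2, 30, 31, 32, 33, 3]]}
blocks = group_into_blocks(toy_batch, block_size=4)

for block in blocks["input_ids"]:
    print(block)
```

The 13 tokens yield three blocks of 4 tokens, and the last token is discarded. Note that blocks straddle document boundaries: the second block contains the end of one document and the start of the next.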

In [None]:
texts[0], texts[1]

In [None]:
tokenizer.decode(mlm_dataset[0]["input_ids"]), tokenizer.decode(mlm_dataset[0]["labels"])

In [None]:
print(mlm_dataset[0])

In [None]:
len(mlm_dataset)

<a id="albert"></a>

# 2. ALBERT-base finetuning

[Table of content](#TOC)

## 2.1 Build Albert-base model

[Table of content](#TOC)

In [None]:
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model')).to(device)

In [None]:
model.num_parameters()

## 2.2. Model finetuning

[Table of content](#TOC)

`albert-base-v2` training parameters, as provided in https://github.com/google-research/albert/blob/master/run_pretraining.py:
- max_predictions_per_seq = `20`
- train_batch_size = `4096`
- optimizer = `"lamb"`
- learning_rate = `0.00176`
- poly_power = `1.0`
- num_train_steps = `125000`
- num_warmup_steps = `3125`
- start_warmup_step = `0`
- iterations_per_loop = `1000`

The original optimizer is LAMB, which was designed for very large batch sizes (see the [Lamb paper](https://arxiv.org/pdf/1904.00962.pdf)). Here we instead use the default [AdamW](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.AdamW) optimizer (see the [AdamW paper](https://arxiv.org/pdf/1711.05101.pdf)) with [linear learning rate decay](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/optimizer_schedules#transformers.get_linear_schedule_with_warmup), as specified in the [Trainer class documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.optimizers).
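The shape of a linear schedule with warmup can be sketched in a few lines of plain Python. This is an illustrative reimplementation of the multiplier, not the library code itself. The peak learning rate (`5e-6`) and warmup length (`500`) are the values passed to `TrainingArguments` in this notebook, and the total step count is derived from the figures printed in the training log further down (219732 examples, batch size 32).

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


peak_lr = 5e-6
warmup_steps = 500
total_steps = math.ceil(219732 / 32)  # one epoch over the blocks reported in the log

for step in (0, 250, warmup_steps, 3000, total_steps):
    lr = peak_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.3e}")
```

The rate climbs linearly to its peak at step 500 and then decreases linearly to zero at the last step, so an interrupted run stops at a higher learning rate than a completed one would end with.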

In [None]:
batch_size = 8

In [None]:
model = model.train()

In [None]:
args = TrainingArguments(
    os.path.join(path_to_save, f"{dataset_name}-{base_model_name}-finetuned"),
    evaluation_strategy = "no",
    learning_rate = 5e-6, # >= 2e-5 makes the model change too fast, in eg 500 steps
    # max_steps = 5000,
    num_train_epochs = 1,
    warmup_steps = 500,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    save_strategy = 'no',
    logging_steps = 100,
    weight_decay = 5e-6,
    seed = 42,
)

In [None]:
trainer = Trainer(
    model,
    args,
    data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm_probability = 0.15),
    train_dataset = mlm_dataset,
)

Some remarks:

- The `data_collator` is the object that assembles elements of the training and evaluation datasets into batches. With `mlm_probability = 0.15`, it also selects each non-special token as a prediction target with probability 0.15.
- The `tokenizer` is passed to the data collator, which uses it to pad the inputs to the longest sequence of each batch and to look up the mask token. It is not passed to the `Trainer` here, so it is not saved along with the model; it was saved separately in section 1.2.
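The masking rule applied by `DataCollatorForLanguageModeling` can be sketched in plain Python: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise; unselected positions receive the label `-100` so that the loss ignores them. The function below is a simplified, single-sequence illustration. The name `mask_tokens` and the ids passed to it are placeholders, not values read from the ALBERT tokenizer.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_tokens(input_ids, special_tokens_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence."""
    rng = rng or random.Random()
    masked, labels = list(input_ids), []
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens and unselected tokens are not prediction targets.
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            masked[i] = mask_id                    # 80%: mask token
        elif draw < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: token left unchanged
    return masked, labels


# Placeholder ids: 2 and 3 play the role of [CLS] and [SEP], 4 of [MASK].
ids = [2] + list(range(100, 300)) + [3]
special = [1] + [0] * 200 + [1]
masked, labels = mask_tokens(ids, special, mask_id=4, vocab_size=30000,
                             rng=random.Random(0))

n_targets = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_targets} of {len(ids)} positions are prediction targets")
```

This is also why passing `special_tokens_mask` matters: the collator needs to know which positions must never be selected, and a precomputed mask spares it from recomputing that information for every batch.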

In [33]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `AlbertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 219732
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 6867
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 100 | 5.3373 |
| 200 | 3.8184 |
| 300 | 3.593 |
| 400 | 3.4373 |
| 500 | 3.3635 |
| 600 | 3.312 |
| 700 | 3.2351 |
| 800 | 3.2274 |
| 900 | 3.1704 |
| 1000 | 3.1655 |


KeyboardInterrupt: 

In [34]:
model = model.to('cpu')

In [35]:
# Save the model next to the tokenizer stored in section 1.2
model.save_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

Configuration saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-base\model\config.json
Model weights saved in C:\Users\jb\Desktop\NLP\Internal - Transformers for NLP\saves\MLM\clinical-trials-albert-base\model\pytorch_model.bin


<a id="inference"></a>

# 3. Inference

[Table of content](#TOC)

In [None]:
# Reload the tokenizer saved in section 1.2 and the model stored under the same directory
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
model = AutoModelForMaskedLM.from_pretrained(os.path.join(path_to_save, final_model_name, 'model'))

In [36]:
mlm = pipeline(
    task = 'fill-mask', 
    model = model, 
    tokenizer = tokenizer,
    framework = 'pt',
)

In [37]:
sent = 'Polyneuropathy of other causes, including but not limited to hereditary demyelinating neuropathies, neuropathies secondary to infection or systemic disease, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor neuropathy, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
sent = f'Polyneuropathy of other causes, including but not limited to {mlm.tokenizer.mask_token} demyelinating neuropathies,  {mlm.tokenizer.mask_token} secondary to infection or systemic {mlm.tokenizer.mask_token}, diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor {mlm.tokenizer.mask_token}, monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory CIDP and acquired demyelinating symmetric (DADS) neuropathy (also known as distal CIDP).'
mlm(sent, top_k = 5)

[[{'score': 0.399431973695755,
   'token': 15,
   'token_str': ',',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to, demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as distal cidp).[SEP]'},
  {'score': 0.030957819893956184,
   'token': 45,
   'token_str': ':',
   'sequence': '[CLS] polyneuropathy of other causes, including but not limited to: demyelinating neuropathies,[MASK] secondary to infection or systemic[MASK], diabetic neuropathy, drug- or toxin-induced neuropathies, multifocal motor[MASK], monoclonal gammopathy of uncertain significance, lumbosacral radiculoplexus neuropathy, pure sensory cidp and acquired demyelinating symmetric (dads) neuropathy (also known as dis

[Table of content](#TOC)