<a href="https://colab.research.google.com/github/georgianpartners/NLP-Domain-Adaptation/blob/master/notebooks/GuideToTransformersDomainAdaptation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# !pip install -U pip
# !pip install --extra-index https://test.pypi.org/simple/ transformers-domain-adaptation
# !pip install seqeval

# Guide to Transformers Domain Adaptation
This guide illustrates an end-to-end workflow of domain adaptation, where we domain-adapt a transformer model for biomedical NLP applications.

It showcases the two domain adaptation techniques we investigated in our research:
1. Data Selection
2. Vocabulary Augmentation

Following that, we demonstrate how such a domain-adapted Transformers model is compatible with 🤗 `transformers`'s training interface and how it outperforms an out-of-the-box (non-domain adapted) model.

These techniques are applied to BERT (`bert-base-uncased`), but the codebase is written to generalize to other classes of Transformers supported by HuggingFace.

### Caveats
For this guide, we use a much smaller subset (<0.05%) of the in-domain corpora due to memory and time constraints. 

## Constants
We first load some constants, including the appropriate model card and relevant paths to text corpora.

There are two types of corpora in the context of Domain Adaptation:

1. Fine-Tuning Corpus
> Given an NLP task (e.g. text classification, summarization, etc.), the text portion of this dataset is the fine-tuning corpus.

2. In-Domain Corpus
> This is an unsupervised text dataset that is used for domain pre-training. The text domain is the same as, if not broader than, the domain of the fine-tuning corpus.

In [3]:
model_card = 'bert-base-uncased'

# Domain-pre-training corpora
dpt_corpus_train = 'pubmed_subset_train.txt'
dpt_corpus_train_data_selected = 'pubmed_subset_train_data_selected.txt'
dpt_corpus_val = 'pubmed_subset_val.txt'

# Fine-tuning corpora
# If there are multiple downstream NLP tasks/corpora, you can concatenate those files together
ft_corpus_train = 'BC2GM_train.txt'
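
If there are several downstream corpora, the concatenation mentioned in the comment above could look like the following minimal sketch. The helper name `concatenate_corpora` and the file names in the usage note are hypothetical and not part of `transformers-domain-adaptation`; the sketch assumes one document per line, as in the files used in this guide.

```python
from pathlib import Path
from typing import Iterable


def concatenate_corpora(paths: Iterable[str], output_path: str) -> int:
    """Write all non-empty lines from `paths` to `output_path`; return the line count."""
    n_lines = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                # Skip blank lines so every output line is one document
                if line.strip():
                    out.write(line.strip() + "\n")
                    n_lines += 1
    return n_lines
```

For example, `concatenate_corpora(['task_a_train.txt', 'task_b_train.txt'], 'combined_ft_corpus.txt')` (hypothetical file names) would produce a single file that can then be assigned to `ft_corpus_train`.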

### Load model and tokenizer
Next we load the model and its corresponding tokenizer.

In [4]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(model_card)
tokenizer = AutoTokenizer.from_pretrained(model_card)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## Data Selection
Not all data in the in-domain corpora may be helpful or relevant during domain pre-training. At best, irrelevant documents do not degrade the performance of the domain-adapted model. At worst, the model regresses and loses valuable pre-trained information, a phenomenon known as catastrophic forgetting.

As such, we select documents from the in-domain corpus that are likely to be relevant for the downstream fine-tuning dataset(s), using a variety of similarity and diversity metrics designed by TODO: Cite.
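
To make the idea of similarity-based selection concrete, here is a minimal, self-contained sketch. It is not the implementation used by `DataSelector`: it assumes whitespace tokenization, a single Euclidean distance between term distributions, and no diversity metrics. All function names and the toy sentences are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def term_distribution(text: str) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated, lower-cased token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def euclidean_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Euclidean distance between two sparse term distributions."""
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in set(p) | set(q)))


def select_most_similar(reference_texts: List[str], candidates: List[str], keep: float) -> List[str]:
    """Keep the `keep` fraction of candidates closest to the reference corpus."""
    reference = term_distribution(" ".join(reference_texts))
    ranked = sorted(candidates, key=lambda doc: euclidean_distance(term_distribution(doc), reference))
    return ranked[: int(len(candidates) * keep)]


# Toy data: the "fine-tuning corpus" is biomedical, the candidates are mixed
fine_tuning_sample = ["the gene encodes a protein kinase", "mutation of the gene alters protein binding"]
in_domain_sample = [
    "the protein kinase gene was cloned",
    "stock markets fell sharply on monday",
    "the gene encodes a binding protein",
    "the football match ended in a draw",
]
print(select_most_similar(fine_tuning_sample, in_domain_sample, keep=0.5))
```

On this toy data, the two biomedical sentences are kept and the two off-topic sentences are dropped.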

In [5]:
from pathlib import Path

from transformers_domain_adaptation import DataSelector


selector = DataSelector(
    select=0.5,  # TODO Replace with `keep`
    tokenizer=tokenizer,
    similarity_metrics=['euclidean'],
    # similarity_metrics=[
    #     "jensen-shannon",
    #     "renyi",
    #     "cosine",
    #     "euclidean",
    #     "variational",
    #     "bhattacharyya",
    # ],
    diversity_metrics=[
        "num_token_types",
        "type_token_ratio",
        "entropy",
        "simpsons_index",
        "renyi_entropy",
    ],
)

In [6]:
# Load text data into memory
fine_tuning_texts = Path(ft_corpus_train).read_text().splitlines()
training_texts = Path(dpt_corpus_train).read_text().splitlines()

# Fit on fine-tuning corpus
selector.fit(fine_tuning_texts)

# Select relevant documents from in-domain training corpus
selected_corpus = selector.transform(training_texts)

# Save selected corpus to disk under `dpt_corpus_train_data_selected`
Path(dpt_corpus_train_data_selected).write_text('\n'.join(selected_corpus));

Token indices sequence length is longer than the specified maximum sequence length for this model (454407 > 512). Running this sequence through the model will result in indexing errors
  return np.true_divide(self.todense(), other)
computing similarity: 100%|██████████| 1/1 [00:01<00:00,  1.16s/metric]
computing diversity: 100%|██████████| 5/5 [00:02<00:00,  1.79metric/s]


Since we specified `select=0.5` in the `DataSelector`, the selected corpus should be half the size of the in-domain corpus, containing the top 50% most relevant documents.

In [7]:
len(training_texts), len(selected_corpus)

(10000, 5000)

In [8]:
selected_corpus[0]

'A 9-year-old female presented to neurology outpatient department of our hospital with complaints of recurrent generalized tonic-clonic seizures since birth and was being treated with anticonvulsants for the same. Patient also had complaints of giddiness and episodes of momentary loss of consciousness. There was history of twitching of left hemiface and eyelid during infancy, often associated with deviation of eyes to the left and groaning. The birth history was unremarkable. Family history revealed no known consanguinity. General examination revealed no dysmorphic features. Neurological examination revealed no cognitive deficits/signs to suggest cerebellar pathology. An electroencephalogram was done in view of her recurrent seizures, which was normal. Initial laboratory work-up was normal. The patient then underwent magnetic resonance imaging (MRI) brain, acquired with a 1.5-T unit (Siemens, Erlangen, Germany). MRI brain revealed hemihypertrophy of left cerebellar hemisphere with diso

## Vocabulary Augmentation
We can extend the existing vocabulary of the model to include domain-specific terminology. This allows the representations of such terminology to be learnt explicitly during domain pre-training.
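
As a rough intuition for what "new domain-specific terminology" means, the sketch below proposes the most frequent words in a corpus that are absent from an existing vocabulary. This is an illustrative assumption, not the algorithm used by `VocabAugmentor`; the function name, the toy vocabulary and the toy corpus are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def propose_new_tokens(texts: Iterable[str], existing_vocab: Set[str], n_new: int) -> List[str]:
    """Return the `n_new` most frequent lower-cased words absent from `existing_vocab`."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word.isalpha() and word not in existing_vocab
    )
    return [word for word, _ in counts.most_common(n_new)]


toy_vocab = {"the", "of", "was", "and", "gene", "by"}
toy_corpus = [
    "phosphorylation of the gene was induced by tyrosine kinase",
    "tyrosine phosphorylation and cdna cloning of the kinase gene",
    "phosphorylation was detected",
]
print(propose_new_tokens(toy_corpus, toy_vocab, n_new=3))
```

In this guide, the number of tokens to add is the gap between the target and current vocabulary sizes: 31,000 − 30,522 = 478.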

In [9]:
from transformers_domain_adaptation import VocabAugmentor

target_vocab_size = 31_000  # len(tokenizer) == 30_522

augmentor = VocabAugmentor(
    tokenizer=tokenizer, 
    cased=False, 
    target_vocab_size=target_vocab_size
)

# Obtain new domain-specific terminology based on the fine-tuning corpus
new_tokens = augmentor.get_new_tokens(ft_corpus_train)

In [10]:
print(new_tokens[:20])

['cdna', 'transcriptional', 'tyrosine', 'phosphorylation', 'kda', 'homology', 'enhancer', 'assays', 'exon', 'nucleotide', 'genomic', 'encodes', 'deletion', 'polymerase', 'nf', 'cloned', 'recombinant', 'putative', 'transcripts', 'homologous']


#### Update model and tokenizer with new vocab terminologies

In [11]:
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

Embedding(31000, 768)

## Domain Pre-Training
Domain pre-training is the third step in domain adaptation — we continue training Transformer models with the same pre-training procedure on the in-domain corpus.
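
For BERT, that pre-training procedure is masked language modelling (MLM), which this guide configures through `DataCollatorForLanguageModeling(mlm=True, mlm_probability=0.15)`. The sketch below illustrates the corruption scheme on plain string tokens: each token is selected with probability `mlm_probability`; of the selected tokens, 80% are replaced by the mask token, 10% by a random token and 10% are left unchanged. It is a simplified illustration, not the collator's code (which operates on tensors of token ids and skips special tokens); the function name and signature are assumptions.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    mlm_probability: float = 0.15,
    mask_token: str = "[MASK]",
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() >= mlm_probability:
            corrupted.append(token)
            labels.append(None)  # not selected, so ignored by the loss
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)  # 10%: keep the original token
    return corrupted, labels
```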

#### Create dataset

In [12]:
import itertools as it
from pathlib import Path
from typing import Sequence, Union, Generator

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

In [13]:
datasets = load_dataset(
    'text', 
    data_files={
        "train": dpt_corpus_train_data_selected, 
        "val": dpt_corpus_val
    }
)

tokenized_datasets = datasets.map(
    lambda examples: tokenizer(examples['text'], truncation=True, max_length=model.config.max_position_embeddings), 
    batched=True
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)


#### Instantiate TrainingArguments and Trainer

In [15]:
training_args = TrainingArguments(
    output_dir="./results/domain_pre_training",
    overwrite_output_dir=True,
    max_steps=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    logging_steps=50,
    seed=42,
    # fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    data_collator=data_collator,
    tokenizer=tokenizer,  # This tokenizer has new tokens
)

In [None]:
trainer.train()
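
One common way to read the validation loss reported during masked language model training is to convert it to perplexity, the exponential of the mean cross-entropy loss. The helper below is a small sketch of that conversion; the function name is hypothetical, and it assumes the loss is a mean cross-entropy in nats, such as the `eval_loss` entry of the dictionary returned by `Trainer.evaluate()`.

```python
import math


def perplexity(eval_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Intended use after training (requires the `trainer` defined above):
# metrics = trainer.evaluate()
# print(perplexity(metrics["eval_loss"]))
```

Lower perplexity on the in-domain validation corpus suggests the model is adapting to the domain.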



## Fine-Tuning for Specific Tasks
We can plug our domain-adapted model into any fine-tuning task supported by HuggingFace.

For this guide, we compare the performance of an out-of-the-box (OOB) model against that of a domain-adapted model on Named Entity Recognition (NER) using the BC2GM dataset, a popular biomedical benchmarking dataset.

Utility functions for NER preprocessing and evaluation are adapted from HuggingFace's [NER fine-tuning example notebook](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb).

#### Preprocess raw dataset to form NER dataset

In [None]:
from typing import List, NamedTuple
from functools import partial
from typing_extensions import Literal

import numpy as np
from datasets import Dataset, load_dataset, load_metric


class Example(NamedTuple):
    # One sentence: parallel lists of tokens and their labels
    token: List[str]
    label: List[str]


def load_ner_dataset(mode: Literal['train', 'val', 'test']):
    file = f"BC2GM_{mode}.tsv"
    examples = []
    with open(file) as f:
        token = []
        label = []
        for line in f:
            if line.strip() == "":
                # A blank line marks the end of a sentence
                if token:
                    examples.append(Example(token=token, label=label))
                token = []
                label = []
                continue
            t, l = line.strip().split("\t")
            token.append(t)
            label.append(l)
        # Keep the last sentence if the file does not end with a blank line
        if token:
            examples.append(Example(token=token, label=label))

    d = {'token': [ex.token for ex in examples], 'labels': [ex.label for ex in examples]}
    return Dataset.from_dict(d)


def tokenize_and_align_labels(examples, tokenizer):
    tokenized_inputs = tokenizer(examples["token"], truncation=True, is_split_into_words=True)
    label_to_id = dict(map(reversed, enumerate(label_list)))

    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            # For the other tokens in a word, we also use the word's label. (The original HuggingFace
            # example makes this configurable via a `label_all_tokens` flag; here it is always on.)
            else:
                label_ids.append(label_to_id[label[word_idx]])
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
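
The label alignment performed by `tokenize_and_align_labels` can be illustrated without a tokenizer. The sketch below takes the `word_ids` that a fast tokenizer would return for one sentence and maps word-level label ids to sub-token-level label ids. The helper name and the example tokenization are hypothetical; label ids follow the order `["O", "B", "I"]`.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]], word_labels: List[int], ignore_index: int = -100) -> List[int]:
    """Give every sub-token its word's label; special tokens get `ignore_index`."""
    return [ignore_index if word_id is None else word_labels[word_id] for word_id in word_ids]


# Hypothetical tokenization of ["BRCA1", "mutations"] as
# [CLS] br ##ca ##1 mutations [SEP], with label ids 1 ("B") and 0 ("O").
word_ids = [None, 0, 0, 0, 1, None]
print(align_labels(word_ids, word_labels=[1, 0]))  # [-100, 1, 1, 1, 0, -100]
```

Positions labelled `-100` are ignored by the loss and filtered out again in `compute_metrics`.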

In [None]:
label_list = ["O", "B", "I"]
metric = load_metric('seqeval')

train_dataset = load_ner_dataset('train')
val_dataset = load_ner_dataset('val')
test_dataset = load_ner_dataset('test')

#### Instantiate NER models
Here we instantiate three task-specific NER models for comparison:
1. `da_model`: A domain-adapted NER model we just trained in this guide
2. `da_full_corpus_model`: The same domain-adapted NER model except that it was trained on the full in-domain training corpus
3. `oob_model`: An out-of-the-box BERT NER model (not domain-adapted)

In [None]:
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification

best_checkpoint = './results/domain_pre_training/checkpoint-100'
da_model = AutoModelForTokenClassification.from_pretrained(best_checkpoint, num_labels=len(label_list))

da_full_corpus_model = AutoModelForTokenClassification.from_pretrained('./cached', num_labels=len(label_list))
full_corpus_tokenizer = AutoTokenizer.from_pretrained('./cached')

oob_tokenizer = AutoTokenizer.from_pretrained(model_card)
oob_model = AutoModelForTokenClassification.from_pretrained(model_card, num_labels=len(label_list))

#### Create datasets, TrainingArguments and Trainer for each model

In [None]:
from typing import Dict

from datasets import Dataset


def preprocess_datasets(tokenizer, **datasets) -> Dict[str, Dataset]:
    tokenize_ner = partial(tokenize_and_align_labels, tokenizer=tokenizer)
    return {k: ds.map(tokenize_ner, batched=True) for k, ds in datasets.items()}

######################
##### `da_model` #####
######################
da_datasets = preprocess_datasets(
    tokenizer, 
    train=train_dataset, 
    val=val_dataset, 
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/domain_adapted_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

da_trainer = Trainer(
    model=da_model,
    args=training_args,
    train_dataset=da_datasets['train'],
    eval_dataset=da_datasets['val'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,  # This tokenizer has new tokens
    compute_metrics=compute_metrics
)


##################################
##### `da_full_corpus_model` #####
##################################
da_full_corpus_datasets = preprocess_datasets(
    full_corpus_tokenizer, 
    train=train_dataset, 
    val=val_dataset, 
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/domain_adapted_full_corpus_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

da_full_corpus_trainer = Trainer(
    model=da_full_corpus_model,
    args=training_args,
    train_dataset=da_full_corpus_datasets['train'],
    eval_dataset=da_full_corpus_datasets['val'],
    data_collator=DataCollatorForTokenClassification(full_corpus_tokenizer),
    tokenizer=full_corpus_tokenizer,  # This tokenizer has new tokens
    compute_metrics=compute_metrics
)


#######################
##### `oob_model` #####
#######################
oob_datasets = preprocess_datasets(
    oob_tokenizer, 
    train=train_dataset, 
    val=val_dataset, 
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/out_of_the_box_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

oob_model_trainer = Trainer(
    model=oob_model,
    args=training_args,
    train_dataset=oob_datasets['train'],
    eval_dataset=oob_datasets['val'],
    data_collator=DataCollatorForTokenClassification(oob_tokenizer),
    tokenizer=oob_tokenizer,  # This is the original tokenizer (without domain-specific tokens)
    compute_metrics=compute_metrics
)
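
The three `TrainingArguments` above differ only in `output_dir`. One possible way to avoid the repetition is to build the shared keyword arguments once, as sketched below. The helper `fine_tuning_kwargs` is hypothetical; the values are copied from the cells above.

```python
from typing import Any, Dict


def fine_tuning_kwargs(output_dir: str, **overrides: Any) -> Dict[str, Any]:
    """Keyword arguments shared by the three fine-tuning runs in this guide."""
    kwargs: Dict[str, Any] = dict(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=2,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        evaluation_strategy="steps",
        save_steps=200,
        save_total_limit=2,
        logging_steps=100,
        seed=42,
        fp16=True,
        dataloader_num_workers=2,
        disable_tqdm=False,
    )
    # Allow individual runs to override any shared setting
    kwargs.update(overrides)
    return kwargs


# Intended use (requires `transformers`):
# training_args = TrainingArguments(**fine_tuning_kwargs("./results/out_of_the_box_fine_tuning"))
```

Keeping the hyperparameters in one place also makes it easier to check that the three models are compared under identical settings.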

#### Train and evaluate `da_model`

In [None]:
da_trainer.train()
da_trainer.evaluate(da_datasets['test'])  # Evaluate on the tokenized test split

#### Train and evaluate `da_full_corpus_model`

In [None]:
da_full_corpus_trainer.train()
da_full_corpus_trainer.evaluate(da_full_corpus_datasets['test'])  # Evaluate on the tokenized test split

#### Train and evaluate `oob_model`

In [None]:
oob_model_trainer.train()
oob_model_trainer.evaluate(oob_datasets['test'])  # Evaluate on the tokenized test split

#### Results
We see that, of the three models, `da_full_corpus_model` (which was domain-adapted on the entire in-domain training corpus) outperforms `oob_model`. In fact, `da_full_corpus_model` is one of many domain-adapted models we trained that outperform SOTA on BC2GM.

Also, `da_model` underperforms `oob_model`. This is to be expected, as `da_model` has minimal domain pre-training in this guide.

## Conclusion
In this guide, you have seen how to use `DataSelector` and `VocabAugmentor` to domain-adapt a transformers model, by performing Data Selection and Vocabulary Augmentation respectively.

You have also seen that they are compatible with all of HuggingFace's products: `transformers`, `tokenizers` and `datasets`.

Finally, it is shown that a model domain-adapted on the full in-domain corpus performs better than an out-of-the-box model.