# **Training RoBERTa using Hugging Face Transformers**

Lecturer: Hieu Tran
<br>

This notebook is used to pre-train Transformer-based models using [Huggingface](https://huggingface.co/transformers/) on your own dataset.

With the AutoClasses functionality, we can reuse the code on a large number of Transformer-based models!

This notebook is designed to:

* **Train a Transformer-based model from scratch on a text corpus.** This notebook also covers training a tokenizer for your own language/domain. The pre-training objective used in this notebook is Masked Language Modeling (MLM).

* **Use an already pre-trained Transformer-based model and fine-tune it on your own dataset.** The fine-tuning loss used in this notebook is Cross-entropy.


<br>

## **What should I know for this notebook?**

Since I am using PyTorch to fine-tune Transformer-based models, any knowledge of PyTorch is very useful.

Knowing a little about the [transformers](https://github.com/huggingface/transformers) and [datasets](https://github.com/huggingface/datasets) libraries helps too.

In this notebook, **I am using a labeled dataset to fine-tune Transformer-based models**, while pre-training needs only raw text and no labels.

<br>

## **How to use this notebook?**


This notebook pulls the dataset directly from the [Hugging Face Hub](https://huggingface.co/datasets). You can also use this notebook with other datasets from the Hub or with local files (though you will need to modify a few lines to make it work in that case).

All parameters that can be changed are under the **Parameters Setup** section. Each parameter is nicely commented and structured to be as intuitive as possible.

<br>


## **What transformers models work with this notebook?**

This notebook is tested with BERT and RoBERTa. However, it should work with other types of transformers as well. If you want to use another architecture, also check the `tokenizer` and modify the appropriate parameters to match that architecture.


<br>

## **Dataset**

This notebook covers pre-training a Transformer-based model on a dataset. I will use the Vietnamese-translated PubMed dataset [ViPubMed](https://huggingface.co/datasets/VietAI/vi_pubmed) and the Vietnamese Medical Natural Language Inference dataset [ViMedNLI](https://github.com/vietai/ViPubmed/tree/main/data/vi_mednli), both released by the VietAI Research team.

**Why this dataset?** Due to its size and simplicity, I believe that this dataset is easy to understand/use for classification and we can have more time to play with the model instead of waiting for the model training.
<br>



## **Install Dependencies**

* **[transformers](https://github.com/huggingface/transformers)** library needs to be installed to use and train Transformer-based models from HuggingFace.

* **[datasets](https://github.com/huggingface/datasets)** library needs to be installed to pull public datasets from HuggingFace Hub.

* **[tokenizers](https://github.com/huggingface/tokenizers)** library needs to be installed to train a new tokenizer.

* **[sentencepiece](https://github.com/google/sentencepiece)** library needs to be installed to train a new tokenizer.

* **[accelerate](https://github.com/huggingface/accelerate)** library needs to be installed to train transformers effectively and efficiently.


In [None]:
# Install the latest version.
!pip install transformers datasets tokenizers sentencepiece accelerate

## **Imports**

Import all needed libraries for this notebook.

Declare basic parameters used for this notebook:

* `set_seed(69)` - Always good to set a fixed seed for reproducibility.

* `device` - Look for gpu to use. I will use cpu by default if no gpu found.

In [None]:
import io
import os
import math
import torch
import warnings
from itertools import chain
from dataclasses import dataclass
from tqdm.notebook import tqdm
from collections.abc import Mapping
from datasets import load_dataset
from torch.utils.data.dataset import Dataset
from transformers.data.data_collator import DataCollatorMixin
from transformers import (CONFIG_MAPPING,
                          MODEL_FOR_MASKED_LM_MAPPING,
                          PreTrainedTokenizer,
                          TrainingArguments,
                          AutoConfig,
                          AutoTokenizer,
                          AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          DataCollatorForWholeWordMask,
                          PretrainedConfig,
                          Trainer,
                          set_seed)

# Set seed for reproducibility.
set_seed(69)

# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


## **Helper Functions**

All classes and functions that will be used in this notebook are kept under this section to help maintain a clean look of the notebook:

**ModelDataArguments**

This class follows a format similar to the [transformers](https://github.com/huggingface/transformers) library. The main difference is that I combined multiple types of arguments into one class and used rules to make sure the arguments are set correctly. Here are the argument details (they are also mentioned in the class documentation):

* `model_type`:
  *Type of model used: bert, roberta.
  More details [here](https://huggingface.co/transformers/pretrained_models.html).*

* `config_name`:
  *Config of model used: bert, roberta.
  More details [here](https://huggingface.co/transformers/pretrained_models.html).*

* `tokenizer_name`:
  *Tokenizer used to process data for training the model.
  It usually has the same name as `model_name_or_path`: bert-base-cased,
  roberta-base etc.*

* `model_name_or_path`:
  *Path to existing transformers model or name of
  transformer model to be used: bert-base-cased, roberta-base etc.
  More details [here](https://huggingface.co/transformers/pretrained_models.html).*

* `dataset_name`:
  *The name of the dataset to use (via the datasets library).*

* `dataset_config_name`:
  *The configuration name of the dataset to use (via the datasets library).*

* `train_file`:
  *The input training data file.*

* `validation_file`:
  *An optional input evaluation data file to evaluate the performance on.*

* `test_file`:
  *An optional input test data file to evaluate the performance on.*

* `cache_dir`:
  *Path to cache files. It helps to save time when re-running code.*

* `preprocessing_num_workers`:
  *The number of processes to use for the preprocessing.*

* `line_by_line`:
  *Whether distinct lines of text in the dataset are to be handled as distinct sequences.*

* `whole_word_mask`:
  *Used as flag to determine if we decide to use whole word masking or not. Whole word masking means that whole words will be masked during training instead of tokens which can be chunks of words.*

* `mlm_probability`:
  *Used when training masked language models. Needs to have `mlm=True`.
  It represents the probability of masking tokens when training model.*

* `max_seq_length`:
  *The maximum total input sequence length after tokenization. Sequences longer than this will be truncated.*

* `pad_to_max_length`:
  *Whether to pad all samples to `max_seq_length`. If False, will pad the samples dynamically when batching to the maximum length in the batch.*

* `overwrite_cache`:
  *If there are any cached files, overwrite them.*

* `validation_split_percentage`:
  *The percentage of the train set used as validation set in case there's no validation split.*
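
To make the difference between the two padding modes behind `pad_to_max_length` concrete, here is a minimal pure-Python sketch. The helper `pad_batch` and the pad id `1` are illustrative assumptions, not part of `transformers`; in the notebook the actual padding is done by the tokenizer or the data collator.

```python
def pad_batch(batch, pad_id, max_seq_length=None):
    """Pad lists of token ids to a common length.

    With `max_seq_length`, every sample is cut and padded to that length
    (the `pad_to_max_length=True` case). Without it, samples are padded to
    the longest sample in the batch (dynamic padding).
    """
    if max_seq_length is None:
        target = max(len(ids) for ids in batch)
    else:
        target = max_seq_length
    padded = []
    for ids in batch:
        # Naive truncation: a real tokenizer also keeps the closing special token.
        kept = ids[:target]
        padded.append(kept + [pad_id] * (target - len(kept)))
    return padded


batch = [[0, 11, 12, 2], [0, 21, 2]]
fixed = pad_batch(batch, pad_id=1, max_seq_length=8)  # every row has length 8
dynamic = pad_batch(batch, pad_id=1)                  # every row has length 4
```

Dynamic padding usually wastes fewer pad tokens per batch, while fixed-length padding gives every batch the same shape.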



In [None]:
class ModelDataArguments(object):
  r"""Arguments pertaining to which model/config/tokenizer/data we are going to fine-tune, or train.

  Even though all arguments are optional, at least some of them
  need to be given a value.

  Raises:

        ValueError: If `CONFIG_MAPPING` is not loaded in global variables.

        ValueError: If `model_type` is not present in `CONFIG_MAPPING.keys()`.

        ValueError: If `model_type`, `config_name` and
          `model_name_or_path` variables are all `None`. At least one of them
          needs to be set.

        warnings: If `config_name` and `model_name_or_path` are both
          `None`, the model will be trained from scratch.

  """

  def __init__(self,
               model_type=None,
               config_name=None,
               tokenizer_name=None,
               model_name_or_path=None,
               dataset_name=None,
               dataset_config_name=None,
               train_file=None,
               validation_file=None,
               test_file=None,
               cache_dir=None,
               preprocessing_num_workers=None,
               line_by_line=False,
               whole_word_mask=False,
               mlm_probability=0.15,
               max_seq_length=-1,
               pad_to_max_length=False,
               overwrite_cache=False,
               validation_split_percentage=5,
               ):

    # Make sure CONFIG_MAPPING is imported from transformers module.
    if 'CONFIG_MAPPING' not in globals():
      raise ValueError('Could not find CONFIG_MAPPING imported! ' \
                       'Make sure to import it from `transformers` module!')

    # Make sure model_type is valid.
    if (model_type is not None) and (model_type not in CONFIG_MAPPING.keys()):
      raise ValueError('Invalid `model_type`! Use one of the following: %s' %
                       (str(list(CONFIG_MAPPING.keys()))))

    # Make sure that model_type, config_name and model_name_or_path
    # variables are not all `None`.
    if not any([model_type, config_name, model_name_or_path]):
      raise ValueError('`model_type`, `config_name` and `model_name_or_path` ' \
                       'cannot all be `None`! You need to set at least one ' \
                       'of them!')

    # Check if a new model will be loaded from scratch.
    if not any([config_name, model_name_or_path]):
      # Setup warning to show pretty. This is an overkill
      warnings.formatwarning = lambda message,category,*args,**kwargs: \
                               '%s: %s\n' % (category.__name__, message)
      # Display warning.
      warnings.warn('You are planning to train a model from scratch! 🙀')

    # Set all data related arguments.
    self.dataset_name = dataset_name
    self.dataset_config_name = dataset_config_name
    self.train_file = train_file
    self.validation_file = validation_file
    self.test_file = test_file
    self.preprocessing_num_workers = preprocessing_num_workers
    self.line_by_line = line_by_line
    self.whole_word_mask = whole_word_mask
    self.mlm_probability = mlm_probability
    self.max_seq_length = max_seq_length
    self.pad_to_max_length = pad_to_max_length
    self.overwrite_cache = overwrite_cache
    self.validation_split_percentage = validation_split_percentage

    # Set all model and tokenizer arguments.
    self.model_type = model_type
    self.config_name = config_name
    self.tokenizer_name = tokenizer_name
    self.model_name_or_path = model_name_or_path
    self.cache_dir = cache_dir


### Model Config

The model configuration defines a Transformer-based language model architecture with a configurable number of layers, hidden size, embedding size, etc. These settings largely determine the number of model parameters and therefore affect the model's capacity to learn and perform well on downstream tasks.
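
As a rough illustration of how these settings drive model size, the sketch below counts the parameters of a BERT-style encoder from its configuration values. It is an approximation under stated assumptions: the function name is hypothetical, the pooler and the language-modeling head are ignored, and the "small" configuration is an arbitrary example rather than the one used later in this notebook.

```python
def count_encoder_parameters(vocab_size, hidden_size, num_layers,
                             intermediate_size, max_positions,
                             type_vocab_size=2):
    """Approximate parameter count of a BERT-style encoder (no pooler, no LM head)."""
    h, i = hidden_size, intermediate_size
    # Token, position and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + 2 * h
    # Self-attention: query, key, value and output projections with biases,
    # followed by a LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Feed-forward block: two linear layers with biases, followed by a LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + 2 * h
    return embeddings + num_layers * (attention + feed_forward)


# Commonly cited BERT-base sizes (assumed here for illustration).
base = count_encoder_parameters(30522, 768, 12, 3072, 512)
# A hypothetical small configuration with a 4,000-token vocabulary.
small = count_encoder_parameters(4000, 256, 4, 1024, 128)
print(f"base: {base:,} parameters, small: {small:,} parameters")
```

The hidden size enters quadratically through the attention and feed-forward matrices, so shrinking it reduces the model much faster than removing layers does.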

_function_ **get_model_config(args: ModelDataArguments, override_config)**

Uses the `ModelDataArguments` to return the model configuration. Here are the argument details:

* `args`: *Model and data configuration arguments needed to perform pretraining.*

* `override_config`: *Dictionary of configuration values that override the loaded ones.*

* Returns: *Model transformers configuration.*


In [None]:
def get_model_config(args: ModelDataArguments, override_config):
  r"""
  Get model configuration.

  Using the ModelDataArguments return the model configuration.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pretraining.

    override_config (:obj:`dict`):
      Configuration values that override the loaded ones.

  Returns:

    :obj:`PretrainedConfig`: Model transformers configuration.

  """

  # Check model configuration.
  if args.config_name is not None:
    # Use model configure name if defined.
    model_config = AutoConfig.from_pretrained(args.config_name,
                                      cache_dir=args.cache_dir)

  elif args.model_name_or_path is not None:
    # Use model name or path if defined.
    model_config = AutoConfig.from_pretrained(args.model_name_or_path,
                                      cache_dir=args.cache_dir)

  else:
    # Use config mapping if building model from scratch.
    model_config = CONFIG_MAPPING[args.model_type]()

  if override_config:
    model_config.update(override_config)

  return model_config


### Tokenizer

This function returns a tokenizer object that is used to preprocess raw text data for pre-training the model. When neither `tokenizer_name` nor `model_name_or_path` is set, it loads the tokenizer trained in the **Train a tokenizer** section from `local_path`.

_function_ **get_tokenizer(args: ModelDataArguments, local_path, config)**

Uses the `ModelDataArguments` to return the model tokenizer and changes `max_seq_length` in `args` if needed. Here are the argument details:

* `args`: *Model and data configuration arguments needed to perform pretraining.*

* `local_path`: *Path to the trained tokenizer.*

* `config`: *Model Configuration.*

* Returns: *Model transformers tokenizer.*

In [None]:
def get_tokenizer(args: ModelDataArguments, local_path, config):
  r"""
  Get model tokenizer.

  Using the ModelDataArguments return the model tokenizer and change
  `max_seq_length` from `args` if needed.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pre-training.

    local_path (:obj:`str`):
      Path to the trained tokenizer.

    config (:obj:`Config`):
      Model Configuration.

  Returns:

    :obj:`PreTrainedTokenizer`: Model transformers tokenizer.

  """

  # Check tokenizer configuration.
  if args.tokenizer_name:
    # Use tokenizer name if defined.
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name,
                                              cache_dir=args.cache_dir)

  elif args.model_name_or_path:
    # Use model name or path if defined.
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path,
                                              cache_dir=args.cache_dir)
  else:
    tokenizer = AutoTokenizer.from_pretrained(local_path,
                                              config=config,
                                              cache_dir=args.cache_dir)

  # Setup data maximum number of tokens.
  if args.max_seq_length <= 0:
    # Set max_seq_length to maximum length of tokenizer.
    # Input max_seq_length will be the max possible for the model.
    args.max_seq_length = tokenizer.model_max_length
  else:
    # Never go beyond tokenizer maximum length.
    args.max_seq_length = min(args.max_seq_length, tokenizer.model_max_length)

  return tokenizer
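
The clamping rule at the end of `get_tokenizer` can be isolated into a few lines. The sketch below uses a hypothetical helper, `resolve_max_seq_length`, that mirrors that logic without needing a tokenizer object.

```python
def resolve_max_seq_length(requested, model_max_length):
    """Mirror the rule in `get_tokenizer`: non-positive means 'use the model maximum'."""
    if requested <= 0:
        return model_max_length
    # Never go beyond what the tokenizer (and model) can handle.
    return min(requested, model_max_length)


print(resolve_max_seq_length(128, 512))   # requested length fits
print(resolve_max_seq_length(1024, 512))  # clamped to the tokenizer limit
print(resolve_max_seq_length(-1, 512))    # falls back to the tokenizer limit
```

One caveat, as far as I know: a tokenizer saved without an explicit limit reports a very large `model_max_length`, so leaving `max_seq_length` at `-1` with a freshly trained tokenizer may produce an unusable value. Setting `max_seq_length` explicitly, as this notebook does later, avoids that.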


### Masked Language Models

This notebook only focuses on training Masked language models such as BERT, RoBERTa. Therefore, we will use the `AutoModelForMaskedLM` class to initialize the model.



_function_ **get_model(args: ModelDataArguments, model_config)**

Uses the `ModelDataArguments` to return the actual model. Here are the argument details:

* `args`: *Model and data configuration arguments needed to perform pretraining.*

* `model_config`: *Model transformers configuration.*

* Returns: *transformers model object.*

In [None]:
def get_model(args: ModelDataArguments, model_config):
  r"""
  Get model.

  Using the ModelDataArguments return the actual model.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pretraining.

    model_config (:obj:`PretrainedConfig`):
      Model transformers configuration.

  Returns:

    :obj:`torch.nn.Module`: PyTorch model.

  """

  # Make sure MODEL_FOR_MASKED_LM_MAPPING is imported from transformers module.
  if 'MODEL_FOR_MASKED_LM_MAPPING' not in globals():
    raise ValueError('Could not find MODEL_FOR_MASKED_LM_MAPPING imported!' \
                     ' Make sure to import it from `transformers` module!')

  # Check if using pre-trained model or train from scratch.
  if args.model_name_or_path:
    # Use pre-trained model.
    if type(model_config) in MODEL_FOR_MASKED_LM_MAPPING.keys():
      # Masked language modeling head.
      return AutoModelForMaskedLM.from_pretrained(
                        args.model_name_or_path,
                        from_tf=bool(".ckpt" in args.model_name_or_path),
                        config=model_config,
                        cache_dir=args.cache_dir,
                        )
    else:
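      # A single `%s` placeholder: the original two-placeholder string raised a TypeError.
      raise ValueError(
          'Invalid `model_name_or_path`! Its config type should be one of %s!' %
          str(list(MODEL_FOR_MASKED_LM_MAPPING.keys()))
        )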

  else:
    # Use model from configuration - train from scratch.
    print("Training new model from scratch!")
    return AutoModelForMaskedLM.from_config(model_config)


### Dataset & Preprocessing

There are two functions in this subsection: the first loads data from the Hugging Face Hub (or from local files) using the HF `datasets` library, and the second tokenizes the text data into lines or chunks.

_function_ **get_dataset(args: ModelDataArguments)**

Get the raw datasets from Hugging Face. Here are the argument details:

* `args`: *Model and data configuration arguments needed to perform pretraining.*

* Returns: *Dataset.*

<br>

_function_ **preprocess_data(args: ModelDataArguments, train_args: TrainingArguments, dataset: Dataset, tokenizer: PreTrainedTokenizer)**

  Preprocess and tokenize the dataset.

  This function can tokenize each nonempty line in the dataset or group chunks together
  after tokenizing every text.

  Arguments:

* `args`: Model and data configuration arguments needed to perform pretraining.

* `train_args`: Training arguments needed to perform pretraining.

* `dataset`: Raw dataset that needs to be preprocessed.

* `tokenizer`: Model transformers tokenizer.

* Returns: *Tokenized Dataset.*

In [None]:
def get_dataset(args: ModelDataArguments):
  r"""
  Get the raw datasets from Hugging Face.

  Using the ModelDataArguments return the raw datasets.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pretraining.

  Returns:

    :obj:`DatasetDict`: Raw dataset splits (train/validation) that contain text data.

  """

  # Loading data using datasets.
  if args.dataset_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(
        args.dataset_name,
        args.dataset_config_name,
        cache_dir=args.cache_dir,
    )
    # Split the dataset into train and validation sets if needed.
    if "validation" not in raw_datasets.keys():
      raw_datasets["validation"] = load_dataset(
          args.dataset_name,
          args.dataset_config_name,
          split=f"train[:{args.validation_split_percentage}%]",
          cache_dir=args.cache_dir,
      )
      raw_datasets["train"] = load_dataset(
          args.dataset_name,
          args.dataset_config_name,
          split=f"train[{args.validation_split_percentage}%:]",
          cache_dir=args.cache_dir,
      )
  else:
    data_files = {}
    if args.train_file is not None:
        data_files["train"] = args.train_file
        extension = args.train_file.split(".")[-1]

    if args.validation_file is not None:
        data_files["validation"] = args.validation_file
        extension = args.validation_file.split(".")[-1]
    if args.test_file is not None:
        data_files["test"] = args.test_file
        extension = args.test_file.split(".")[-1]
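    # `load_dataset` names the plain-text builder "text", not "txt".
    if extension == "txt":
        extension = "text"
    # `field="data"` only applies to JSON files that nest records under "data".
    load_kwargs = {"field": "data"} if extension == "json" else {}
    raw_datasets = load_dataset(
        extension,
        data_files=data_files,
        cache_dir=args.cache_dir,
        **load_kwargs,
    )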

    # Split the dataset into train and validation sets if needed.
    if "validation" not in raw_datasets.keys():
      raw_datasets["validation"] = load_dataset(
          extension,
          data_files=data_files,
          split=f"train[:{args.validation_split_percentage}%]",
          cache_dir=args.cache_dir,
      )
      raw_datasets["train"] = load_dataset(
          extension,
          data_files=data_files,
          split=f"train[{args.validation_split_percentage}%:]",
          cache_dir=args.cache_dir,
      )

  return raw_datasets


def preprocess_data(args: ModelDataArguments, train_args: TrainingArguments, dataset: Dataset, tokenizer: PreTrainedTokenizer):
  r"""
  Preprocess and tokenize the dataset.

  1. This function can tokenize each nonempty line in the dataset for pretraining.
  2. This function can tokenize sentence pairs for finetuning NLI task.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pretraining.

    train_args (:obj:`TrainingArguments`):
      Training arguments needed to perform pretraining.

    dataset (:obj:`Dataset`):
      Raw dataset that needs to be preprocessed.

    tokenizer (:obj:`PreTrainedTokenizer`):
      Model transformers tokenizer.

  Returns:

    :obj:`Dataset`: PyTorch Dataset that contains file's data.

  """

  padding = "max_length" if args.pad_to_max_length else False

  if train_args.do_train:
      column_names = list(dataset["train"].features)
  else:
      column_names = list(dataset["validation"].features)

  # Preprocessing the datasets.
  # First we tokenize all the texts.
  if args.line_by_line:
    # When using line_by_line, we just tokenize each nonempty line.
    # TODO: Implement a function to tokenize a batch of inputs
    pass
  else:
    # TODO: Implement data processing for text classification
    pass

  # TODO: Execute tokenization using datasets map()

  # Return tokenized datasets
  return tokenized_datasets
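
One possible way to fill in the `line_by_line` TODO is sketched below. It is a suggestion, not the reference solution: `make_line_by_line_tokenizer` is a hypothetical helper, and it assumes the text lives in a column called `text`. The keyword arguments passed to the tokenizer (`padding`, `truncation`, `max_length`, `return_special_tokens_mask`) are standard arguments of Hugging Face tokenizers.

```python
def make_line_by_line_tokenizer(tokenizer, text_column, max_seq_length, padding):
    """Build a batched function that tokenizes every non-empty line."""

    def tokenize_function(examples):
        # Drop empty or whitespace-only lines before tokenizing.
        lines = [
            line for line in examples[text_column]
            if line and not line.isspace()
        ]
        return tokenizer(
            lines,
            padding=padding,
            truncation=True,
            max_length=max_seq_length,
            # Lets the collator skip special tokens without recomputing the mask.
            return_special_tokens_mask=True,
        )

    return tokenize_function
```

Inside `preprocess_data`, the returned function could then be applied with something like `dataset.map(tokenize_function, batched=True, num_proc=args.preprocessing_num_workers, remove_columns=column_names, load_from_cache_file=not args.overwrite_cache)`. Removing the original columns matters here because filtering empty lines changes the number of rows in a batch.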

### Data Collator

Data collator is used to combine a batch of input examples into a single input tensor that can be processed by a neural network. The data collator is typically used in conjunction with a `dataloader`, which batches individual input examples together and passes them to the collator. The specific behavior of the data collator can vary depending on the task and the model being used. In this notebook, we will use DataCollator to randomly mask the tokens.

_function_ **get_collator(args: ModelDataArguments, tokenizer: PreTrainedTokenizer)**

The collator function is used to collate samples from a PyTorch Dataset object into batches. Here are the argument details:

* `args`: *Model and data configuration arguments needed to perform pretraining.*

* `tokenizer`: *Model transformers tokenizer.*

* Returns: *Transformers specific data collator.*

In [None]:
def get_collator(args: ModelDataArguments, tokenizer: PreTrainedTokenizer):
  r"""
  Get appropriate collator function.

  Collator function will be used to collate a PyTorch Dataset object.

  Arguments:

    args (:obj:`ModelDataArguments`):
      Model and data configuration arguments needed to perform pretraining.

    tokenizer (:obj:`PreTrainedTokenizer`):
      Model transformers tokenizer.

  Returns:

    :obj:`data_collator`: Transformers specific data collator.

  """

  # Choose the collator depending on the masking strategy.
  if args.whole_word_mask:
    # Use whole word masking.
    return DataCollatorForWholeWordMask(
                                        tokenizer=tokenizer,
                                        mlm_probability=args.mlm_probability,
                                        )
  else:
    # Regular language modeling.
    return CustomDataCollatorForLanguageModeling(
                                        tokenizer=tokenizer,
                                        mlm=True,
                                        mlm_probability=args.mlm_probability,
                                        )
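
To illustrate what whole word masking means at the token level, here is a small pure-Python sketch. The helpers are hypothetical, and the sketch assumes WordPiece-style tokens where a `##` prefix marks the continuation of a word. As far as I know, `DataCollatorForWholeWordMask` relies on the same convention, so it is geared toward BERT-style tokenizers and may not group tokens into words correctly with a byte-level BPE tokenizer.

```python
def group_wordpieces(tokens):
    """Group token indices into whole words using the `##` continuation prefix."""
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            # Continuation piece: attach it to the current word.
            groups[-1].append(index)
        else:
            # A new word starts here.
            groups.append([index])
    return groups


def mask_whole_word(tokens, group, mask_token="[MASK]"):
    """Replace every piece of one word with the mask token."""
    return [mask_token if i in group else tok for i, tok in enumerate(tokens)]


tokens = ["pneu", "##mon", "##ia", "is", "treat", "##able"]
groups = group_wordpieces(tokens)
masked = mask_whole_word(tokens, groups[0])
```

With token-level masking, only `##mon` might be hidden, which is easy to guess from its neighbors. Masking all three pieces of the word makes the prediction task harder.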

Re-implementing the masking strategy is a useful exercise to gain a deeper understanding of this important technique. While `transformers` library provides a well-implemented function, there is still value in practicing the implementation ourselves. By doing so, we can gain insights into the nuances of masking and further refine our pre-training strategies. With this knowledge, you can create customized pre-training strategies to suit your specific NLP tasks.

In [None]:
@dataclass
class CustomDataCollatorForLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they
    are not all of the same length.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        mlm (`bool`, *optional*, defaults to `True`):
            Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs
            with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked
            tokens and the value to predict for the masked token.
        mlm_probability (`float`, *optional*, defaults to 0.15):
            The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".

    <Tip>

    For best performance, this data collator should be used with a dataset having items that are dictionaries or
    BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a
    [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.

    </Tip>"""

    tokenizer: PreTrainedTokenizer
    mlm: bool = True
    mlm_probability: float = 0.15
    pad_to_multiple_of: int = None
    return_tensors: str = "pt"

    def __post_init__(self):
        if self.mlm and self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. "
                "You should pass `mlm=False` to train on causal language modeling instead."
            )

    def torch_call(self, examples):
        # Padding examples to the same length.
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)

        # If special token mask has been preprocessed, pop it from the dict.
        special_tokens_mask = batch.pop("special_tokens_mask", None)
        if self.mlm:
            batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
                batch["input_ids"], special_tokens_mask=special_tokens_mask
            )
        else:
            labels = batch["input_ids"].clone()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            batch["labels"] = labels
        return batch

    def torch_mask_tokens(self, inputs, special_tokens_mask = None):
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        import torch

        labels = inputs.clone()
        # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()

        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels
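
To see the masking rule outside of tensors, here is a minimal pure-Python sketch of the same 80/10/10 scheme. It is an illustration rather than the collator's implementation: `mask_tokens` and its arguments are hypothetical, and it draws one random number per selected token instead of two Bernoulli samples. The two are equivalent in distribution, because the collator's second draw picks half of the remaining 20%, which is 10% overall.

```python
import random


def mask_tokens(input_ids, special_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs = list(input_ids)
    labels = []
    for pos, (token, is_special) in enumerate(zip(input_ids, special_mask)):
        selected = (not is_special) and rng.random() < mlm_probability
        if not selected:
            labels.append(-100)  # ignored by the loss
            continue
        labels.append(token)     # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # Remaining 10%: keep the original token unchanged.
    return inputs, labels


ids = [0, 15, 27, 33, 48, 2]                     # <s> ... </s> (ids are made up)
special = [True, False, False, False, False, True]
masked_inputs, labels = mask_tokens(ids, special, mask_id=4, vocab_size=4000,
                                    rng=random.Random(69))
```

Special tokens are never selected, and unselected positions get the label `-100`, so only the selected positions contribute to the cross-entropy loss.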


## Define model arguments

Before diving deeper, we need to define some important arguments based on our needs.

First, declare some necessary arguments used in the next steps:

* `output_model_dir` - The output directory where the tokenizer and model checkpoints will be written.

* `model_type` - The type of the model, e.g., bert, roberta, etc.

* `dataset_name` - The name of the dataset from Hugging Face.

* `vocab_size` - The size of the vocabulary.

* `max_seq_length` - The maximum sequence length in tokens.

* `mlm_probability` - The probability of masking tokens.

* `whole_word_mask` - Whether to mask the entire word.

* `line_by_line` - Whether to tokenize each nonempty line.

In [None]:
output_model_dir = "./mlm_bert"  # @param {type:"string"}

model_type = 'roberta'  # @param {type:"string"}

dataset_name = 'razent/vi_pubmed_small'  # @param {type:"string"}

# Vocabulary size of the new tokenizer, picked arbitrarily.
vocab_size = 4000   # @param {type:"number"}

max_seq_length = 128   # @param {type:"number"}

mlm_probability = 0.15   # @param {type:"number"}

whole_word_mask = False   # @param {type:"boolean"}

line_by_line = True   # @param {type:"boolean"}

pad_to_max_length = True   # @param {type:"boolean"}

# Create model folder
os.makedirs(output_model_dir, exist_ok=True)

As I mentioned in the first section, we will use the [ViPubMed](https://huggingface.co/datasets/razent/vi_pubmed_small) dataset for the pre-training part. Since we want to pre-train a RoBERTa model from scratch, we only need to set `model_type` to `roberta` (just ignore `config_name` and `model_name_or_path`). We use the same `mlm_probability=0.15` as BERT and set the maximum sequence length to `128` for faster training.

In [None]:
# Define arguments for data, tokenizer and model arguments.
# See comments in `ModelDataArguments` class.
model_data_args = ModelDataArguments(
                                    dataset_name=dataset_name,
                                    line_by_line=line_by_line,
                                    whole_word_mask=whole_word_mask,
                                    mlm_probability=mlm_probability,
                                    max_seq_length=max_seq_length,
                                    pad_to_max_length=pad_to_max_length,
                                    overwrite_cache=False,
                                    model_type=model_type,
                                    cache_dir=None
                                    )

## Pulling the dataset

Before going further, we first need to pull the dataset.

In [None]:
# Pull the raw dataset (train and validation splits).
print('Pulling the dataset...')
datasets = get_dataset(model_data_args)
datasets

The dataset that we are working with contains 5,000 samples, partitioned into two subsets: 4,750 samples for training and 250 samples for validation. Each sample has a single `text` column, which is all MLM pre-training needs. The corpus is small, which keeps training fast enough for this notebook.
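
The split sizes above follow directly from the default `validation_split_percentage=5` and the `train[:5%]` / `train[5%:]` slices used in `get_dataset`. The arithmetic can be checked with a short sketch. The helper `split_sizes` is hypothetical, and it only covers cases that divide evenly: the `datasets` library applies its own rounding rules to percent slices.

```python
def split_sizes(num_samples, validation_split_percentage):
    """Sizes of the train/validation subsets for an evenly dividing percentage."""
    num_validation = num_samples * validation_split_percentage // 100
    return num_samples - num_validation, num_validation


train_size, validation_size = split_sizes(5000, 5)
print(train_size, validation_size)
```

Note that the validation subset is taken from the start of the training split, so it reflects whatever ordering the dataset already has.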

Now that we have our corpus in the form of an iterator of texts, we are ready to train a new tokenizer.


## Train a tokenizer

In this section, we will train a byte-level Byte-Pair Encoding (BPE) tokenizer from scratch.

We recommend training a byte-level BPE because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!). So we don’t need to specify an `unk_token`.

Before the tokenizer splits a text into subtokens, it will need to perform some preprocessing steps.

Here's a high-level overview of the steps in the tokenization pipeline:

![Img](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/tokenization_pipeline-dark.svg)

Fortunately, the 🤗 Tokenizers library has been built to provide several options for each of those steps, so we just need to read the documentation and mix/match them together.

The full pipeline steps are:
- Normalization (`normalizers`): any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc. [Read more](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.normalizers)
- Pre-tokenization (`pre_tokenizers`): splitting the input into words. [Read more](https://huggingface.co/docs/tokenizers/api/tokenizer#module-tokenizers.pre_tokenizers)
- Model (`models`): running the input through the model (using the pre-tokenized words to produce a sequence of tokens). [Read more](https://huggingface.co/docs/tokenizers/api/tokenizer#module-tokenizers.models)
- Post-processing (`post_processors`): adding the special tokens of the tokenizer, generating the attention mask and token type IDs. [Read more](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#module-tokenizers.processors)

In addition, the library provides `decoders`: various types of Decoder for converting the outputs of tokenization back into text. [Read more](https://huggingface.co/docs/tokenizers/python/latest/components.html#decoders)

_⚠️ Training a tokenizer is not the same as training a model! Model training uses stochastic gradient descent to make the loss a little bit smaller for each batch. It’s randomized by nature (meaning you have to set some seeds to get the same results when doing the same training twice). Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and the exact rules used to pick them depend on the tokenization algorithm. It’s deterministic, meaning you always get the same results when training with the same algorithm on the same corpus._
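
To illustrate why tokenizer training is a deterministic, count-based process, here is a toy character-level BPE learner in pure Python. It is only a sketch: the real byte-level BPE in `tokenizers` works on bytes, is far more optimized, and may break ties differently. The function name `learn_bpe_merges` and the tiny corpus are made up for illustration.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of text lines."""
    # Represent every word as a tuple of symbols, starting from characters.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself,
        # so the result never depends on iteration order or random seeds.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge rule to every word.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


toy_corpus = ["low low low lower"]
print(learn_bpe_merges(toy_corpus, num_merges=3))
```

Running the function twice on the same corpus yields the same merge list, which is the property the note above describes.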

In [None]:
from tokenizers import models, pre_tokenizers, decoders, trainers, processors, Tokenizer, ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())   # or use ByteLevelBPETokenizer()
# RoBERTa does not use a normalizer, so we skip that step and go directly to pre-tokenization.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)


We could initialize this model with a vocabulary if we had one (we would need to pass `vocab.json` and `merges.txt` in that case), but since we will train from scratch, we don't need to do that.

The option we added to ByteLevel here is to not add a space at the beginning of a sentence (which is the default otherwise). We can have a look at the pre-tokenization of an example text:

In [None]:
tokenizer.pre_tokenizer.pre_tokenize_str("Hi, my name is Hieu!")

This tokenizer has a few special symbols, like Ġ and Ċ, which denote spaces and newlines, respectively.
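
Where do `Ġ` and `Ċ` come from? As far as I know, byte-level BPE follows the GPT-2 convention of mapping each of the 256 possible bytes to a printable Unicode character, so that spaces and control characters stay visible in the vocabulary. The sketch below reproduces that mapping in pure Python; treat it as an illustration of the idea rather than the library's exact code.

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable Unicode character, GPT-2 style."""
    # Bytes that are already printable keep their own character.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, newline, ...) are shifted above U+0100.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


mapping = bytes_to_unicode()
print(repr(mapping[ord(" ")]), repr(mapping[ord("\n")]))

# A Vietnamese character spans several UTF-8 bytes, each mapped separately.
print([mapping[b] for b in "ể".encode("utf-8")])
```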

Next, we will train the tokenizer with the same special tokens as RoBERTa. We also specify `min_frequency`, the minimum number of times a pair of tokens must appear in the corpus to be merged into a new token.

In [None]:
%%time
# Customize training with only text data from the training dataset
trainer = trainers.BpeTrainer(vocab_size=vocab_size, min_frequency=2, special_tokens=[
    "<s>", #[CLS]
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer.train_from_iterator(iterator=datasets['train']['text'], trainer=trainer)


🔥🔥 Wow, that was fast! ⚡️🔥

What is great is that our tokenizer is optimized for Vietnamese. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Vietnamese (`ỏ`, `ể`, `ừ`, `ĩ`, `ỷ`, and `ặ`), are encoded natively. We also represent sequences more efficiently: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained RoBERTa tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens.


In [None]:
# Add special tokens at the beginning and the end of a sentence.
tokenizer.post_processor = processors.RobertaProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)

tokenizer.enable_truncation(max_length=512)
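
Conceptually, the RoBERTa post-processing step wraps a single sequence as `<s> A </s>` and a pair of sequences as `<s> A </s></s> B </s>`. The pure-Python sketch below mimics that template on lists of token strings; the function name `roberta_template` and the example tokens are made up for illustration and do not come from the trained tokenizer.

```python
def roberta_template(tokens_a, tokens_b=None, cls="<s>", sep="</s>"):
    """Add RoBERTa-style special tokens around one or two token sequences."""
    output = [cls] + list(tokens_a) + [sep]
    if tokens_b is not None:
        # A pair is joined with a double separator.
        output += [sep] + list(tokens_b) + [sep]
    return output


# Hypothetical subword tokens, for illustration only.
single = roberta_template(["Xin", "Ġchào"])
pair = roberta_template(["Xin", "Ġchào"], ["VietAI"])
print(single)
print(pair)
```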

In [None]:
encoding = tokenizer.encode("Xin chào VietAI nhé!")
encoding.tokens

The encoding obtained is an Encoding, which contains all the necessary outputs of the tokenizer in its various attributes: `ids`, `type_ids`, `tokens`, `offsets`, `attention_mask`, `special_tokens_mask`, and `overflowing`.

Finally, we add a byte-level decoder:

In [None]:
tokenizer.decoder = decoders.ByteLevel()

and we can double-check it works properly:

In [None]:
tokenizer.decode(encoding.ids)

Great! Now that we’re done, we can save the tokenizer, and wrap it in a `PreTrainedTokenizerFast` or other tokenizer supported in `transformers`.

For example:
```
RobertaTokenizerFast(tokenizer_object=tokenizer)
```

In [None]:
# Save files to disk
tokenizer.save(f'{output_model_dir}/tokenizer.json')

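We now have a `tokenizer.json`, which stores the whole tokenizer: its configuration, the learned merges, and the vocabulary, which maps each token to its ID. Note that this file will be loaded as a fast tokenizer (`TokenizerFast`) later. The beginning of the vocabulary looks like this:


```json
{
  "<s>": 0,
  "<pad>": 1,
  "</s>": 2,
  "<unk>": 3,
  "<mask>": 4,
  "!": 5,
  "\"": 6,
  ...
}
```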

## **Parameters Setup**

Declare the rest of the parameters used for this notebook:

* `model_data_args` contains all arguments needed to setup dataset, model configuration, model tokenizer and the actual model. This is created using the `ModelDataArguments` class.

* `training_args` contains all arguments needed to use the [Trainer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer) functionality from Transformers, which allows us to train transformers models in PyTorch very easily. You can find the complete documentation [here](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). There are a lot of parameters that can be set to allow multiple functionalities:

  * `output_dir`: *The output directory where the model predictions and checkpoints will be written. I set it to `output_model_dir`, where the model and its checkpoints will be saved.*

  * `overwrite_output_dir`: *Overwrite the content of the output directory. I set it to `True` in case I run the notebook multiple times and I only care about the last run.*

  * `do_train`: *Whether to run training or not. I set this parameter to `True` because I want to train the model on my own dataset.*

  * `do_eval`: *Whether to run evaluation or not. I set it to `True` since I have a validation split and I want to evaluate how well the model trains.*

  * `per_device_train_batch_size`: *Batch size per GPU/TPU core/CPU for training. I set it to `64` for this example. I recommend setting it as high as your GPU memory allows.*

  * `per_device_eval_batch_size`: *Batch size per GPU/TPU core/CPU for evaluation.*

  * `evaluation_strategy`: *Evaluation strategy to adopt during training:*
    - `no`: *No evaluation during training;*

    - `steps`: *Evaluate every `eval_steps`;*

    - `epoch`: *Evaluate at the end of every epoch.*

    *Use `steps` if you want to evaluate the model more often. Note that the cell below does not pass this argument, so the library default applies.*

  * `logging_steps`: *How often to show logs. I set this to `100`. If your evaluation data is large you might not want to evaluate that often because it will significantly slow down training.*

  * `eval_steps`: *Number of update steps between two evaluations if `evaluation_strategy="steps"`. Defaults to the same value as `logging_steps` if not set. Since I want to evaluate the model every `logging_steps`, I leave it unset so that it inherits the value of `logging_steps`.*

  * `prediction_loss_only`: *Set prediction loss to `True` in order to return loss for perplexity calculation.*

  * `learning_rate`: *The initial learning rate for Adam. Defaults to `5e-5`.*

  * `weight_decay`: *The weight decay to apply (if not zero). Defaults to `0`.*

  * `adam_epsilon`: *Epsilon for the Adam optimizer. Defaults to `1e-8`.*

  * `max_grad_norm`: *Maximum gradient norm (for gradient clipping). Defaults to `1.0`.*

  * `num_train_epochs`: *Total number of training epochs to perform (if not an integer, the fractional part is run as a fraction of the last epoch before training stops). I set it to `20` for pre-training on this small corpus; keep an eye on the validation loss so that the model does not overfit.*

  * `save_steps`: *Number of update steps between two checkpoint saves. Defaults to `500`.*

In [None]:
# Batch size per GPU/TPU core/CPU for training.
batch_size = 64  # @param {type:"number"}

# The initial learning rate for Adam. Defaults to 5e-5.
learning_rate = 1e-4  # @param {type:"number"}

# Total number of training epochs to perform (if not an integer, the fractional
# part is run as a fraction of the last epoch before training stops).
num_train_epochs = 20  # @param {type:"number"}

# How often to show logs. I will use this to plot the loss history and calculate perplexity.
logging_steps = 100  # @param {type:"number"}

# Number of update steps between two checkpoint saves. Defaults to 500.
save_steps = 100  # @param {type:"number"}

# The weight decay to apply (if not zero).
weight_decay = 0.0  # @param {type:"number"}

# Epsilon for the Adam optimizer. Defaults to 1e-8
adam_epsilon = 1e-8  # @param {type:"number"}

# Maximum gradient norm (for gradient clipping). Defaults to 1.0.
max_grad_norm = 1.0  # @param {type:"number"}

# The total number of saved models.
save_total_limit = 3  # @param {type:"number"}

# Set prediction loss to `True` in order to return loss for perplexity calculation.
prediction_loss_only = True # @param {type: "boolean"}

# Whether to run training or not.
do_train = True # @param {type: "boolean"}

# Whether to run evaluation or not.
do_eval = True # @param {type: "boolean"}

# Overwrite the content of the output directory.
overwrite_output_dir = True # @param {type: "boolean"}


# Define arguments for training
# `TrainingArguments` contains a lot more arguments.
# For more details check the awesome documentation:
# https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
                          output_dir=output_model_dir,
                          overwrite_output_dir=overwrite_output_dir,
                          do_train=do_train,
                          do_eval=do_eval,
                          per_device_train_batch_size=batch_size,
                          logging_steps=logging_steps,
                          prediction_loss_only=prediction_loss_only,
                          learning_rate=learning_rate,
                          weight_decay=weight_decay,
                          adam_epsilon=adam_epsilon,
                          max_grad_norm=max_grad_norm,
                          num_train_epochs=num_train_epochs,
                          save_steps=save_steps,
                          save_total_limit=save_total_limit)
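
It helps to translate these settings into concrete numbers before launching a run. The sketch below is plain arithmetic, assuming a single device, no gradient accumulation, and the 4,750 training samples mentioned earlier; the helper name `training_schedule` is made up for illustration.

```python
import math


def training_schedule(num_samples, batch_size, num_epochs,
                      logging_steps, save_steps, save_total_limit):
    """Estimate step counts for one device without gradient accumulation."""
    # The last, smaller batch of an epoch still counts as one step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints_written = total_steps // save_steps
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_events": total_steps // logging_steps,
        "checkpoints_written": checkpoints_written,
        # Older checkpoints are deleted once the limit is reached.
        "checkpoints_kept": min(checkpoints_written, save_total_limit),
    }


schedule = training_schedule(num_samples=4750, batch_size=64, num_epochs=20,
                             logging_steps=100, save_steps=100,
                             save_total_limit=3)
print(schedule)
```

Under these assumptions the run takes 1,500 optimizer steps, logs 15 times, and keeps only the 3 most recent checkpoints on disk.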



## **Load Configuration, Tokenizer and Model**

Loading the three essential parts of the pretrained transformers: configuration, tokenizer and model.

Since I use the AutoClass functionality from Hugging Face, I only need to worry about the model's name as input and the rest is handled by the transformers library.

I will be calling each of the three functions created in the **Helper Functions** section, which return the `config` of the model, the `tokenizer` of the model, and the actual PyTorch `model`.

After `model` is loaded, it is always good practice to resize the model depending on the `tokenizer` size. This means that the tokenizer's vocabulary will be aligned with the model's embedding layer. This is very useful when we have a different tokenizer than the pretrained one or when we train a transformer model from scratch.



In [None]:
# Load model configuration.
print('Loading model configuration...')
override_config = {
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "intermediate_size": 2048,
    "hidden_size": 512,
    # 128 + 2: RoBERTa position IDs start at padding_idx + 1 (= 2), not at 0.
    "max_position_embeddings": 130
}
config = get_model_config(model_data_args, override_config)

# Load model tokenizer.
print("Loading model's tokenizer...")
tokenizer = get_tokenizer(model_data_args, local_path=output_model_dir, config=config)

# Loading model.
print('Loading actual model...')
model = get_model(model_data_args, config)

# Resize model to fit all tokens in tokenizer.
model.resize_token_embeddings(len(tokenizer))

# Number of model parameters
print("Number of model parameters:", model.num_parameters())
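
Why is `max_position_embeddings` set to `130` for sequences of 128 tokens? To my understanding, RoBERTa numbers positions starting from `padding_idx + 1` and reserves `padding_idx` itself for padded slots. The pure-Python sketch below mirrors that logic, assuming `<pad>` has ID `1` as in the tokenizer trained above; the function name `roberta_position_ids` is made up for illustration.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Compute RoBERTa-style position IDs for one sequence of token IDs."""
    position_ids, seen = [], 0
    for token in input_ids:
        if token == padding_idx:
            # Padding tokens all share the padding position.
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# Short padded example: <s> a b </s> <pad> <pad>
print(roberta_position_ids([0, 5, 6, 2, 1, 1]))

# A full sequence of 128 real tokens uses positions 2..129,
# so the position embedding table needs 130 rows.
full_sequence = [0] + [5] * 126 + [2]
rows_needed = max(roberta_position_ids(full_sequence)) + 1
print(rows_needed)
```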

## **Preprocess Dataset and Load Data Collator**

This is where I build the PyTorch Dataset and data collator objects that will be used to feed data into our model.

I strongly recommend using a validation set to determine how much training is needed and to avoid overfitting. After you figure out which parameters yield the best results, the validation set can be merged into the training set for a final training run on the whole dataset.

In [None]:
print('Preprocessing datasets...')
text_datasets = datasets.remove_columns('label')
tokenized_datasets = preprocess_data(model_data_args, training_args, text_datasets, tokenizer)

# Split train/eval datasets
train_dataset, eval_dataset = tokenized_datasets['train'], tokenized_datasets['validation']
print('Training set:', len(train_dataset))
print('Validation set:', len(eval_dataset))


# Get data collator to modify data format depending on type of model used.
data_collator = get_collator(model_data_args, tokenizer)

# Check how many logging prints you'll have. This is to avoid overflowing the
# notebook with a lot of prints. Display warning to user if the logging steps
# that will be displayed is larger than 100.
if (len(train_dataset) // training_args.per_device_train_batch_size \
    // training_args.logging_steps * training_args.num_train_epochs) > 100:
  # Display warning.
  warnings.warn('Your `logging_steps` value will do a lot of printing!' \
                ' Consider increasing `logging_steps` to avoid overflowing' \
                ' the notebook with a lot of prints!')

## **Train RoBERTa from scratch**

Hugging Face was very nice to us for creating the `Trainer` class. This makes PyTorch training of transformers models very easy! We just need to make sure we loaded the proper parameters, and everything else is taken care of!

At the end of training, the tokenizer is saved along with the model so you can easily reuse it later or even share it on the Hugging Face Hub.

In [None]:
%%time

# Initialize Trainer.
print('Loading `trainer`...')
trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset,
                  )


print('Start training...')

# Setup model path if the model to train loaded from a local path.
model_path = (model_data_args.model_name_or_path
              if model_data_args.model_name_or_path is not None and
              os.path.isdir(model_data_args.model_name_or_path)
              else None
              )


# Run training. `resume_from_checkpoint` replaces the deprecated `model_path` argument.
trainer.train(resume_from_checkpoint=model_path)
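
The `loss` values logged during MLM training are mean cross-entropy values (in nats), and perplexity is simply their exponential. The sketch below shows the conversion; the loss values in the loop are made-up placeholders, not results from this notebook.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Convert a mean cross-entropy loss (natural log) into perplexity."""
    return math.exp(mean_cross_entropy_loss)


# Placeholder losses, for illustration only.
for loss in (6.0, 4.0, 2.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.1f}")
```

A lower perplexity means the model is, on average, less "surprised" by the masked tokens it has to predict.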


#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
# Save model.
trainer.save_model(training_args.output_dir)

# For convenience, we also re-save the tokenizer to the same directory,
# so that you can share your model easily on huggingface.co/models =).
if trainer.is_world_process_zero():
  tokenizer.save_pretrained(training_args.output_dir)

## Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=output_model_dir,
    tokenizer=output_model_dir
)

fill_mask("Android nhập liệu bằng lời nói rất chuẩn, nhanh, nhiều ngôn <mask> cùng lúc.")
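
Conceptually, the ranking returned by a fill-mask pipeline boils down to a softmax over the vocabulary logits at the `<mask>` position, followed by a top-k selection. Here is a minimal pure-Python sketch of that idea; the candidate tokens and logit values are invented for illustration, and the helper `top_k_candidates` is not part of `transformers`.

```python
import math


def top_k_candidates(logits_by_token, k=3):
    """Return the k most probable tokens with their softmax probabilities."""
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits_by_token.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in logits_by_token.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Invented logits at the <mask> position, for illustration only.
toy_logits = {"Ġngữ": 4.0, "Ġtừ": 2.0, "Ġngười": 1.0, "Ġnăm": 0.5}
for token, prob in top_k_candidates(toy_logits, k=3):
    print(f"{token}: {prob:.3f}")
```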

# **Finetuning**


In the finetuning phase, we need to import the `AutoModelForSequenceClassification` class because we plan to fine-tune our pre-trained model on [**ViMedNLI**](https://huggingface.co/datasets/VietAI/vi_mednli), which is a text classification task.

In [None]:
from transformers import AutoModelForSequenceClassification, default_data_collator
import numpy as np

In [None]:
model_name_or_path = './mlm_bert' # @param {type:"string"}
datatset_name = 'VietAI/vi_mednli'  # @param {type:"string"}
max_seq_length = 64   # @param {type:"number"}

finetuning_model_data_args = ModelDataArguments(
                                    model_name_or_path=model_name_or_path,
                                    dataset_name=datatset_name,
                                    max_seq_length=max_seq_length,
                                    line_by_line=False,
                                    pad_to_max_length=True,
                                    overwrite_cache=False,
                                    cache_dir=None
                                    )

In the finetuning phase, it is recommended to use a lower learning rate than in pre-training. The BERT authors suggest learning rates such as 2e-5, 3e-5, or 5e-5. You should tune this value to get the best performance. In this experiment, I will choose 2e-5.

In [None]:
# Path to save the finetuned models
output_model_dir = "./nli_models"  # @param {type:"string"}

# Batch size per GPU/TPU core/CPU for training.
batch_size = 64  # @param {type:"number"}

# The initial learning rate for Adam. Defaults to 5e-5.
learning_rate = 2e-5  # @param {type:"number"}

# Total number of training epochs to perform (if not an integer, the fractional
# part is run as a fraction of the last epoch before training stops).
num_train_epochs = 20  # @param {type:"number"}

# How often to show logs. I will use this to plot the loss history.
logging_steps = 100  # @param {type:"number"}

# Number of update steps between two checkpoint saves. Defaults to 500.
save_steps = 100  # @param {type:"number"}

# The weight decay to apply (if not zero).
weight_decay = 0.0  # @param {type:"number"}

# Epsilon for the Adam optimizer. Defaults to 1e-8
adam_epsilon = 1e-8  # @param {type:"number"}

# Maximum gradient norm (for gradient clipping). Defaults to 1.0.
max_grad_norm = 1.0  # @param {type:"number"}

# The total number of saved models.
save_total_limit = 3  # @param {type:"number"}

# Keep this `False` so that predictions (not only the loss) are returned and `compute_metrics` can compute accuracy.
prediction_loss_only = False # @param {type: "boolean"}

# Whether to run training or not.
do_train = True # @param {type: "boolean"}

# Whether to run evaluation or not.
do_eval = True # @param {type: "boolean"}

# Overwrite the content of the output directory.
overwrite_output_dir = True # @param {type: "boolean"}


# Define arguments for training
# `TrainingArguments` contains a lot more arguments.
# For more details check the awesome documentation:
# https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
downstream_training_args = TrainingArguments(
                          output_dir=output_model_dir,
                          overwrite_output_dir=overwrite_output_dir,
                          do_train=do_train,
                          do_eval=do_eval,
                          per_device_train_batch_size=batch_size,
                          logging_steps=logging_steps,
                          prediction_loss_only=prediction_loss_only,
                          learning_rate=learning_rate,
                          weight_decay=weight_decay,
                          adam_epsilon=adam_epsilon,
                          max_grad_norm=max_grad_norm,
                          num_train_epochs=num_train_epochs,
                          save_steps=save_steps,
                          save_total_limit=save_total_limit)



Let's load the finetuning dataset!

We also need to preprocess this dataset. This time, we should keep both the text and the label.

In [None]:
datasets = get_dataset(finetuning_model_data_args)

label_list = datasets["train"].unique("label")
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)


tokenized_datasets = preprocess_data(finetuning_model_data_args, downstream_training_args, datasets, tokenizer)

# Split train/eval datasets
train_dataset, eval_dataset, test_dataset = tokenized_datasets['train'], tokenized_datasets['validation'], tokenized_datasets['test']
print('Training set:', len(train_dataset))
print('Validation set:', len(eval_dataset))
print('Test set:', len(test_dataset))

The dataset that we are working with is composed of a total of 14,049 samples, which have been partitioned into three subsets: 11,232 samples for training, 1,395 samples for validation, and 1,422 samples for testing. Each of these samples includes three columns, namely a `sentence1` column, `sentence2` column and a `label` column. The `label` column has three classes, including neutral, entailment, and contradiction.

Then let's initialize the Classifier. You should note that you will need to indicate the number of target labels. Don't forget that.

In [None]:
config = AutoConfig.from_pretrained(
        model_name_or_path,
        num_labels=num_labels,
        finetuning_task='nli',
        cache_dir=finetuning_model_data_args.cache_dir,
)


classifier = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        config=config,
        cache_dir=finetuning_model_data_args.cache_dir,
        ignore_mismatched_sizes=True,
)

We need a general-purpose function used to calculate various performance metrics for a given set of predictions and their corresponding ground-truth labels.

This function is particularly necessary to evaluate the performance of a model on a test set of data. By computing relevant performance metrics, such as accuracy, precision, recall, F1-score, and AUC-ROC, we can assess the quality of our model's predictions and make informed decisions about potential improvements.

In this notebook, we will calculate the `accuracy` metric to measure our trained models.

In [None]:
# You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a
# predictions and label_ids field) and has to return a dictionary string to float.
def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
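
Accuracy is the only metric computed in this notebook, but the same predictions can feed other metrics mentioned above. As an illustration, here is a pure-Python sketch of macro-averaged F1 for a multi-class problem such as NLI; the helper `macro_f1` and the toy predictions are made up, and in practice a library such as scikit-learn would usually be used instead.

```python
def macro_f1(preds, labels, num_labels):
    """Compute the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_labels):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / num_labels


# Toy predictions and labels for a 3-class problem.
toy_preds = [0, 1, 2, 2]
toy_labels = [0, 1, 1, 2]
print(round(macro_f1(toy_preds, toy_labels, num_labels=3), 4))
```

Unlike accuracy, macro F1 gives every class the same weight, which matters when the label distribution is unbalanced.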


The data collator defaults to `DataCollatorWithPadding` when the tokenizer is passed to `Trainer`. Since we already padded every example to `max_seq_length`, we use `default_data_collator` instead.

We now start finetuning the model on the Sentence-pair classification (**ViMedNLI**) dataset.

In [None]:
# Initialize our Trainer
trainer = Trainer(model=classifier,
                  args=downstream_training_args,
                  data_collator=default_data_collator if finetuning_model_data_args.pad_to_max_length else None,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  compute_metrics=compute_metrics,
                  )

print('Start training...')
trainer.train()


Yeah! Our model has completed training. Now we can evaluate the model on the test dataset.

Let's see how much we can achieve:

In [None]:
trainer.evaluate(test_dataset)

The result looks good, right?!

Congratulations! We have completed a very difficult challenge. Best of luck with your future career in NLP.

## **References**

This notebook is **very heavily inspired** by [gmihaila/ml_things](https://github.com/gmihaila/ml_things) and the [Hugging Face script](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) used for training language models. It is updated to work with the latest `transformers`.
