# Tuning 🤗 Transformers with Population Based Training
 

In this notebook we show how to fine-tune our Huggingface transformers using Population Based Training. The corresponding blog post is [here](https://medium.com/@amog_97444/c4e32c6c989b?source=friends_link&sk=92c2ed36420cd9e26281fd51da7c19b6).

For our implementation of the fine-tuning, we used [Ray Tune](https://docs.ray.io/en/master/tune/index.html), an open source library for scalable hyperparameter tuning. It is built on top of the [Ray](https://ray.io/) framework, which makes it well suited to parallel hyperparameter tuning on multiple GPUs. Make sure to set your runtime to use GPUs when going through this notebook. Since Colab provides us with limited memory and a single GPU, this notebook runs only 2 samples and uses a perturbation interval of 2 iterations. A much smaller transformer (tiny-distilroberta) fits these limits more comfortably, though the cells below load bert-base-uncased. The results in the blog post were obtained with a standard BERT model, 8 samples, and perturbation after every iteration, and were run on an AWS p3.16xlarge instance. The exact code used for the blog post is [here](https://docs.ray.io/en/master/tune/examples/pbt_transformers.html).

Let’s take a look at how we can implement parallel Population Based Training for our transformers using this library!

## Setup

The first step is to install our main libraries:

In [1]:
!pip install transformers==3.0.2
!pip install ray==0.8.7
!pip install ray[tune]

Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting sentencepiece!=0.1.92
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting tokenizers==0.8.1.rc1
  Downloading tokenizers-0.8.1rc1-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, sentencepiece, sacremoses, transformers
Successfully installed sacremoses-0.0.47 sentencepiece-0.1.96 tokenizers-0.8.1rc1 transformers-3.0.2
Collecting ray==0.8.7
  Downloading ray-0.8.7-cp37-cp37m-manylinux1_x86_64.whl (22.0 MB)

Depending on your current setup, there might be other libraries you have to install, like torch. Also, if you’re wondering how I made the beautiful plots in the blog post, it’s with a library called [Weights & Biases](https://www.wandb.com/). If you'd like, we’ll go through how we can easily integrate W&B with our code as well so you can visualize your training runs, though using W&B is optional. First, create an account with them, and then we can install it and log in:


In [None]:
#!pip install wandb
#import os
#os.environ["WANDB_API_KEY"] = "<your-wandb-api-key>"  # never commit a real key

Now we can get started with our code! The first step is to start up Ray. If you’re running this on a cluster, make sure to specify an address to Ray. For this notebook example, we don't have to worry about this. Setting log_to_driver to False keeps the workers' tqdm training bars out of the notebook output; the cell below leaves it at True, so worker output (lines prefixed with a pid) is echoed into the tuning logs later on.

In [2]:
import ray

# If running on a cluster, use the line below instead
# ray.init(address="auto", log_to_driver=False)

ray.shutdown()
ray.init(log_to_driver=True, ignore_reinit_error=True)

2022-01-26 16:32:24,601	INFO resource_spec.py:231 -- Starting Ray with 6.98 GiB memory available for workers and up to 3.51 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2022-01-26 16:32:25,214	INFO services.py:1193 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2022-01-26_16-32-24_599671_72/sockets/plasma_store',
 'raylet_ip_address': '172.28.0.2',
 'raylet_socket_name': '/tmp/ray/session_2022-01-26_16-32-24_599671_72/sockets/raylet',
 'redis_address': '172.28.0.2:6379',
 'session_dir': '/tmp/ray/session_2022-01-26_16-32-24_599671_72',
 'webui_url': 'localhost:8265'}

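Then, we can load our data. Rather than downloading the RTE dataset, this notebook uploads a CSV file of sentences labeled fact, value, or policy; the transformer model and tokenizer are downloaded and cached a few cells further down.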


In [3]:
from google.colab import files
trainingDataset = files.upload()

Saving FormattedTrainingDataset.csv to FormattedTrainingDataset.csv


In [4]:
import numpy as np
import pandas as pd
training_data = pd.read_csv('FormattedTrainingDataset.csv', header=None)#, names= ['text', 'label'])
training_data.head()

|   | 0 | 1 |
|---|---|---|
| 0 | State and local court rules sometimes make def... | value |
| 1 | For example, when a person who allegedly owes ... | value |
| 2 | I urge the CFPB to find practices that involve... | policy |
| 3 | There is currently a split between the Ninth a... | fact |
| 4 | In many states, the nominal defendant is the j... | fact |


In [5]:
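from sklearn.model_selection import train_test_split

# Encode the string labels as integer class ids: fact=0, policy=1, value=2.
training_data[1] = training_data[1].replace('fact', 0)
training_data[1] = training_data[1].replace('value', 2)
training_data[1] = training_data[1].replace('policy', 1)

# Stratify so both splits keep the same label proportions.
# With no test_size given, scikit-learn holds out 25% of the rows.
train_queries, val_queries, train_labels, val_labels = train_test_split(
    training_data[0].tolist(),
    training_data[1].tolist(),
    stratify=training_data[1].tolist()
)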
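To make the effect of stratify concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the function name stratified_split, the toy data, and the fixed seed are all assumptions made for illustration. The 25% hold-out mirrors scikit-learn's default when no test_size is given.

```python
import random
from collections import Counter, defaultdict

def stratified_split(texts, labels, test_fraction=0.25, seed=0):
    """Split so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    def pick(idx):
        return [texts[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx)

# Toy data: 8 "fact", 4 "policy", 4 "value" examples, encoded as 0/1/2.
toy_labels = [0] * 8 + [1] * 4 + [2] * 4
toy_texts = [f"sentence {i}" for i in range(len(toy_labels))]

(train_x, train_y), (val_x, val_y) = stratified_split(toy_texts, toy_labels)
print("train label counts:", Counter(train_y))
print("val label counts:", Counter(val_y))
```

Both parts keep the 2:1:1 label ratio of the toy data, which is the property the stratify argument asks scikit-learn to preserve.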

In [6]:
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Util import
from ray.tune.examples.pbt_transformers import utils



# Set this to whatever you like
data_dir_name = "./data"
data_dir = os.path.abspath(os.path.join(os.getcwd(), data_dir_name))
if not os.path.exists(data_dir):
    os.mkdir(data_dir, 0o755)

# Change these as needed.
model_name = "bert-base-uncased"
task_name = "rte"

task_data_dir = os.path.join(data_dir, task_name.upper())

# Download and cache tokenizer, model, and features
print("Downloading and caching Tokenizer")

# Triggers tokenizer download to cache
AutoTokenizer.from_pretrained(model_name)
print("Downloading and caching pre-trained model")

# Triggers model download to cache
AutoModelForSequenceClassification.from_pretrained(model_name)

# Download data.
#utils.download_data(task_name, data_dir)

Downloading and caching Tokenizer
Downloading and caching pre-trained model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Training

With everything now downloaded and cached, we can now set up our training function. Our training function defines the training execution for a single hyperparameter configuration. For now we pull these hyperparameters from a config argument, but we’ll see later how this is passed in.

First we get our datasets. We tokenize the training and validation splits created above and wrap each in a PyTorch Dataset:

In [7]:
from transformers import GlueDataTrainingArguments as DataTrainingArguments
from transformers import GlueDataset
from transformers import BertTokenizerFast
import torch

def get_datasets(config):
    # Tokenize both splits, padding or truncating every example to 128 tokens.
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    train_encodings = tokenizer(train_queries, truncation=True, padding='max_length', max_length=128)
    val_encodings = tokenizer(val_queries, truncation=True, padding='max_length', max_length=128)

    class PropDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    train_dataset = PropDataset(train_encodings, train_labels)
    eval_dataset = PropDataset(val_encodings, val_labels)
    #DomSpec_Dataset = PropDataset(DomSpec_encodings, DomSpecTest[1].tolist())

    return train_dataset, eval_dataset

In [8]:
from transformers import BertTokenizerFast

model_name = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name)


train_encodings = tokenizer(train_queries, truncation=True, padding='max_length', max_length=128)
val_encodings = tokenizer(val_queries, truncation=True, padding='max_length', max_length=128)

In [9]:
import torch
#Code taken from https://towardsdatascience.com/fine-tuning-a-bert-model-with-transformers-c8e49c4e008b
class PropDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = PropDataset(train_encodings, train_labels)
eval_dataset = PropDataset(val_encodings, val_labels)
#DomSpec_Dataset = PropDataset(DomSpec_encodings, DomSpecTest[1].tolist())

### Checkpointing

We also need to add extra functionality for *checkpointing*. After every epoch of training, we need to save our training state. This is crucial for Population Based Training since it allows us to continue training from where we left off even when hyperparameters are perturbed. The Huggingface Trainer provides functionality to save and load from a checkpoint, but we do have to make some modifications to integrate this with Ray Tune checkpointing and to checkpoint after every epoch. The first step is to subclass the Trainer from the transformers library. Ray Tune provides this [TuneTransformerTrainer](https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_transformers/trainer.py) subclass, which we utilize. Take a look at the class: it handles reporting evaluation metrics to Tune, checkpointing every time evaluate is called, and even a way to pass in custom W&B arguments.

In [10]:
import logging
import os
from typing import Dict, Optional, Tuple

from ray import tune

import transformers
from transformers.file_utils import is_torch_tpu_available
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, is_wandb_available

import torch
from torch.utils.data import Dataset

if is_wandb_available():
  import wandb

class TuneTransformerTrainer(transformers.Trainer):
    def get_optimizers(self, num_training_steps):
        # Keep references to the optimizer and scheduler so save_state
        # can write them into the checkpoint.
        self.current_optimizer, self.current_scheduler = (
            super().get_optimizers(num_training_steps))
        return (self.current_optimizer, self.current_scheduler)

    def evaluate(self, eval_dataset=None):
        eval_dataloader = self.get_eval_dataloader(eval_dataset)
        output = self._prediction_loop(
            eval_dataloader, description="Evaluation")
        self._log(output.metrics)

        self.save_state()

        tune.report(**output.metrics)

        return output.metrics

    def save_state(self):
        with tune.checkpoint_dir(step=self.global_step) as checkpoint_dir:
            self.args.output_dir = checkpoint_dir
            # This is the directory name that Huggingface requires.
            output_dir = os.path.join(
                self.args.output_dir,
                f"{PREFIX_CHECKPOINT_DIR}-{self.global_step}")
            self.save_model(output_dir)
            if self.is_world_master():
                torch.save(self.current_optimizer.state_dict(),
                           os.path.join(output_dir, "optimizer.pt"))
                torch.save(self.current_scheduler.state_dict(),
                           os.path.join(output_dir, "scheduler.pt"))

The only addition we have to make is a function to recover the checkpoint file from Tune's checkpoint directory:

In [11]:
def recover_checkpoint(tune_checkpoint_dir, model_name=None):
    if tune_checkpoint_dir is None or len(tune_checkpoint_dir) == 0:
        return model_name
    # Get subdirectory used for Huggingface.
    subdirs = [
        os.path.join(tune_checkpoint_dir, name)
        for name in os.listdir(tune_checkpoint_dir)
        if os.path.isdir(os.path.join(tune_checkpoint_dir, name))
    ]
    # There should only be 1 subdir.
    assert len(subdirs) == 1, subdirs
    return subdirs[0]
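As a quick illustration of the two cases recover_checkpoint handles, the sketch below repeats the function so it runs on its own and calls it once without a checkpoint and once with a temporary directory standing in for Tune's checkpoint directory. The temporary directory and the checkpoint-890 folder name are stand-ins chosen for the example, not paths Tune is guaranteed to produce.

```python
import os
import tempfile

def recover_checkpoint(tune_checkpoint_dir, model_name=None):
    # Same logic as the function above, repeated so this sketch is self-contained.
    if tune_checkpoint_dir is None or len(tune_checkpoint_dir) == 0:
        return model_name
    subdirs = [
        os.path.join(tune_checkpoint_dir, name)
        for name in os.listdir(tune_checkpoint_dir)
        if os.path.isdir(os.path.join(tune_checkpoint_dir, name))
    ]
    assert len(subdirs) == 1, subdirs
    return subdirs[0]

# First start of a trial: no checkpoint yet, so we fall back to the model name.
fresh_start = recover_checkpoint(None, "bert-base-uncased")

# Resumed trial: the directory holds a single "checkpoint-<step>" folder.
with tempfile.TemporaryDirectory() as tune_dir:  # stand-in for Tune's directory
    os.mkdir(os.path.join(tune_dir, "checkpoint-890"))
    resumed = recover_checkpoint(tune_dir, "bert-base-uncased")
    resumed_name = os.path.basename(resumed)

print(fresh_start)
print(resumed_name)
```

Either return value can be handed straight to from_pretrained, which is why the training function can treat a fresh start and a resumed trial the same way.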

Finally, we put all of these together as well as create our training arguments, model, and Huggingface Trainer:

In [12]:
from transformers import AutoConfig, TrainingArguments, glue_tasks_num_labels
from ray.tune.integration.wandb import wandb_mixin



def train_transformer(config, checkpoint_dir=None):
  train_dataset, eval_dataset = get_datasets(config)

  training_args = TrainingArguments(
        output_dir=tune.get_trial_dir(),
        learning_rate=config["learning_rate"],
        do_train=True,
        do_eval=True,
        evaluate_during_training=True,
        # Run eval after every epoch.
        eval_steps=(len(train_dataset) // config["per_gpu_train_batch_size"]) +
        1,
        # We explicitly set save to 0, and do checkpointing in evaluate instead
        save_steps=0,
        num_train_epochs=config["num_epochs"],
        max_steps=config["max_steps"],
        per_device_train_batch_size=config["per_gpu_train_batch_size"],
        per_device_eval_batch_size=config["per_gpu_val_batch_size"],
        warmup_steps=0,
        weight_decay=config["weight_decay"],
        logging_dir="./logs",
    )

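  model_name_or_path = recover_checkpoint(checkpoint_dir, config["model_name"])
  # Our data has 3 classes (fact, policy, value), so we set num_labels
  # directly instead of looking it up in glue_tasks_num_labels.
  num_labels = 3

  # Use a separate name so the hyperparameter dict `config` is not shadowed.
  model_config = AutoConfig.from_pretrained(
        model_name_or_path,
        num_labels=num_labels,
        finetuning_task=config["task_name"],
    )
  model = AutoModelForSequenceClassification.from_pretrained(
        model_name_or_path,
        config=model_config,
    )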
   
  # Use our modified TuneTransformerTrainer
  tune_trainer = TuneTransformerTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=eval_dataset,
      compute_metrics=utils.build_compute_metrics_fn(task_name),
  )
  tune_trainer.train(model_name_or_path)

2022-01-26 16:33:17,895	ERROR wandb.py:14 -- pip install 'wandb' to use WandbLogger/WandbTrainableMixin.


Our training function takes in 2 parameters: config, which contains all of our hyperparameters, and checkpoint_dir, which is a directory containing the previous state of our trial. As we'll see below, these 2 arguments are passed in to our training function by Tune.
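The eval_steps expression in the training function is meant to trigger one evaluation (and therefore one checkpoint) per epoch. The sketch below works through that arithmetic; the dataset sizes are hypothetical, with 2838 picked only because it gives 89 steps at batch size 32, the same count as the 89 iterations per epoch in the tuning log further down.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # Optimizer steps in one pass over the data (the last batch may be partial).
    return math.ceil(num_examples / batch_size)

def eval_steps_from_config(num_examples, batch_size):
    # The expression used for eval_steps in train_transformer.
    return (num_examples // batch_size) + 1

# Hypothetical dataset sizes, chosen only to illustrate the arithmetic.
rows = []
for num_examples in (2838, 2848):
    for batch_size in (16, 32):
        rows.append((num_examples, batch_size,
                     steps_per_epoch(num_examples, batch_size),
                     eval_steps_from_config(num_examples, batch_size)))

for row in rows:
    print("examples=%d batch=%d steps/epoch=%d eval_steps=%d" % row)
```

The two quantities agree whenever the dataset size is not a multiple of the batch size. When it is an exact multiple (2848 here), eval_steps comes out one step larger than an epoch, so evaluation would drift slightly past each epoch boundary.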


## Hyperparameter Tuning with Ray Tune

Now that we have our training function set up, we run our hyperparameter search with Ray Tune. We first create an initial hyperparameter configuration which specifies the hyperparameters each trial will use initially. For some of our hyperparameters, we want to try different configurations, so we sample those from a distribution.

The W&B arguments also go here; they are commented out below since using W&B is optional.

In [13]:
config = {
        # These 3 configs below were defined earlier
        "model_name": model_name,
        "task_name": task_name,
        "data_dir": task_data_dir,
        "per_gpu_val_batch_size": 32,
        "per_gpu_train_batch_size": tune.choice([16, 32, 64]),
        "learning_rate": tune.uniform(1e-5, 5e-5),#3.805421441140277e-05,
        "weight_decay": tune.uniform(0.0, 0.3),#'0.2625809350807634
        "num_epochs": tune.choice([2, 3, 4, 5]),#, 6, 7, 8, 9, 10]),
        "max_steps": -1,  # We use num_epochs instead.
        #"wandb": {
        #    "project": "pbt_transformers",
        #    "reinit": True,
        #    "allow_val_change": True
        #}
    }
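To show what sampling from these distributions produces, the sketch below draws starting configurations from the same ranges using only the standard library. It imitates the search space with random.Random rather than calling Ray Tune, so sample_initial_config, the seed, and the population size of 2 are assumptions for illustration only.

```python
import random

def sample_initial_config(rng):
    """Draw one starting configuration from the ranges in the search space."""
    return {
        "per_gpu_train_batch_size": rng.choice([16, 32, 64]),
        "learning_rate": rng.uniform(1e-5, 5e-5),
        "weight_decay": rng.uniform(0.0, 0.3),
        "num_epochs": rng.choice([2, 3, 4, 5]),
    }

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
population = [sample_initial_config(rng) for _ in range(2)]  # one per trial

for trial_config in population:
    print(trial_config)
```

Each trial starts from its own draw, so the population begins spread across the search space before the scheduler starts moving trials toward the better-performing regions.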

Now we can set up our Population Based Training scheduler:

In [14]:
from ray.tune.schedulers import PopulationBasedTraining

scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="eval_acc",
        mode="max",
        perturbation_interval=2,
        hyperparam_mutations={
            #"weight_decay": lambda: tune.uniform(0.0, 0.3).func(None),#0.2625809350807634,
            #"learning_rate": lambda: tune.uniform(1e-5, 5e-5).func(None),#3.805421441140277e-05, 
            "per_gpu_train_batch_size": [16, 32],#, 64],
        })

We also create a CLI reporter to view our results from the command line. We specify the hyperparameters we want to see from the command line, as well as what metrics we want to see. The metrics are the inputs to the tune.report call we make in TuneTransformerTrainer.evaluate.

In [15]:
from ray.tune import CLIReporter

reporter = CLIReporter(
        parameter_columns={
            "weight_decay": "w_decay",
            "learning_rate": "lr",
            "per_gpu_train_batch_size": "train_bs/gpu",
            "num_epochs": "num_epochs"
        },
        metric_columns=[
            "eval_acc", "eval_loss", "epoch", "training_iteration"
        ])

Finally, we pass in our training function, config, PBT scheduler, and reporter to tune:

In [16]:
analysis = tune.run(
        train_transformer,
        resources_per_trial={
            "cpu": 1,
            "gpu": 1
        },
        config=config,
        num_samples=2,
        scheduler=scheduler,
        keep_checkpoints_num=3,
        checkpoint_score_attr="training_iteration",
        progress_reporter=reporter,
        local_dir="./ray_results/",
        name="tune_transformer_pbt")

== Status ==
Memory usage on this node: 2.5/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 PENDING, 1 RUNNING)
+-------------------------------+----------+-------+-----------+-------------+----------------+--------------+
| Trial name                    | status   | loc   |   w_decay |          lr |   train_bs/gpu |   num_epochs |
|-------------------------------+----------+-------+-----------+-------------+----------------+--------------|
| train_transformer_b530c_00000 | RUNNING  |       |  0.232746 | 2.67354e-05 |             32 |            2 |
| train_transformer_b530c_00001 | PENDING  |       |  0.182059 | 2.61335e-05 |             32 |            3 |
+-------------------------------+----------+-------+-----------+-------------+----------------+--------------+




(pid=242) Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
(pid=242) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
(pid=242) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(pid=242) Som

Result for train_transformer_b530c_00000:
  date: 2022-01-26_16-35-06
  done: false
  epoch: 1.0
  eval_acc: 0.8266384778012685
  eval_loss: 0.4484061419963837
  experiment_id: 29e768285e994833856e9aff4c6f1934
  experiment_tag: 0_learning_rate=2.6735e-05,num_epochs=2,per_gpu_train_batch_size=32,weight_decay=0.23275
  hostname: c78dbfef2e9a
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  pid: 242
  should_checkpoint: true
  time_since_restore: 86.68472099304199
  time_this_iter_s: 86.68472099304199
  time_total_s: 86.68472099304199
  timestamp: 1643214906
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b530c_00000
  
== Status ==
Memory usage on this node: 5.1/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 PENDING, 1 RUNNING)
+-------------------------------+----------+-----

Result for train_transformer_b530c_00000:
  date: 2022-01-26_16-36-17
  done: false
  epoch: 2.0
  eval_acc: 0.8414376321353065
  eval_loss: 0.40894569307565687
  experiment_id: 29e768285e994833856e9aff4c6f1934
  experiment_tag: 0_learning_rate=2.6735e-05,num_epochs=2,per_gpu_train_batch_size=32,weight_decay=0.23275
  hostname: c78dbfef2e9a
  iterations_since_restore: 2
  node_ip: 172.28.0.2
  pid: 242
  should_checkpoint: true
  time_since_restore: 157.96877646446228
  time_this_iter_s: 71.28405547142029
  time_total_s: 157.96877646446228
  timestamp: 1643214977
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: b530c_00000
  
== Status ==
Memory usage on this node: 5.8/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 PENDING, 1 RUNNING)
+-------------------------------+----------+--

(pid=242) tcmalloc: large alloc 1313996800 bytes == 0x5647b5fbc000 @ ...
(pid=242) tcmalloc: large alloc 1971019776 bytes == 0x5648044dc000 @ ...

Result for train_transformer_b530c_00001:
  date: 2022-01-26_16-37-53
  done: false
  epoch: 1.0
  eval_acc: 0.8160676532769556
  eval_loss: 0.4691110769907633
  experiment_id: 2fa8a5643c554a3ab63c0a0a3ad10e59
  experiment_tag: 1_learning_rate=2.6133e-05,num_epochs=3,per_gpu_train_batch_size=32,weight_decay=0.18206
  hostname: c78dbfef2e9a
  iterations_since_restore: 1
  node_ip: 172.28.0.2
  pid: 243
  should_checkpoint: true
  time_since_restore: 80.53484034538269
  time_this_iter_s: 80.53484034538269
  time_total_s: 80.53484034538269
  timestamp: 1643215073
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: b530c_00001
  
== Status ==
Memory usage on this node: 6.3/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 PAUSED, 1 RUNNING)
+-------------------------------+----------+------

Result for train_transformer_b530c_00001:
  date: 2022-01-26_16-39-04
  done: false
  epoch: 2.0
  eval_acc: 0.8329809725158562
  eval_loss: 0.4383532017469406
  experiment_id: 2fa8a5643c554a3ab63c0a0a3ad10e59
  experiment_tag: 1_learning_rate=2.6133e-05,num_epochs=3,per_gpu_train_batch_size=32,weight_decay=0.18206
  hostname: c78dbfef2e9a
  iterations_since_restore: 2
  node_ip: 172.28.0.2
  pid: 243
  should_checkpoint: true
  time_since_restore: 151.7558581829071
  time_this_iter_s: 71.22101783752441
  time_total_s: 151.7558581829071
  timestamp: 1643215144
  timesteps_since_restore: 0
  training_iteration: 2
  trial_id: b530c_00001
  
== Status ==
Memory usage on this node: 7.1/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 PAUSED, 1 RUNNING)
+-------------------------------+----------+------

2022-01-26 16:39:08,407	INFO (unknown file):0 -- gc.collect() freed 102 refs in 0.3741960749999862 seconds
(pid=243) tcmalloc: large alloc 1313996800 bytes == 0x560f9cfd4000 @ ...
(pid=243) tcmalloc: large alloc 1971019776 bytes == 0x560feb4f4000 @ ...

== Status ==
Memory usage on this node: 5.3/12.7 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/2 CPUs, 1/1 GPUs, 0.0/6.98 GiB heap, 0.0/2.39 GiB objects (0/1.0 GPUType:T4)
Result logdir: /content/ray_results/tune_transformer_pbt
Number of trials: 2 (1 ERROR, 1 PAUSED)
+-------------------------------+----------+-------+-----------+-------------+----------------+--------------+------------+-------------+---------+----------------------+
| Trial name                    | status   | loc   |   w_decay |          lr |   train_bs/gpu |   num_epochs |   eval_acc |   eval_loss |   epoch |   training_iteration |
|-------------------------------+----------+-------+-----------+-------------+----------------+--------------+------------+-------------+---------+----------------------|
| train_transformer_b530c_00000 | ERROR    |       |  0.232746 | 2.67354e-05 |             32 |            2 |   0.841438 |    0.408946 |       2 |                    2 |
| train_transfo

TuneError: ignored

Let’s dive deeper into what’s going on here. Initially, Tune creates 2 (from num_samples) trials, or instantiations of our training function. Each trial has a hyperparameter configuration sampled from config, so we have 2 different executions of transformer fine-tuning, each with different hyperparameters. Given enough GPUs these run in parallel; on Colab's single GPU, Tune runs one trial at a time, which is why the status tables above show one trial RUNNING while the other is PENDING or PAUSED.

We also pass in a PBT scheduler, with time_attr set to training_iteration and perturbation_interval set to 2, so PBT comes into effect after 2 training iterations. The bottom 25% of trials according to eval_acc exploit the top 25% of trials by copying over their model weights and hyperparameters. After copying over, these trials explore by mutating the hyperparameters specified in hyperparam_mutations. This is where checkpointing becomes crucial: this process effectively creates a new trial, so we need a checkpoint to continue training where we left off, except with the new hyperparameters. The process repeats at every perturbation interval, and instead of randomly searching across our entire hyperparameter space, we can focus on the best performing trials and do a more fine-grained search in that smaller area.
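The exploit and explore steps described here can be sketched in a few lines of plain Python. This is a simplified illustration, not Ray Tune's scheduler: pbt_step, the dictionary layout of a trial, and the rule of resampling uniformly from the listed choices are assumptions, and Ray's actual perturbation rules may differ. The scores are rounded from the iteration-2 eval_acc values in the log above.

```python
import copy
import math
import random

def pbt_step(population, mutations, rng, quantile=0.25):
    """Run one simplified exploit/explore round in place and return the population."""
    ranked = sorted(population, key=lambda trial: trial["score"], reverse=True)
    cutoff = math.ceil(len(ranked) * quantile)
    top, bottom = ranked[:cutoff], ranked[len(ranked) - cutoff:]
    for loser in bottom:
        if any(loser is winner for winner in top):
            continue  # population too small to separate top from bottom
        winner = rng.choice(top)
        # Exploit: clone the better trial's checkpoint and hyperparameters.
        loser["weights"] = copy.deepcopy(winner["weights"])
        loser["config"] = dict(winner["config"])
        # Explore: resample only the hyperparameters listed in `mutations`.
        for name, choices in mutations.items():
            loser["config"][name] = rng.choice(choices)
    return population

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
population = [
    {"name": "trial_0", "score": 0.841, "weights": "checkpoint of trial_0",
     "config": {"per_gpu_train_batch_size": 32, "learning_rate": 2.67e-05}},
    {"name": "trial_1", "score": 0.833, "weights": "checkpoint of trial_1",
     "config": {"per_gpu_train_batch_size": 32, "learning_rate": 2.61e-05}},
]

pbt_step(population, {"per_gpu_train_batch_size": [16, 32]}, rng)
for trial in population:
    print(trial["name"], trial["weights"], trial["config"])
```

After the step, trial_1 carries trial_0's checkpoint and learning rate, and only its batch size has been redrawn, because that is the only key listed in the mutations. The stronger trial is left untouched.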

## Testing the Best Model

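Once our hyperparameter tuning experiment is complete, we can get the best performing model and evaluate it. Note that the cell below reuses the validation split (eval_dataset) as the test set, so the reported accuracy comes from the same data that guided the hyperparameter search.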

In [None]:
data_args = DataTrainingArguments(task_name=config["task_name"], data_dir=config["data_dir"])

tokenizer = AutoTokenizer.from_pretrained(config["model_name"])

# Look up the best configuration and its checkpoint by evaluation accuracy
best_config = analysis.get_best_config(metric="eval_acc", mode="max")
print(best_config)
best_checkpoint = recover_checkpoint(
    analysis.get_best_trial(metric="eval_acc",
                            mode="max").checkpoint.value)
print(best_checkpoint)
# Load the fine-tuned weights from that checkpoint onto the GPU
best_model = AutoModelForSequenceClassification.from_pretrained(
    best_checkpoint).to("cuda")

test_args = TrainingArguments(output_dir="./best_model_results")
#test_dataset = GlueDataset(
    #data_args, tokenizer=tokenizer, mode="dev", cache_dir=data_dir)
# Evaluate on the eval_dataset defined earlier
test_dataset = eval_dataset

test_trainer = transformers.Trainer(
    best_model,
    test_args,
    compute_metrics=utils.build_compute_metrics_fn(task_name))

metrics = test_trainer.evaluate(test_dataset)
print(metrics)

{'model_name': 'bert-base-uncased', 'task_name': 'rte', 'data_dir': '/content/data/RTE', 'per_gpu_val_batch_size': 32, 'per_gpu_train_batch_size': 16, 'learning_rate': 3.805421441140277e-05, 'weight_decay': 0.2625809350807634, 'num_epochs': 5, 'max_steps': -1}
/content/ray_results/tune_transformer_pbt/train_transformer_0_per_gpu_train_batch_size=16_2021-11-08_11-45-51ky7l_ixa/checkpoint_890/checkpoint-890


Evaluation:   0%|          | 0/119 [00:00<?, ?it/s]

{'eval_loss': 0.8576692075863825, 'eval_acc': 0.8403805496828752}


In [None]:
config = {
    # These 3 configs below were defined earlier
    "model_name": model_name,
    "task_name": task_name,
    "data_dir": task_data_dir,
    "per_gpu_val_batch_size": 32,
    "per_gpu_train_batch_size": tune.choice([16, 32, 64]),
    "learning_rate": 3.805421441140277e-05,
    "weight_decay": 0.2625809350807634,
    "num_epochs": 5,
    "max_steps": -1,  # We use num_epochs instead.
    #"wandb": {
    #    "project": "pbt_transformers",
    #    "reinit": True,
    #    "allow_val_change": True
    #}
}

In [None]:
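import numpy as np
from sklearn.metrics import classification_report

# predict() returns (predictions, label_ids, metrics); we only need the raw scores
model_predicted, _, _ = test_trainer.predict(eval_dataset)
# The predicted class is the index of the highest score in each row
ypred = np.argmax(model_predicted, axis=1)
print(classification_report(val_labels, ypred))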

Prediction:   0%|          | 0/119 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.72      0.61      0.66       196
           1       0.93      0.89      0.91       204
           2       0.84      0.90      0.87       546

    accuracy                           0.84       946
   macro avg       0.83      0.80      0.82       946
weighted avg       0.84      0.84      0.84       946



In [None]:
# Print the confusion matrix (rows: true labels, columns: predicted labels)
from sklearn.metrics import confusion_matrix

confusion_matrix(val_labels, ypred)

array([[130,   6,  60],
       [  2, 185,  17],
       [ 32,  17, 497]])
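As a small sketch of how to read this matrix, per-class recall is the diagonal divided by the row sums, and per-class precision is the diagonal divided by the column sums. The snippet below hard-codes the matrix printed above, so the numbers describe that matrix only and may differ from a classification report produced by a separate run.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels).
cm = np.array([[130, 6, 60],
               [2, 185, 17],
               [32, 17, 497]])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / all true members of the class
precision = true_positives / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = true_positives.sum() / cm.sum()

for label in range(cm.shape[0]):
    print(f"class {label}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading along row 0, most of the errors for class 0 are examples that were predicted as class 2.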

In [None]:
untokenized_features = training_data[0]
train_untokenized, test_untokenized, train_labels, test_labels = train_test_split(untokenized_features, training_data[1].tolist(), stratify=training_data[1].tolist())

list_of_statements = list(val_queries)
print(list_of_statements)



In [None]:
list_of_true_labels = [str(label) for label in val_labels]
print(list_of_true_labels)

['1', '2', '2', '0', '1', '2', '0', '1', '2', '2', '2', '1', '2', '1', '1', '2', '2', '0', '2', '1', '2', '2', '0', '2', '0', '2', '2', '1', '2', '0', '1', '2', '2', '1', '0', '0', '0', '2', '1', '2', '2', '1', '2', '2', '0', '2', '2', '1', '1', '2', '0', '1', '0', '0', '0', '1', '1', '2', '2', '2', '2', '2', '2', '0', '1', '2', '2', '2', '2', '0', '1', '0', '1', '0', '2', '2', '2', '0', '1', '0', '2', '0', '2', '0', '1', '1', '1', '1', '2', '2', '1', '1', '1', '1', '2', '2', '2', '1', '2', '2', '2', '2', '2', '1', '0', '0', '0', '2', '0', '2', '0', '2', '2', '2', '1', '2', '2', '2', '2', '2', '2', '1', '0', '2', '2', '0', '0', '0', '0', '0', '0', '1', '0', '1', '1', '1', '2', '2', '0', '2', '2', '2', '2', '2', '0', '2', '2', '0', '2', '2', '2', '1', '2', '2', '2', '2', '2', '1', '1', '0', '2', '0', '0', '1', '2', '0', '2', '2', '2', '2', '2', '2', '0', '2', '2', '0', '2', '2', '1', '2', '0', '2', '2', '2', '0', '2', '2', '2', '1', '2', '2', '2', '2', '0', '2', '2', '2', '2', '2', '1',

In [None]:
list_of_predicted_labels = [str(label) for label in ypred]
print(list_of_predicted_labels)

['1', '2', '2', '2', '1', '2', '0', '0', '2', '2', '2', '1', '2', '1', '1', '2', '2', '0', '2', '1', '2', '0', '0', '2', '0', '2', '2', '2', '2', '0', '1', '2', '2', '2', '0', '2', '2', '2', '2', '2', '2', '1', '2', '2', '0', '2', '2', '1', '1', '0', '2', '1', '0', '0', '2', '1', '1', '2', '2', '2', '2', '2', '2', '0', '1', '2', '2', '2', '2', '0', '1', '2', '1', '1', '2', '2', '0', '2', '1', '0', '2', '0', '2', '2', '1', '1', '1', '1', '2', '2', '1', '1', '1', '1', '2', '2', '2', '1', '2', '2', '2', '2', '2', '1', '0', '0', '2', '2', '0', '2', '0', '2', '2', '2', '1', '2', '2', '2', '1', '2', '0', '1', '2', '2', '2', '0', '2', '0', '0', '0', '0', '1', '0', '1', '1', '1', '2', '2', '2', '2', '2', '0', '2', '2', '0', '2', '2', '2', '2', '2', '2', '1', '0', '2', '2', '2', '2', '1', '1', '0', '2', '0', '2', '1', '2', '0', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '2', '1', '2', '0', '0', '2', '2', '2', '2', '2', '2', '1', '0', '2', '2', '2', '0', '2', '2', '2', '0', '2', '1',

In [None]:
# One line per example: "statement", true label, predicted label
error_analysis_list = [
    f'"{statement}", {true_label}, {predicted_label}'
    for statement, true_label, predicted_label in zip(
        list_of_statements, list_of_true_labels, list_of_predicted_labels)
]
print(error_analysis_list)



In [None]:
# The with-block closes the file on exit; a bare `f.close` (without
# parentheses) only references the method and never calls it.
with open('BfSC_error_analysis_quotes.txt', 'a') as f:
    for statements in error_analysis_list:
        f.write(statements + '\n')
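Building each line by hand works as long as the statements contain no double quotes or commas. As an alternative sketch, Python's `csv` module handles the quoting for us. The helper name `write_error_analysis`, the output file name and the toy rows below are hypothetical, and unlike the cell above this version overwrites the file instead of appending to it.

```python
import csv
import os
import tempfile


def write_error_analysis(path, statements, true_labels, predicted_labels):
    """Write one CSV row per example: statement, true label, predicted label."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields containing commas or quotes
        writer.writerow(["statement", "true_label", "predicted_label"])
        writer.writerows(zip(statements, true_labels, predicted_labels))


# Toy rows for illustration only (not taken from the validation set).
statements = ['He said "ban it", then left.', "Fertilizers are overused."]
true_labels = [1, 2]
predicted_labels = [1, 0]

out_path = os.path.join(tempfile.mkdtemp(), "error_analysis_example.csv")
write_error_analysis(out_path, statements, true_labels, predicted_labels)

# Reading the file back recovers the original statements unchanged.
with open(out_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

In the notebook, the same helper could be called with list_of_statements, list_of_true_labels and list_of_predicted_labels.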

In [None]:
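# Colab helper that opens a file picker and uploads the chosen files
from google.colab import files

DomSpecDataset = files.upload()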

In [None]:
import numpy as np
import pandas as pd
DomSpecTest = pd.read_csv('DomainSpecTest.csv', header=None)#, names= ['text', 'label'])
DomSpecTest.head()

|    | 0                                                  | 1      |
|----|----------------------------------------------------|--------|
| 0  | Agricultural runoff that contains antibiotics ...  | fact   |
| 1  | Fertilizers are overused.                          | value  |
| 2  | Overuse of pesticides is the most significant ...  | fact   |
| 3  | Pesticides that kill bees should be banned.        | policy |
| 4  | Animal feed should not include antibiotics.        | policy |


In [None]:
# Map the string labels to the integer class ids used by the model
DomSpecTest[1] = DomSpecTest[1].replace({'fact': 0, 'value': 2, 'policy': 1})
DomSpecTest.head()

|    | 0                                                  | 1 |
|----|----------------------------------------------------|---|
| 0  | Agricultural runoff that contains antibiotics ...  | 0 |
| 1  | Fertilizers are overused.                          | 2 |
| 2  | Overuse of pesticides is the most significant ...  | 0 |
| 3  | Pesticides that kill bees should be banned.        | 1 |
| 4  | Animal feed should not include antibiotics.        | 1 |


In [None]:
# Tokenize each statement into BERT input ids
tokenized = DomSpecTest[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

# Pad every sequence to the length of the longest one (0 is BERT's [PAD] id)
max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

# Attention mask with the same shape as padded: 1 for real tokens, 0 for padding
attention_mask = np.where(padded != 0, 1, 0)

# Run the data through the fine-tuned encoder
input_ids = torch.LongTensor(padded).to("cuda")
attention_mask = torch.tensor(attention_mask).to("cuda")

with torch.no_grad():
    # best_model is a sequence-classification model, so its first output is a
    # 2-D logits tensor and indexing it with [:, 0, :] raises an IndexError.
    # Calling the underlying encoder (base_model) returns the hidden states.
    last_hidden_states = best_model.base_model(input_ids, attention_mask=attention_mask)

# Use the first-token ([CLS]) hidden state as the features of each statement;
# the tensor must be moved to the CPU before converting it to NumPy
dom_spec_features = last_hidden_states[0][:, 0, :].cpu().numpy()
dom_spec_labels = DomSpecTest[1]

In [None]:
print(DomSpec_encodings)

{'input_ids': [[101, 4910, 19550, 2008, 3397, 24479, 2003, 4852, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 10768, 28228, 28863, 2015, 2024, 2058, 13901, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 2058, 8557, 1997, 20739, 22698, 2003, 1996, 2087, 3278, 3426, 1997, 10506, 5701, 7859, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
model_predicted, _, _ = test_trainer.predict(DomSpec_Dataset)
ypred = np.argmax(model_predicted, axis=1)
print(classification_report(DomSpecTest[1].tolist(), ypred))

Prediction:   0%|          | 0/24 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       1.00      0.12      0.22        64
           1       0.92      1.00      0.96        87
           2       0.34      0.82      0.48        34

    accuracy                           0.66       185
   macro avg       0.75      0.65      0.55       185
weighted avg       0.84      0.66      0.62       185
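To see where the two average rows come from, the sketch below recomputes them from the per-class rows of the report above: the macro average is a plain mean over the classes, while the weighted average weights each class by its support. Because it starts from values already rounded to two decimals, it reproduces the averages only approximately.

```python
import numpy as np

# Per-class values copied from the report above (classes 0, 1, 2).
precision = np.array([1.00, 0.92, 0.34])
recall = np.array([0.12, 1.00, 0.82])
f1 = np.array([0.22, 0.96, 0.48])
support = np.array([64, 87, 34])

for name, values in [("precision", precision), ("recall", recall), ("f1-score", f1)]:
    macro = values.mean()                            # every class counts equally
    weighted = np.average(values, weights=support)   # classes weighted by support
    print(f"{name:>9}: macro avg={macro:.2f} weighted avg={weighted:.2f}")
```

The gap between the macro and weighted F1 scores comes mostly from class 1, which has the largest support and by far the best per-class scores, so it pulls the weighted average up.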

