## How to add new AL query strategies / unlabeled pool subsampling strategies

This notebook shows how to benchmark a new active learning (AL) query strategy or a new unlabeled pool subsampling strategy in three steps.

#### 1. Prepare the file and the global variable

In [1]:
# Folder that holds the strategy files
FOLDER_WITH_STRATEGIES = "custom_strategy"
# -p: do not fail if the folder already exists
!mkdir -p $FOLDER_WITH_STRATEGIES
# File name of the AL query strategy
AL_STRATEGY_NAME = "least_confidence.py"
# File name of the unlabeled pool subsampling strategy
SUBSAMPLING_STRATEGY_NAME = "top_from_previous_iteration_subsampling.py"
CUR_PATH = !pwd
# Absolute path to the AL query strategy
PATH_TO_AL_STRATEGY = f"{CUR_PATH[0]}/{FOLDER_WITH_STRATEGIES}/{AL_STRATEGY_NAME}"
# Absolute path to the subsampling strategy
PATH_TO_SUBSAMPLING_STRATEGY = (
    f"{CUR_PATH[0]}/{FOLDER_WITH_STRATEGIES}/{SUBSAMPLING_STRATEGY_NAME}"
)


#### 2. Write your strategies

In [2]:
%%writefile $PATH_TO_AL_STRATEGY

import numpy as np

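def least_confidence(model, X_pool, n_instances, **kwargs):
    # Predicted class probabilities for every instance in the pool
    probas = model.predict_proba(X_pool)
    # Least confidence: 1 minus the probability of the most likely class
    uncertainty_estimates = 1 - probas.max(axis=1)
    # Indices of the n_instances most uncertain instances, most uncertain first
    query_idx = np.argsort(-uncertainty_estimates)[:n_instances]
    # X_pool is expected to provide a .select(indices) method
    query = X_pool.select(query_idx)
    return query_idx, query, uncertainty_estimates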

Overwriting /home/atsvigun/active_learning/examples/custom_strategy/least_confidence.py
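Before launching a full benchmark, it can help to check the strategy on toy data. The sketch below is only an illustration: `ToyModel` and `ToyPool` are hypothetical stand-ins that mimic the two methods the strategy relies on (`predict_proba` and `select`). They are not the framework's real model wrapper or pool objects.

```python
import numpy as np


def least_confidence(model, X_pool, n_instances, **kwargs):
    probas = model.predict_proba(X_pool)
    uncertainty_estimates = 1 - probas.max(axis=1)
    query_idx = np.argsort(-uncertainty_estimates)[:n_instances]
    query = X_pool.select(query_idx)
    return query_idx, query, uncertainty_estimates


class ToyPool:
    """Hypothetical pool: a list of texts with a `select` method."""

    def __init__(self, texts):
        self.texts = list(texts)

    def select(self, indices):
        return ToyPool(self.texts[i] for i in indices)


class ToyModel:
    """Hypothetical model that returns fixed class probabilities."""

    def __init__(self, probas):
        self.probas = np.asarray(probas, dtype=float)

    def predict_proba(self, X_pool):
        return self.probas


pool = ToyPool(["a", "b", "c", "d"])
model = ToyModel([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.5, 0.5]])

query_idx, query, uncertainty = least_confidence(model, pool, n_instances=2)
print(query_idx.tolist())  # [3, 1]: the two least confident predictions
print(query.texts)         # ['d', 'b']
```

The instance with probabilities `[0.5, 0.5]` is queried first because its most likely class has the lowest probability.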


In [3]:
%%writefile $PATH_TO_SUBSAMPLING_STRATEGY

import numpy as np

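def top_from_previous_iteration_subsampling(uncertainty_estimates, gamma_or_k_confident_to_save, **kwargs):
    # A float is treated as a fraction of the pool and converted to a count;
    # an int is used directly as the number of instances to keep
    if isinstance(gamma_or_k_confident_to_save, float):
        gamma_or_k_confident_to_save = int(
            gamma_or_k_confident_to_save * len(uncertainty_estimates)
        )
    # Sort indices by decreasing uncertainty and keep the top ones
    argsort = np.argsort(-uncertainty_estimates)
    return argsort[:gamma_or_k_confident_to_save]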

Overwriting /home/atsvigun/active_learning/examples/custom_strategy/top_from_previous_iteration_subsampling.py
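The subsampling function interprets its second argument differently depending on its type. The following minimal sketch calls it on a made-up array of uncertainty estimates, first with a fraction and then with an absolute count. The input values are illustrative only.

```python
import numpy as np


def top_from_previous_iteration_subsampling(uncertainty_estimates, gamma_or_k_confident_to_save, **kwargs):
    if isinstance(gamma_or_k_confident_to_save, float):
        gamma_or_k_confident_to_save = int(
            gamma_or_k_confident_to_save * len(uncertainty_estimates)
        )
    argsort = np.argsort(-uncertainty_estimates)
    return argsort[:gamma_or_k_confident_to_save]


# Illustrative uncertainty estimates for a pool of four instances
uncertainty = np.array([0.10, 0.45, 0.30, 0.50])

# Float: keep the top 50% of the pool, i.e. int(0.5 * 4) = 2 instances
kept_by_fraction = top_from_previous_iteration_subsampling(uncertainty, 0.5)
# Int: keep exactly 3 instances
kept_by_count = top_from_previous_iteration_subsampling(uncertainty, 3)

print(kept_by_fraction.tolist())  # [3, 1]
print(kept_by_count.tolist())     # [3, 1, 2]
```

Note that `1.0` (a float) keeps the whole pool, while `1` (an int) keeps a single instance.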


#### 3. Use your strategies

- AL strategy: `config.al.strategy=$PATH_TO_AL_STRATEGY`

- Unlabeled pool subsampling strategy: `config.al.sampling_type=$PATH_TO_SUBSAMPLING_STRATEGY`
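The test command in this notebook passes each strategy as `folder/file_name` without the `.py` extension. Assuming that form, the override strings can be derived from the file names defined in step 1. `strategy_override` is a hypothetical helper written for this illustration, not part of the framework.

```python
from pathlib import PurePosixPath


def strategy_override(folder, file_name):
    """Return 'folder/stem', i.e. the file path without the '.py' suffix."""
    return str(PurePosixPath(folder) / PurePosixPath(file_name).stem)


FOLDER_WITH_STRATEGIES = "custom_strategy"
AL_STRATEGY_NAME = "least_confidence.py"
SUBSAMPLING_STRATEGY_NAME = "top_from_previous_iteration_subsampling.py"

overrides = [
    f"al.strategy={strategy_override(FOLDER_WITH_STRATEGIES, AL_STRATEGY_NAME)}",
    f"al.sampling_type={strategy_override(FOLDER_WITH_STRATEGIES, SUBSAMPLING_STRATEGY_NAME)}",
]
print(" ".join(overrides))
```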

Test with one GPU (replace `custom_strategy/least_confidence` and `custom_strategy/top_from_previous_iteration_subsampling` with the names of your own strategies):

In [4]:
%%bash
CUDA_VISIBLE_DEVICES='0' HYDRA_CONFIG_PATH=../acleto/al_benchmark/configs \
HYDRA_CONFIG_NAME=al_cls python ../scripts/run_active_learning.py \
al.strategy=custom_strategy/least_confidence \
al.sampling_type=custom_strategy/top_from_previous_iteration_subsampling \
acquisition_model.checkpoint=distilbert-base-uncased \
al.num_queries=2

[2022-09-14 14:10:22,670][root][INFO] - Work dir: /home/atsvigun/active_learning/examples/workdir/run_active_learning/2022-09-14/14-10-22_42_roberta_base_custom_strategy_least_confidence
[2022-09-14 14:10:39,668][root][INFO] - Successfully loaded BARTScore and SummaC models.


[2022-09-14 14:10:41,245][root][INFO] - output_dir: ./workdir/run_active_learning
[2022-09-14 14:10:41,245][root][INFO] - seed: 42
[2022-09-14 14:10:41,245][root][INFO] - cuda_device: 0
[2022-09-14 14:10:41,245][root][INFO] - cache_dir: ././workdir/run_active_learning/cache_42_roberta_base
[2022-09-14 14:10:41,245][root][INFO] - cache_model_and_dataset: False
[2022-09-14 14:10:41,246][root][INFO] - framework: transformers
[2022-09-14 14:10:41,246][root][INFO] - task: cls
[2022-09-14 14:10:41,246][root][INFO] - offline_mode: False
[2022-09-14 14:10:41,246][root][INFO] - data
[2022-09-14 14:10:41,246][root][INFO] - 	dataset_name: ag_news
[2022-09-14 14:10:41,246][root][INFO] - 	text_name: text
[2022-09-14 14:10:41,246][root][INFO] - 	label_name: label
[2022-09-14 14:10:41,246][root][INFO] - 	labels_to_remove: None
[2022-09-14 14:10:41,246][root][INFO] - 	path: datasets
[2022-09-14 14:10:41,246][root][INFO] - 	train_size_split: 0.9
[2022-09-14 14:10:41,247][root][INFO] - 	seed: 42
[2022-0

100%|██████████| 2/2 [00:00<00:00, 11.60it/s]


[2022-09-14 14:10:47,664][root][INFO] - Loaded train size: 120000
[2022-09-14 14:10:47,665][root][INFO] - Loaded dev size: 7600
[2022-09-14 14:10:47,665][root][INFO] - Dev dataset coincides with test dataset
[2022-09-14 14:10:48,033][root][INFO] - Seeding dataset size: 1200
[2022-09-14 14:10:48,033][root][INFO] - Pool size: 118800
[2022-09-14 14:10:48,039][root][INFO] - Done.
[2022-09-14 14:10:48,040][root][INFO] - Starting active learning...
[2022-09-14 14:10:48,053][root][INFO] - Constructing the acquisition model...


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /home/atsvigun/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.3",
  "type_vocab_size": 1,
  

[2022-09-14 14:10:57,070][root][INFO] - Done with constructing the acquisition model.
[2022-09-14 14:10:57,086][root][INFO] - Training dataset size: 1200
[2022-09-14 14:10:57,086][root][INFO] - Validation dataset size: 7600


loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /home/atsvigun/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "World",
    "1": "Sports",
    "2": "Business",
    "3": "Sci/Tech"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "Business": 2,
    "Sci/Tech": 3,
    "Sports": 1,
    "World": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "posi

[2022-09-14 14:11:06,499][root][INFO] - Load best at end: False


PyTorch: setting up devices


[2022-09-14 14:11:06,509][root][INFO] - TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=True,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=True,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_e



[2022-09-14 14:12:44,663][root][INFO] - Done with the model fit.
[2022-09-14 14:12:44,675][root][INFO] - ############### Evaluating the acquisition model. ###############


100%|██████████| 8/8 [00:00<00:00,  9.19ba/s]


[2022-09-14 14:12:52,906][root][INFO] - Initial AL iteration:
Acquisition model:
[2022-09-14 14:12:52,908][root][INFO] - test_loss: 0.5857174396514893
[2022-09-14 14:12:52,908][root][INFO] - test_accuracy: 0.8980263157894737
[2022-09-14 14:12:52,908][root][INFO] - test_f1_micro: 0.8980263157894738
[2022-09-14 14:12:52,908][root][INFO] - test_f1_macro: 0.8975508243070751
[2022-09-14 14:12:52,908][root][INFO] - test_f1_weighted: 0.897550824307075
[2022-09-14 14:12:52,908][root][INFO] - test_runtime: 7.1048
[2022-09-14 14:12:52,908][root][INFO] - test_samples_per_second: 1069.696
[2022-09-14 14:12:52,909][root][INFO] - test_steps_per_second: 10.697


AL queries done:   0%|          | 0/15 [00:00<?, ?it/s]

[2022-09-14 14:14:53,054][root][INFO] - Could not load query meta from /home/atsvigun/active_learning/examples/workdir/run_active_learning/2022-09-14/14-10-22_42_roberta_base_custom_strategy_least_confidence/query_meta.json: [Errno 2] No such file or directory: '/home/atsvigun/active_learning/examples/workdir/run_active_learning/2022-09-14/14-10-22_42_roberta_base_custom_strategy_least_confidence/query_meta.json'
[2022-09-14 14:14:53,054][root][INFO] - Query meta: []
[2022-09-14 14:14:53,054][root][INFO] - Dumping query meta to /home/atsvigun/active_learning/examples/workdir/run_active_learning/2022-09-14/14-10-22_42_roberta_base_custom_strategy_least_confidence/query_meta.json
[2022-09-14 14:14:53,058][root][INFO] - ### Uncertainties of the queries ###
[2022-09-14 14:14:53,058][root][INFO] - 0.66093, 0.64649, 0.64611, 0.64521, 0.64393, 0.64257, 0.62197, 0.62128, 0.61721, 0.61567, 0.61534, 0.60652, 0.60607, 0.60577, 0.6007, 0.60004, 0.59606, 0.58962, 0.5889, 0.58716, 0.58371, 0.58286, 




[2022-09-14 14:14:54,541][root][INFO] - AL iteration 1:
Acquisition_Evaluate_Query model:
[2022-09-14 14:14:54,543][root][INFO] - test_loss: 1.1151825189590454
[2022-09-14 14:14:54,543][root][INFO] - test_accuracy: 0.44083333333333335
[2022-09-14 14:14:54,543][root][INFO] - test_f1_micro: 0.44083333333333335
[2022-09-14 14:14:54,543][root][INFO] - test_f1_macro: 0.4222768305283723
[2022-09-14 14:14:54,543][root][INFO] - test_f1_weighted: 0.4523392182981729
[2022-09-14 14:14:54,543][root][INFO] - test_runtime: 1.1288
[2022-09-14 14:14:54,543][root][INFO] - test_samples_per_second: 1063.107
[2022-09-14 14:14:54,543][root][INFO] - test_steps_per_second: 10.631
[2022-09-14 14:14:54,554][root][INFO] - Training dataset size: 2400
[2022-09-14 14:14:54,554][root][INFO] - Validation dataset size: 7600





[2022-09-14 14:15:03,136][root][INFO] - Load best at end: False

[2022-09-14 14:17:53,845][root][INFO] - Done with the model fit.
[2022-09-14 14:17:53,857][root][INFO] - ############### Evaluating the acquisition model. ###############





[2022-09-14 14:18:02,011][root][INFO] - AL iteration 1:
Acquisition model:
[2022-09-14 14:18:02,015][root][INFO] - test_loss: 0.5024663209915161
[2022-09-14 14:18:02,017][root][INFO] - test_accuracy: 0.9232894736842105
[2022-09-14 14:18:02,018][root][INFO] - test_f1_micro: 0.9232894736842105
[2022-09-14 14:18:02,018][root][INFO] - test_f1_macro: 0.923027730225176
[2022-09-14 14:18:02,019][root][INFO] - test_f1_weighted: 0.923027730225176
[2022-09-14 14:18:02,019][root][INFO] - test_runtime: 7.0611
[2022-09-14 14:18:02,019][root][INFO] - test_samples_per_second: 1076.32
[2022-09-14 14:18:02,019][root][INFO] - test_steps_per_second: 10.763


AL queries done:   7%|▋         | 1/15 [05:09<1:12:08, 309.17s/it]


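The results are written to `workdir/run_active_learning/TODAY_DATE/TIME_SEED_MODEL/acquisition_metrics.json`, where `TODAY_DATE` and `TIME_SEED_MODEL` are placeholders for the run date and for the start time, seed, and model name of the run.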
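To inspect the results from Python, one option is to locate the most recent metrics file and load it with the standard `json` module. This is a minimal sketch: it assumes the two-level `DATE/RUN` folder layout given above, makes no assumption about the structure of the JSON content, and `latest_metrics_file` and `load_metrics` are hypothetical helper names.

```python
import json
from pathlib import Path


def latest_metrics_file(workdir="workdir/run_active_learning"):
    """Return the most recently modified acquisition_metrics.json, or None."""
    root = Path(workdir)
    if not root.is_dir():
        return None
    # Layout assumed from the path above: <workdir>/<date>/<run>/acquisition_metrics.json
    candidates = sorted(
        root.glob("*/*/acquisition_metrics.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_metrics(path):
    """Load the metrics file as plain Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


metrics_path = latest_metrics_file()
if metrics_path is None:
    print("No acquisition_metrics.json found yet.")
else:
    metrics = load_metrics(metrics_path)
    print(f"Loaded {metrics_path} ({type(metrics).__name__})")
```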