# End-to-end Training of BART-TL

This notebook will guide you through fine-tuning a BART-TL model according to this paper: https://www.aclweb.org/anthology/2021.eacl-main.121.pdf.

The models showcased there are already available on Huggingface if you need a quick way of generating labels:
- https://huggingface.co/cristian-popa/bart-tl-all
- https://huggingface.co/cristian-popa/bart-tl-ng

## 0. Preamble

We must first install the required Python packages:

In [16]:
!python3 -m pip install -r ../requirements.txt



## 1. Applying LDA on corpora

The first step is extracting topics from the StackExchange corpora. In this notebook we only experiment with the biology corpus. Using the other corpora is very similar.

In [54]:
# XML file used in the experiment
%env CORPUS_FILE=../corpus/biology.stackexchange.com/Posts.xml
# directory where the LDA models and info will be stored
%env LDA_INFO_PATH=../experiment/lda_info/
# directory where the topic data will be stored
%env TOPICS_PATH=../experiment/topics/
# directory where the NETL-extracted labels will be stored
%env NETL_LABELS_PATH=../experiment/netl_labels/

# directory where the dataset for fine-tuning BART will be stored
%env BART_DATASET_PATH=../experiment/dataset_fairseq/
# where the model will be saved
%env HUGGINGFACE_MODEL_SAVE_PATH=../experiment/bart-tl-all/

env: CORPUS_FILE=../corpus/biology.stackexchange.com/Posts.xml
env: LDA_INFO_PATH=../experiment/lda_info/
env: TOPICS_PATH=../experiment/topics/
env: NETL_LABELS_PATH=../experiment/netl_labels/
env: BART_DATASET_PATH=../experiment/dataset_fairseq/
env: HUGGINGFACE_MODEL_SAVE_PATH=../experiment/bart-tl-all/


In [20]:
!mkdir -p ${LDA_INFO_PATH}
!mkdir -p ${TOPICS_PATH}
!mkdir -p ${NETL_LABELS_PATH}

Time to run the script that applies LDA on the biology corpus from StackExchange:

In [17]:
!python3 ../lda/apply_lda.py \
    --input-file ${CORPUS_FILE} \
    --output-prefix ${LDA_INFO_PATH}/biology \
    --topics-prefix ${TOPICS_PATH}/biology

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Done pre-processing documents
Done pre-processing corpus


After running the script, the `../experiment/lda_info` directory should contain the following files:
- `biology.dct`
- `biology.model`
- `biology.model.expElogbeta.npy`
- `biology.model.id2word`
- `biology.model.state`
- `biology_corpus.pickle`

These files are needed for extracting the noun phrases in step 3. They are also saved so that you can investigate LDA-related issues in your experiments.

The `../experiment/topics` directory will contain:
- `biology.csv`
- `biology.json`
- `biology_sentences.txt`
- `biology_sentences_raw.txt`

These files will be used to generate the datasets on which the BART model is fine-tuned.
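Before moving on, it can be worth confirming that the LDA step produced everything listed above. The following is a minimal sketch, not part of the repository: `missing_files` is a hypothetical helper, and the directory paths assume the `LDA_INFO_PATH` and `TOPICS_PATH` values set earlier in this notebook.

```python
from pathlib import Path

# File names taken from the lists above; the helper itself is hypothetical.
EXPECTED_LDA_FILES = [
    "biology.dct",
    "biology.model",
    "biology.model.expElogbeta.npy",
    "biology.model.id2word",
    "biology.model.state",
    "biology_corpus.pickle",
]
EXPECTED_TOPIC_FILES = [
    "biology.csv",
    "biology.json",
    "biology_sentences.txt",
    "biology_sentences_raw.txt",
]


def missing_files(directory, expected_names):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected_names if not (base / name).is_file()]


if __name__ == "__main__":
    for directory, names in [
        ("../experiment/lda_info", EXPECTED_LDA_FILES),
        ("../experiment/topics", EXPECTED_TOPIC_FILES),
    ]:
        missing = missing_files(directory, names)
        print(directory, "OK" if not missing else f"missing: {missing}")
```

An empty result for both directories means the later steps should find all the inputs they read from disk.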

## 2. Obtaining NETL labels for topics

After extracting the topics from the corpus, these need to be labeled using the NETL method (https://github.com/sb1992/NETL-Automatic-Topic-Labelling-).

The original process proposed by the authors was slightly modified to take into account not only topics as sets of top-n words, but their probabilities in the distribution as well.

In order to run this script, you will need to also download the pre-trained Word2Vec and Doc2Vec models that they use. See [this section](https://github.com/sb1992/NETL-Automatic-Topic-Labelling-#pre-trained-models) from their repository. These will need to be unzipped in the `netl_src/model_run/pre_trained_models/` directory.

__NOTE__: This _will_ take a long time (a few hours, probably). There are 56 topics to process and a message will be printed every time one of them finishes, so you will know approximately how much is left at any moment.
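Since one message is printed per finished topic, the remaining time can be roughly extrapolated from the average time per topic so far. Below is a small sketch of that arithmetic, assuming all topics take about equally long (which may not hold in practice). `estimate_remaining_seconds` is a hypothetical helper, and 56 is the topic count mentioned above.

```python
def estimate_remaining_seconds(topics_done, elapsed_seconds, total_topics=56):
    """Linearly extrapolate the remaining run time from the progress so far."""
    if not 0 < topics_done <= total_topics:
        raise ValueError("topics_done must be between 1 and total_topics")
    seconds_per_topic = elapsed_seconds / topics_done
    return seconds_per_topic * (total_topics - topics_done)


# Example: 14 of the 56 topics have finished after one hour.
remaining = estimate_remaining_seconds(topics_done=14, elapsed_seconds=3600)
print(f"about {remaining / 3600:.1f} hours left")
```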

In [50]:
!python3 ../netl_src/model_run/get_labels.py \
    --topics ${TOPICS_PATH}/biology.json \
    --output-dir ${NETL_LABELS_PATH} \
    --output-suffix biology \
    --candidates

/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../netl_src
Extracting candidate labels
Data Gathered
/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../netl_src/model_run/pre_trained_models/doc2vec/docvecmodel.d2v
models loaded
Done unique-ing indices
  model1.wv.syn0norm = (model1.wv.syn0 / sqrt((model1.wv.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
  model1.wv.syn0norm = (model1.wv.syn0 / sqrt((model1.wv.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Done syn0norm
  model1.wv.syn0 = None
  model1_docvecs_doctag_syn0norm = (model1.docvecs.doctag_syn0 / sqrt((model1.docvecs.doctag_syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)[d_indices]
doc2vec normalized
  model2.wv.syn0norm = (model2.wv.syn0 / sqrt((model2.wv.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
  model2.wv.syn0norm = (model2.wv.syn0 / sqrt((model2.wv.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
  model2.wv.syn0 = None
  model3 = model2.wv.syn0norm[w_indices]
wor

After running the script, the `../experiment/netl_labels/` directory should have the `output_candidates_biology` file. It contains the candidate labels selected by the NETL method, which will be one source of labels (and the most important one) used in fine-tuning the BART model.

## 3. Creating the BART dataset

In this step, there are multiple options based on what kind of dataset we want to fine-tune BART on. The `BART-TL-ng` model showcased in the paper, for instance, was fine-tuned on a dataset created using the `bart-tl/build_dataset/terms_labels_ngrams/build_dataset_fairseq.py` script, while for `BART-TL-all` it was `bart-tl/build_dataset/terms_labels_sentences_ngrams_nps/build_dataset_fairseq.py`.

Here we will fine-tune a `BART-TL-all` model for a more complete example.

First of all, we need to generate the noun phrases for the topics, since unlike the sentences, these are not extracted when applying LDA in the first step.

__NOTE__: This will also take a long time, similar to the candidate selection at the previous step. After each completed topic, a message will be shown, as before.

In [53]:
!python3 ../lda/extract_noun_phrases.py \
    --lda-path ${LDA_INFO_PATH}/biology.model \
    --dict-path ${LDA_INFO_PATH}/biology.dct \
    --corpus-path ${LDA_INFO_PATH}/biology_corpus.pickle \
    --input-file ${CORPUS_FILE} \
    --output-prefix ${TOPICS_PATH}/biology

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

The `../experiment/topics/` directory should now have the `biology_noun_phrases.txt` file with the extracted noun phrases. Now for creating the dataset for `BART-TL-all`:

In [56]:
!python3 ../bart-tl/build_dataset/terms_labels_sentences_ngrams_nps/build_dataset_fairseq.py \
    --topics ${TOPICS_PATH}/biology.json \
    --candidates-file ${NETL_LABELS_PATH}/output_candidates_biology \
    --sentences-file ${TOPICS_PATH}/biology_sentences_raw.txt \
    --noun-phrases-file ${TOPICS_PATH}/biology_noun_phrases.txt \
    --output-dir ${BART_DATASET_PATH}

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The `../experiment/dataset_fairseq/` directory should now contain 4 files:
- `train.source`
- `train.target`
- `test.source`
- `test.target`

The `test` files are empty since we intend to use all the samples we have for fine-tuning BART.

__IMPORTANT NOTE__: The script that creates the dataset does not overwrite the data in the `train.source` and `train.target` files. If you intend to fine-tune on more StackExchange corpora (as was done in the paper), you can run the script for each of them one after the other and the BART dataset will accumulate in the `train` files.
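Because the `train` files grow with every run, a quick sanity check after each corpus is that `train.source` and `train.target` still have the same number of lines. This sketch assumes one example per line, as is usual for this kind of parallel source/target dataset. The helper names are illustrative and not part of the repository.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel_dataset(dataset_dir, split="train"):
    """Return the example count, or raise if source and target sizes differ."""
    base = Path(dataset_dir)
    n_source = count_lines(base / f"{split}.source")
    n_target = count_lines(base / f"{split}.target")
    if n_source != n_target:
        raise ValueError(
            f"{split}.source has {n_source} lines but {split}.target has {n_target}"
        )
    return n_source


if __name__ == "__main__":
    # Path assumed from the BART_DATASET_PATH value used in this notebook.
    dataset_dir = Path("../experiment/dataset_fairseq")
    if (dataset_dir / "train.source").is_file() and (dataset_dir / "train.target").is_file():
        print("training examples:", check_parallel_dataset(dataset_dir))
```

Running the check after each corpus also shows how many examples that corpus contributed.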

## 4. Decide whether to use Huggingface or Fairseq

Time to take a pause and decide the way to fine-tune the BART model on our dataset: __Huggingface__ or __Fairseq__.

In the [original paper](https://www.aclweb.org/anthology/2021.eacl-main.121.pdf), I used [Facebook's Fairseq](https://github.com/pytorch/fairseq/tree/master/examples/bart) for fine-tuning. However, this is a rather difficult process (at least, more difficult than using Huggingface): you need to pre-process the dataset further, download the [large BART model](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz) and save it locally. The Huggingface method, by contrast, only requires running a single script on the dataset built so far. Ultimately, I don't see a reason why using one over the other would yield wildly different results.

I will showcase both methods here. If you want results as close as possible to the original paper, opt for the Fairseq way. On the other hand, if you want to quickly and easily use a model for topic labeling, go for Huggingface. Moreover, if you don't want to fine-tune anything yourself, you can use the models already available on Huggingface:
- https://huggingface.co/cristian-popa/bart-tl-all
- https://huggingface.co/cristian-popa/bart-tl-ng

## 5a. Huggingface fine-tuning

First of all, you will need two additional empty files in the dataset. They are technically meant for validation, but we won't run any validation here:

In [74]:
!echo "" > ../experiment/dataset_fairseq/val.source
!echo "" > ../experiment/dataset_fairseq/val.target

Afterwards, you will need some additional packages:

In [96]:
!pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install "transformers==4.1.1"
!pip install gitpython
!pip install rouge_score
!pip install sacrebleu

Collecting transformers==4.1.1
  Using cached transformers-4.1.1-py3-none-any.whl (1.5 MB)
Collecting tokenizers==0.9.4
  Using cached tokenizers-0.9.4-cp39-cp39-macosx_10_11_x86_64.whl (2.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.10.2
    Uninstalling tokenizers-0.10.2:
      Successfully uninstalled tokenizers-0.10.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.4.2
    Uninstalling transformers-4.4.2:
      Successfully uninstalled transformers-4.4.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dsci-event-field-string-embedding 0.0.1 requires tokenizers==0.10.1, but you have tokenizers 0.9.4 which is incompatible.
dsci-event-field-string-embedding 0.0.1 requires transformers==4.4.2, but you have transformers 4.1.

Now you should be able to successfully run the following script. It is originally from the [`transformers` repository](https://github.com/huggingface/transformers), but slightly modified so it doesn't break when the validation and test sets are empty files.

In [76]:
!python3 ../seq2seq/finetune_trainer.py \
    --model_name_or_path facebook/bart-large \
    --learning_rate=3e-5 \
    --do_train \
    --data_dir ${BART_DATASET_PATH} \
    --output_dir ${HUGGINGFACE_MODEL_SAVE_PATH} \
    --max_source_length 128 \
    --task summarization \
    --max_target_length 64 \
    --test_max_target_length 64 \
    --lr_scheduler polynomial \
    --logging_dir ${HUGGINGFACE_MODEL_SAVE_PATH} \
    --warmup_steps 98 \
    --num_train_epochs 2 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-08 \
    --label_smoothing 0.1 \
    --weight_decay 0.01 \
    --run_name bart-tl-all-experiment \
    --save_steps 1626 \
    --save_total_limit 2 \
    --max_grad_norm 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1

Traceback (most recent call last):
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../seq2seq/finetune_trainer.py", line 367, in <module>
    main()
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../seq2seq/finetune_trainer.py", line 153, in main
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
  File "/usr/local/lib/python3.9/site-packages/transformers/hf_argparser.py", line 52, in __init__
    self._add_dataclass_arguments(dtype)
  File "/usr/local/lib/python3.9/site-packages/transformers/hf_argparser.py", line 85, in _add_dataclass_arguments
    elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
  File "/usr/local/Cellar/python@3.9/3.9.1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 829, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class


You can adjust the parameters above to fit your needs (in particular, set `warmup_steps` to 6% of the total number of steps); the values shown are the ones used in the paper. And that's it! After the fine-tuning is done, you will have all the data (and the model) in `../experiment/bart-tl-all/`.
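If you change the dataset size, batch size or number of epochs, the 6% warm-up has to be recomputed. The sketch below shows the arithmetic under stated assumptions: a single device, no gradient accumulation, and a batch size of 8, none of which are stated explicitly in this notebook. The example count of 6498 is the one reported in the Fairseq log in section 5b. With these inputs the result matches the `--save_steps 1626` and `--warmup_steps 98` values in the command above, which suggests, but does not confirm, that those settings were used.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates for a full run (no gradient accumulation)."""
    return math.ceil(num_examples / batch_size) * num_epochs


def warmup_steps(total_steps, warmup_fraction=0.06):
    """Warm-up length as a fraction of the total number of updates."""
    return round(total_steps * warmup_fraction)


# 6498 comes from the Fairseq log in this notebook; batch_size=8 is an assumption.
total = total_training_steps(num_examples=6498, batch_size=8, num_epochs=2)
print(total, warmup_steps(total))  # 1626 98
```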

To generate labels with the new `BART-TL-all` model, you can do this:

In [6]:
import numpy as np
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = '../experiment/bart-tl-all/'
num_labels_to_generate = 10

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# If you want to use a GPU, uncomment this line and also move the inputs
# (batch.input_ids and batch.attention_mask below) to 'cuda':
# model = model.to('cuda')
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Remember: the model is only trained on the biology corpus!
texts = [
    'virus vaccine influenza infection vaccination pig human hpv disease antibody',
    'diabetes glucose insulin type diabetic cholesterol level lipoprotein control lipid',
    'cardiac heart ventricular patient myocardial failure valve atrial left leave'
]

batch = tokenizer(texts, return_tensors='pt', truncation=True, padding='max_length', max_length=128)
generated_labels = model.generate(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,
    max_length=15,
    min_length=1,
    do_sample=False,
    num_beams=25,
    length_penalty=1.0,
    repetition_penalty=1.5,
    num_return_sequences=num_labels_to_generate
)

generated_labels = tokenizer.batch_decode(generated_labels, skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_labels = np.array(generated_labels).reshape((len(batch.input_ids), num_labels_to_generate))

print(generated_labels)

[['nervous system' 'dna methylation' 'mammary gland' 'susceptibility'
  'rna-seq' 'gene product' 'immune system' 'immune system' 'gene product'
  'gene expression']
 ['glutathione' 'glycolysis' 'dopamine' 'glycerol' 'dopamine'
  'fatty acid' 'glutamate receptor' 'bile acid' 'glycerol' 'metabolism']
 ['pulmonary hypertension' 'skeletal muscle' 'mitochondrion'
  'pericardium' 'thrombosis' 'apoptosis' 'breathing system'
  'pulmonary valve' 'breathing tube' 'heterologous']]
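
As the output above shows, the generated list for a topic can contain the same label more than once (for instance `immune system` and `gene product` in the first row). Here is a minimal sketch of an order-preserving de-duplication that could be applied to each row. `unique_labels` is a hypothetical helper, and the sample row is copied from the output above.

```python
def unique_labels(labels):
    """Drop repeated labels while keeping the first occurrence of each."""
    return list(dict.fromkeys(labels))


first_row = [
    "nervous system", "dna methylation", "mammary gland", "susceptibility",
    "rna-seq", "gene product", "immune system", "immune system", "gene product",
    "gene expression",
]
print(unique_labels(first_row))
```

Because `dict` preserves insertion order, the surviving labels keep the ranking produced by beam search.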


In [1]:
import transformers
print(transformers.__version__)

4.1.1


In [6]:
!python3.8 -m pip install torch==1.6.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0
  Using cached torch-1.6.0-cp38-none-macosx_10_9_x86_64.whl (97.5 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
Successfully installed torch-1.6.0


## 5b. Fairseq fine-tuning

After creating the dataset, it needs to be processed further in the case of Fairseq.

First of all, you need to install the `fairseq` package:

In [None]:
!pip install fairseq==0.9.0

Now the dataset needs BPE preprocessing:

In [77]:
!../bart-tl/preprocess/bpe/bpe_preprocess.sh



This will create more files in the `../experiment/dataset_fairseq/` directory.

Afterwards, the dataset needs to be binarized:

In [78]:
!../bart-tl/preprocess/binarization/binarize.sh

Namespace(no_progress_bar=False, log_interval=1000, log_format=None, tensorboard_logdir='', seed=1, cpu=False, fp16=False, memory_efficient_fp16=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer='nag', lr_scheduler='fixed', task='translation', source_lang='source', target_lang='target', trainpref='../experiment/dataset_fairseq/train.bpe', validpref='../experiment/dataset_fairseq/test.bpe', testpref=None, align_suffix=None, destdir='../experiment/dataset_fairseq-bin/', thresholdtgt=0, thresholdsrc=0, tgtdict='../bart-tl/preprocess/bpe/dict.txt', srcdict='../bart-tl/preprocess/bpe/dict.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=False, padding_factor=8, workers=60)
| [source] Dictionary: 50263 types
| [source] ../experiment/dataset_fairseq/train.bpe.so

This will create another directory, `../experiment/dataset_fairseq-bin/` with the final version of the data that will be fed to the BART model.

You will now need to download the `bart.large` model from the Fairseq repository: https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md#pre-trained-models.

Unzip it into the `../experiment/` directory. You will now have a directory named `../experiment/bart.large/` that contains a `model.pt` file. To fine-tune the model, you need to run the following cell.

__IMPORTANT NOTE__: The script never actually ends; you need to stop it manually once 2 epochs have finished.

In [81]:
!../bart-tl/finetune/finetune_bart.sh

Namespace(no_progress_bar=False, log_interval=1000, log_format=None, tensorboard_logdir='', seed=1, cpu=False, fp16=False, memory_efficient_fp16=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, criterion='label_smoothed_cross_entropy', tokenizer=None, bpe=None, optimizer='adam', lr_scheduler='polynomial_decay', task='translation', num_workers=1, skip_invalid_size_inputs_valid_test=True, max_tokens=2048, max_sentences=None, required_batch_size_multiple=1, dataset_impl=None, train_subset='train', valid_subset='valid', validate_interval=1, fixed_validation_seed=None, disable_validation=False, max_tokens_valid=2048, max_sentences_valid=None, curriculum=0, distributed_world_size=1, distributed_rank=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, device_id=0, distributed_no_spawn=False, ddp_backend='c10d', bucket_cap_mb=25, fix_batches_to_gpus=Fa

          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=1024, out_features=4096, bias=True)
        (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
      (4): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (self_attn_layer_norm): Layer

| loaded checkpoint ../experiment/bart.large/model.pt (epoch 41 @ 0 updates)
| loading train data for epoch 0
| loaded 6498 examples from: ../experiment/dataset_fairseq-bin/train.source-target.source
| loaded 6498 examples from: ../experiment/dataset_fairseq-bin/train.source-target.target
| ../experiment/dataset_fairseq-bin train source-target 6498 examples
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  ../torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
| epoch 001 | loss 8.438 | nll_loss 6.990 | ppl 127.11 | wps 7 | ups 0 | wpb 306.227 | bsz 50.766 | num_updates 128 | lr 0 | gnorm 116.923 | clip 1.000 | oom 0.000 | wall 5356 | train_wall 5310
| epoch 001 | valid on 'valid' subset | loss 15.172 | nll_loss 14.692 | ppl 26464.8 | num_updates 128
| saved checkpoint checkpoints/checkpoint1.pt (epoch 1 @ 128 updates) (writing took 28.25100016593

Your model checkpoints should be available in the `notebooks/checkpoints/` directory and, if you stopped the cell after 2 epochs were finished, you will have the following files:
- `checkpoint1.pt`
- `checkpoint2.pt`
- `checkpoint_best.pt`

You only need to keep the `checkpoint2.pt` file, as that's the fine-tuned model after 2 epochs.

To generate labels using it, run:

In [82]:
!python3 ../bart-tl/generate.py \
    --model-path checkpoints/checkpoint2.pt \
    --processed-dataset-path ../experiment/dataset_fairseq-bin/ \
    --topics-file ../experiment/dataset_fairseq/train.source \
    --output-file ../experiment/generated_labels.txt

loading archive file .
| [source] dictionary: 50264 types
| [target] dictionary: 50264 types
Traceback (most recent call last):
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../bart-tl/generate.py", line 142, in <module>
    main()
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../bart-tl/generate.py", line 122, in main
    hypotheses_batch.append(bart.sample([topic], num_samples=num_samples, beam=beam, lenpen=2.0, max_len_b=60, min_len=1, no_repeat_ngram_size=10, length_penalty=1.0)[0])
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../bart-tl/generate.py", line 46, in sample
    sample, translations = self.generate(input, beam, verbose, **kwargs)
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../bart-tl/generate.py", line 67, in generate
    translations = self.task.inference_step(
  File "/usr/local/lib/python3.9/site-packages/fairseq/tasks/fairseq

Unfortunately, I was not able to get this generation to work, even though the code is exactly the same as what I used in my experiments (most probably, some packages got messed up in the meantime). Sorry about this :(

When it works, the command above _should_ generate labels for all the topics in the training file and write them to `../experiment/generated_labels.txt`. The labels are space-separated, and within each label the spaces are replaced by `_` characters.
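
To turn such a file back into readable labels, the encoding described above can be reversed by splitting on whitespace and replacing `_` with spaces. This is a minimal sketch: the function names are hypothetical, the sample line is made up for illustration, and it assumes one line of labels per topic, which the notebook does not state.

```python
def parse_label_line(line):
    """Split one line of generated labels and restore the spaces inside each label."""
    return [label.replace("_", " ") for label in line.split()]


def read_generated_labels(path):
    """Read the labels file, returning one list of labels per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [parse_label_line(line) for line in handle if line.strip()]


# Hypothetical line in the format described above.
print(parse_label_line("immune_system gene_expression metabolism"))
# ['immune system', 'gene expression', 'metabolism']
```

Note that a label that legitimately contains an underscore would also be split into words by this approach.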