# End-to-end Training of BART-TL

## 0. Preamble

We must first install the required Python packages:

In [16]:
!python3 -m pip install -r ../requirements.txt



## 1. Applying LDA on corpora

The first step is extracting topics from the StackExchange corpora. In this notebook we only experiment with the biology corpus. Using the other corpora is very similar.

In [54]:
# XML file used in the experiment
%env CORPUS_FILE=../corpus/biology.stackexchange.com/Posts.xml
# directory where the LDA models and info will be stored
%env LDA_INFO_PATH=../experiment/lda_info/
# directory where the topic data will be stored
%env TOPICS_PATH=../experiment/topics/
# directory where the NETL-extracted labels will be stored
%env NETL_LABELS_PATH=../experiment/netl_labels/

# directory where the dataset for fine-tuning BART will be stored
%env BART_DATASET_PATH=../experiment/dataset_fairseq/
# where the model will be saved
%env HUGGINGFACE_MODEL_SAVE_PATH=../experiment/bart-tl-all/

env: CORPUS_FILE=../corpus/biology.stackexchange.com/Posts.xml
env: LDA_INFO_PATH=../experiment/lda_info/
env: TOPICS_PATH=../experiment/topics/
env: NETL_LABELS_PATH=../experiment/netl_labels/
env: BART_DATASET_PATH=../experiment/dataset_fairseq/
env: HUGGINGFACE_MODEL_SAVE_PATH=../experiment/bart-tl-all/


In [20]:
!mkdir -p ${LDA_INFO_PATH}
!mkdir -p ${TOPICS_PATH}
!mkdir -p ${NETL_LABELS_PATH}

Time to run the script that applies LDA on the biology corpus from StackExchange:

In [17]:
!python3 ../lda/apply_lda.py \
    --input-file ${CORPUS_FILE} \
    --output-prefix ${LDA_INFO_PATH}/biology \
    --topics-prefix ${TOPICS_PATH}/biology

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Done pre-processing documents
Done pre-processing corpus


After running the script, the `../experiment/lda_info` directory should contain the following files:
- `biology.dct`
- `biology.model`
- `biology.model.expElogbeta.npy`
- `biology.model.id2word`
- `biology.model.state`
- `biology_corpus.pickle`

These files are needed for extracting the noun phrases in step 3. They are also saved so you can inspect the LDA model if something looks off in your experiments.

The `../experiment/topics` directory will contain:
- `biology.csv`
- `biology.json`
- `biology_sentences.txt`
- `biology_sentences_raw.txt`

These files will be used to generate the datasets on which the BART model is fine-tuned.
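Before moving on, it can help to confirm that all of the files listed above were actually written. The following is a minimal sketch, not part of the repository: the helper name `find_missing_files` is hypothetical, and the file names and directories are simply copied from the lists above.

```python
from pathlib import Path

# File names copied from the lists above; adjust them if you used a different prefix.
EXPECTED_FILES = {
    "../experiment/lda_info": [
        "biology.dct",
        "biology.model",
        "biology.model.expElogbeta.npy",
        "biology.model.id2word",
        "biology.model.state",
        "biology_corpus.pickle",
    ],
    "../experiment/topics": [
        "biology.csv",
        "biology.json",
        "biology_sentences.txt",
        "biology_sentences_raw.txt",
    ],
}


def find_missing_files(expected):
    """Return the paths listed in `expected` that do not exist as files (hypothetical helper)."""
    missing = []
    for directory, names in expected.items():
        for name in names:
            path = Path(directory) / name
            if not path.is_file():
                missing.append(path)
    return missing


missing = find_missing_files(EXPECTED_FILES)
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All expected LDA outputs are present.")
```

If anything is reported as missing, re-running the LDA cell is probably the simplest fix.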

## 2. Obtaining NETL labels for topics

After extracting the topics from the corpus, these need to be labeled using the NETL method (https://github.com/sb1992/NETL-Automatic-Topic-Labelling-).

The original process proposed by the authors was slightly modified so that it takes into account not only the top-n words of each topic, but also their probabilities in the topic distribution.

In order to run this script, you will need to also download the pre-trained Word2Vec and Doc2Vec models that they use. See [this section](https://github.com/sb1992/NETL-Automatic-Topic-Labelling-#pre-trained-models) from their repository. These will need to be unzipped in the `netl_src/model_run/pre_trained_models/` directory.

__NOTE__: This _will_ take a long time (a few hours, probably). There are 56 topics to process and a message will be printed every time one of them finishes, so you will know approximately how much is left at any moment.
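Since a message is printed for every finished topic, you can get a rough estimate of the remaining time from the number of topics completed so far. This is only a sketch: the function name is hypothetical, and it assumes every topic takes about the same time, which may not hold in practice.

```python
def estimate_remaining_seconds(elapsed_seconds, topics_done, total_topics=56):
    """Extrapolate the remaining run time from the average time per finished topic.

    Hypothetical helper; assumes all topics take roughly the same time.
    """
    if topics_done <= 0:
        raise ValueError("at least one topic must have finished")
    seconds_per_topic = elapsed_seconds / topics_done
    return seconds_per_topic * (total_topics - topics_done)


# Example: 6 topics finished after 10 minutes.
remaining = estimate_remaining_seconds(elapsed_seconds=600, topics_done=6)
print(f"Roughly {remaining / 3600:.1f} hours left")
```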

In [50]:
!python3 ../netl_src/model_run/get_labels.py \
    --topics ${TOPICS_PATH}/biology.json \
    --output-dir ${NETL_LABELS_PATH} \
    --output-suffix biology \
    --candidates

/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../netl_src
Extracting candidate labels
Data Gathered
/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../netl_src/model_run/pre_trained_models/doc2vec/docvecmodel.d2v
models loaded
Done unique-ing indices
Done syn0norm
doc2vec normalized

After running the script, the `../experiment/netl_labels/` directory should contain the `output_candidates_biology` file. This file contains the candidate labels selected by the NETL method, which will be one source of labels (and the most important one) used in fine-tuning the BART model.

## 3. Creating the BART dataset

In this step, there are multiple options based on what kind of dataset we want to fine-tune BART on. The `BART-TL-ng` model showcased in the paper, for instance, was fine-tuned on a dataset created using the `bart-tl/build_dataset/terms_labels_ngrams/build_dataset_fairseq.py` script, while for `BART-TL-all` it was `bart-tl/build_dataset/terms_labels_sentences_ngrams_nps/build_dataset_fairseq.py`.

Here we will fine-tune a `BART-TL-all` model for a more complete example.
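If you plan to switch between variants, the two dataset builders mentioned above can be kept in a small lookup table. This is just a convenience sketch: the dictionary and the function name are hypothetical, and only the two script paths named above are included.

```python
# Script paths as given above, relative to the repository root.
DATASET_BUILDERS = {
    "BART-TL-ng": "bart-tl/build_dataset/terms_labels_ngrams/build_dataset_fairseq.py",
    "BART-TL-all": "bart-tl/build_dataset/terms_labels_sentences_ngrams_nps/build_dataset_fairseq.py",
}


def builder_script(variant):
    """Return the dataset-building script for a model variant (hypothetical helper)."""
    try:
        return DATASET_BUILDERS[variant]
    except KeyError:
        known = ", ".join(sorted(DATASET_BUILDERS))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None


print(builder_script("BART-TL-all"))
```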

First of all, we need to generate the noun phrases for the topics, since unlike the sentences, these are not extracted when applying LDA in the first step.

__NOTE__: This will also take a long time, similar to the candidate selection in the previous step. As before, a message will be shown after each completed topic.

In [53]:
!python3 ../lda/extract_noun_phrases.py \
    --lda-path ${LDA_INFO_PATH}/biology.model \
    --dict-path ${LDA_INFO_PATH}/biology.dct \
    --corpus-path ${LDA_INFO_PATH}/biology_corpus.pickle \
    --input-file ${CORPUS_FILE} \
    --output-prefix ${TOPICS_PATH}/biology

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

The `../experiment/topics/` directory should now have the `biology_noun_phrases.txt` file with the extracted noun phrases. Now for creating the dataset for `BART-TL-all`:

In [56]:
!python3 ../bart-tl/build_dataset/terms_labels_sentences_ngrams_nps/build_dataset_fairseq.py \
    --topics ${TOPICS_PATH}/biology.json \
    --candidates-file ${NETL_LABELS_PATH}/output_candidates_biology \
    --sentences-file ${TOPICS_PATH}/biology_sentences_raw.txt \
    --noun-phrases-file ${TOPICS_PATH}/biology_noun_phrases.txt \
    --output-dir ${BART_DATASET_PATH}

[nltk_data] Downloading package stopwords to /Users/cpopa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The `../experiment/dataset_fairseq/` directory should now contain 4 files:
- `train.source`
- `train.target`
- `test.source`
- `test.target`

The `test` files are empty since we intend to use all the samples we have for fine-tuning BART.
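A quick sanity check at this point is to verify that the source and target files have the same number of lines. The sketch below assumes the usual convention for `.source`/`.target` pairs, namely one example per line with line `i` of the source matching line `i` of the target. The helper names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count the lines of a text file without loading it fully into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel_files(dataset_dir, split="train"):
    """Raise if `<split>.source` and `<split>.target` differ in line count.

    Hypothetical helper; returns the number of examples when the files match.
    """
    dataset_dir = Path(dataset_dir)
    n_source = count_lines(dataset_dir / f"{split}.source")
    n_target = count_lines(dataset_dir / f"{split}.target")
    if n_source != n_target:
        raise ValueError(
            f"{split}.source has {n_source} lines but {split}.target has {n_target}"
        )
    return n_source


# Usage in this notebook (uncomment once the dataset has been built):
# print(check_parallel_files("../experiment/dataset_fairseq/"), "training examples")
```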

## 4. Decide whether to use Huggingface or Fairseq

Time to pause and decide how to fine-tune the BART model on our dataset: __Huggingface__ or __Fairseq__.

In the [original paper](https://www.aclweb.org/anthology/2021.eacl-main.121.pdf), I used [Facebook's Fairseq](https://github.com/pytorch/fairseq/tree/master/examples/bart) for fine-tuning. However, this is a rather involved process (at least, more so than using Huggingface): you need to pre-process the dataset further, then download the [large BART model](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz) and save it locally. The Huggingface method, by contrast, only requires running a single script on the dataset built so far. Ultimately, I see no reason why using one over the other would yield wildly different results.

I will showcase both methods here. If you want results as close as possible to those in the original paper, you can opt for the Fairseq way. On the other hand, if you want to quickly and easily use a model for topic labeling, you should go for Huggingface - and I will soon upload the model to https://huggingface.co/models, so you won't even need to fine-tune it yourself (or go through any of the previous steps, in fact).

## 5a. Huggingface fine-tuning

In [74]:
!echo "" > ../experiment/dataset_fairseq/val.source
!echo "" > ../experiment/dataset_fairseq/val.target

You will need to clone the [`transformers` repository](https://github.com/huggingface/transformers) for the following fine-tuning script to work.

In [76]:
!python3 ../seq2seq/finetune_trainer.py \
    --model_name_or_path facebook/bart-large \
    --learning_rate=3e-5 \
    --do_train \
    --data_dir ${BART_DATASET_PATH} \
    --output_dir ${HUGGINGFACE_MODEL_SAVE_PATH} \
    --max_source_length 128 \
    --task summarization \
    --max_target_length 64 \
    --test_max_target_length 64 \
    --lr_scheduler polynomial \
    --logging_dir ${HUGGINGFACE_MODEL_SAVE_PATH} \
    --warmup_steps 1027 \
    --num_train_epochs 2 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-08 \
    --label_smoothing 0.1 \
    --weight_decay 0.01 \
    --run_name bart-tl-all-experiment \
    --save_steps 42012 \
    --save_total_limit 2 \
    --max_grad_norm 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1

Traceback (most recent call last):
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../seq2seq/finetune_trainer.py", line 367, in <module>
    main()
  File "/Users/cpopa/personal/workspace/bart-tl-topic-label-generation/notebooks/../seq2seq/finetune_trainer.py", line 153, in main
    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
  File "/usr/local/lib/python3.9/site-packages/transformers/hf_argparser.py", line 52, in __init__
    self._add_dataclass_arguments(dtype)
  File "/usr/local/lib/python3.9/site-packages/transformers/hf_argparser.py", line 85, in _add_dataclass_arguments
    elif hasattr(field.type, "__origin__") and issubclass(field.type.__origin__, List):
  File "/usr/local/Cellar/python@3.9/3.9.1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 829, in __subclasscheck__
    return issubclass(cls, self.__origin__)
TypeError: issubclass() arg 1 must be a class


And that's it! Once the fine-tuning completes successfully, you will have all the data (and the model) in `../experiment/bart-tl-all/`.

To generate labels with the new `BART-TL-all` model, you can do this:

In [None]:
import numpy as np
from pathlib import Path
from tqdm import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = '../experiment/bart-tl-all/'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# If you want to use a GPU, uncomment this line (the tokenized `batch` below must then be moved to 'cuda' as well)
# model = model.to('cuda')
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = 'business company technology product customer service provide management development system'

batch = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=128)
generated_labels = model.generate(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,
    max_length=15,
    min_length=1,
    do_sample=False,
    num_beams=25,
    length_penalty=1.0,
    repetition_penalty=1.5,
    num_return_sequences=10
)

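# `generate` returns token ids, so decode them into strings before joining
decoded_labels = tokenizer.batch_decode(generated_labels, skip_special_tokens=True)
print('Generated labels: ' + ', '.join(decoded_labels))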
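With `num_return_sequences=10`, some of the returned labels may differ only in casing or whitespace. Assuming the generated token ids have already been decoded into plain strings (for example with `tokenizer.batch_decode`), a small post-processing step like the one below can collapse such near-duplicates. The helper name is hypothetical and this step is not part of the original pipeline.

```python
def unique_labels(labels):
    """Drop empty labels and duplicates, ignoring case and extra whitespace.

    Hypothetical helper; keeps the first occurrence of each label, in order.
    """
    seen = set()
    result = []
    for label in labels:
        cleaned = " ".join(label.split())  # collapse runs of whitespace
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


# Example with made-up labels:
print(unique_labels(["Cell biology", "cell  biology ", "", "Genetics"]))
```

Because beam search returns sequences from best to worst score, keeping the first occurrence also keeps the highest-ranked spelling of each label.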

## 5b. Fairseq fine-tuning

After creating the dataset, it needs to be processed further in the case of Fairseq.

First of all, it needs BPE preprocessing:

In [None]:
!../bart-tl/preprocess/bpe/bpe_preprocess.sh

In [None]:
Afterwards, it 