<a href="https://colab.research.google.com/github/nicolay-r/book-persona-retriever/blob/master/parlai_gutenberg_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Literature Dialogue Response Task (LDR) 📚 • [![twitter](https://img.shields.io/twitter/url/https/shields.io.svg?style=social)](https://x.com/nicolayr_/status/1801009815784677862)
![](https://img.shields.io/badge/Python-3.8-lightgreen.svg)
[![Youtube badge](https://img.shields.io/badge/-Youtube-Cc4c4c?style=flat-square&logo=Youtube&logoColor=white&link=https://twitter.com/nicolayr_)](https://youtu.be/UQQsXfZyjjc)

[![](https://markdown-videos-api.jorgenkh.no/youtube/UQQsXfZyjjc)](https://youtu.be/UQQsXfZyjjc)

> ⚠️ **Disclaimer**: this repository sets up the task for the predefined `train` and `valid` splits. To replicate the studies on different splits, you have to update the related parts manually.
> We believe that ParlAI supports task initialization in cross-validation mode; however, this goes beyond the scope of the current version of this project.

This repository represents supplementary material for organizing the [`nicolay-r/book-persona-retriever`](https://github.com/nicolay-r/book-persona-retriever) experiments 🧪 mentioned in the paper
[Personality Profiling for Literary Character Dialogue Agents with Human Level Attributes (**pre-print**)](https://www.dropbox.com/scl/fi/0c2axh97hadolwphgu7it/rusnachenko2024personality.pdf?rlkey=g2yyzv01th2rjt4o1oky0q8zc&st=omssztha&dl=1)
that has been accepted for the *Long Paper* track at [LOD-2024](https://lod2024.icas.events/).


In [None]:
# select python version
!sudo apt-get install python3.8 --fix-missing
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
!sudo update-alternatives --config python3
# check python version
!python --version
# install pip for new python
!sudo apt-get install python3.8-distutils
!wget https://bootstrap.pypa.io/pip/get-pip.py
!sudo python get-pip.py

In [None]:
!python3 --version

Python 3.8.18
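
The version string printed above can also be checked programmatically before installing ParlAI. The following is a minimal sketch, assuming the Python 3.8 requirement shown in the badge at the top; the helper names `parse_python_version` and `is_python_38` are hypothetical and not part of this repository.

```python
import re
from typing import Tuple


def parse_python_version(version_output: str) -> Tuple[int, int, int]:
    """Parse text such as 'Python 3.8.18' into the tuple (3, 8, 18)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_python_38(version_output: str) -> bool:
    """Return True when the reported interpreter is some 3.8.x release."""
    return parse_python_version(version_output)[:2] == (3, 8)


# The string below is the output shown by the cell above.
print(is_python_38("Python 3.8.18"))
```

The check works on the text reported by `python3 --version` rather than on the notebook kernel itself, because the kernel may run a different interpreter than the one selected through `update-alternatives`.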


In [None]:
!rm -rf parlai_bookchar_task
# NOTE: the embedded access token was removed; provide your own credentials if the repository requires them
!git clone https://github.com/nicolay-r/parlai_bookchar_task.git

In [None]:
!pip install parlai pytorch-pretrained-bert
# Install py-rouge metrics
!pip install py-rouge
!python -c "import nltk; nltk.download('punkt')"
!pip install transformers

In [None]:
!rm -rf "/usr/local/lib/python3.10/dist-packages/data/GutenbertBookChars"
!cd parlai_bookchar_task && ./setup_colab.sh

## Dataset downloading

Download [dataset-v4](https://drive.google.com/drive/folders/1Xz71KeBUurVWNs5XOaqCRA7BkYRt_bBM?usp=sharing) and copy it locally into `./parlai/`.

This is a **BERT-Bi-Ranker** application.

The model is expected to be pretrained on the ConvAI2 data first!

In [None]:
!parlai train_model -t gutenbergbookchars -m bert_ranker/bi_encoder_ranker --batchsize 20 -veps 1 --num-epochs 10 \
--save-after-valid True --log_every_n_steps 500 --tensorboard_log True --model_file ./parlai_bert/bert_biencoder_test --fp16 True --truncate 360 \
--candidates batch --dict-tokenizer bpe --dict-lower True --history-size -1 --optimizer adam -lr 5e-05 --data-parallel True

Random selection application

In [None]:
!parlai eval_model -m random_candidate -t gutenbergbookchars:Spectrum

Memory network (MemNN) application.

`-mf` denotes the **model file** used to load/save the model.

In [None]:
!parlai train_model -m memnn -t gutenbergbookchars --model_file ./parlai/memnn-origin -veps 1 -eps 20 \
--save-after-valid True --log_every_n_steps 5000 --tensorboard_log True --batchsize 128 -lr 2 \
  --dynamic-batching full --truncate 320

In [None]:
!parlai train_model -m memnn --init-model ./parlai/memnn-origin  -t gutenbergbookchars:Spectrum --model_file ./parlai/memnn-spectrum -veps 1 -eps 20 \
--save-after-valid True --log_every_n_steps 5000 --tensorboard_log True --batchsize 128 -lr 0.5

In [None]:
!parlai train_model -m tfidf_retriever -t gutenbergbookchars -mf ./parlai/gutenbertbookchars_tfidf \
 -eps 1 --datatype train:ordered  --tensorboard_log True

# IR baseline

**`NOTE:`**` the untrained version performs worse, so it is better to refer to a pretrained model from the zoo`

In [None]:
!parlai eval_model -m ir_baseline -t gutenbergbookchars

In [None]:
!parlai eval_model -m ir_baseline -t gutenbergbookchars -mf zoo:wikipedia_full/tfidf_retriever/model

In [None]:
!parlai eval_model -m ir_baseline -t gutenbergbookchars:Spectrum -mf zoo:wikipedia_full/tfidf_retriever/model

# IR-baseline model (dict)
IR-baseline model, trained with the dict vocabulary.
We limit the `-eps` to `5` according to the preliminary analysis here:
https://docs.google.com/spreadsheets/d/1-_lJ-wfSlscyM1un1DdMw_xsuuD8U6GR_MmL1_iF0uY/edit#gid=0

In [None]:
!parlai train_model -m ir_baseline -t gutenbergbookchars \
  --dict-file ./parlai/gutenbergbookchars.dict -veps 1 -eps 5 \
  --model-file ./parlai/ir_baseline_dict --tensorboard_log True

In [None]:
!parlai train_model -m ir_baseline -t gutenbergbookchars:Spectrum \
  --init-model ./parlai/ir_baseline_dict \
  --dict-file ./parlai/gutenbergbookchars_spectrum.dict -veps 1 -eps 5 \
  --model-file ./parlai/ir_baseline_spectrum_dict --tensorboard_log True

In [None]:
!parlai eval_model -m ir_baseline -mf parlai/ir_baseline_dict -t gutenbergbookchars --metrics all

In [None]:
!parlai eval_model -m ir_baseline -mf parlai/ir_baseline_dict -t gutenbergbookchars:Spectrum --metrics all

In [None]:
!parlai eval_model -m ir_baseline -mf parlai/ir_baseline_spectrum_dict -t gutenbergbookchars:Spectrum --metrics all

In [None]:
!parlai eval_model -m ir_baseline -mf parlai/ir_baseline_spectrum_dict -t gutenbergbookchars --metrics all

# Save content onto GDRIVE

In [None]:
cp -r /content/parlai/model_poly_* /content/gdrive/MyDrive/work-NewCastle/my-studies/dataset-v4.1/parlai/
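
The same backup step can be scripted in Python, which makes it easier to log what was copied. This is a minimal sketch rather than the procedure used in the experiments: the helper name `copy_matching` is hypothetical, and the demo runs against a temporary directory with placeholder files instead of the real `/content` paths.

```python
import glob
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_matching(pattern: str, destination: str) -> List[str]:
    """Copy every file or directory matching `pattern` into `destination`."""
    dest_dir = Path(destination)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for match in sorted(glob.glob(pattern)):
        source = Path(match)
        target = dest_dir / source.name
        if source.is_dir():
            # dirs_exist_ok requires Python 3.8 or newer.
            shutil.copytree(source, target, dirs_exist_ok=True)
        else:
            shutil.copy2(source, target)
        copied.append(target.name)
    return copied


# Demo on a throwaway directory, so nothing outside of it is touched.
with tempfile.TemporaryDirectory() as workdir:
    src_dir = Path(workdir) / "parlai"
    src_dir.mkdir()
    for name in ("model_poly_finetuned", "model_poly_finetuned.checkpoint", "memnn-origin"):
        (src_dir / name).write_text("placeholder")
    demo_copied = copy_matching(str(src_dir / "model_poly_*"), str(Path(workdir) / "backup"))

print(demo_copied)
```

For the real run, the pattern and destination would be the two paths from the `cp -r` command above.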

# Transformers

ConvAI2 application of these models:
this is correct to a certain extent, but we keep only the information about the persona,
without mentioning the exact type of the persona (at the moment, dataset v3).

In [None]:
!pip install tensorrt

In [None]:
!parlai eval_model -mf zoo:pretrained_transformers/model_poly/model -t gutenbergbookchars \
 --eval-candidates inline --batchsize 20 --text-truncate 360 --dynamic-batching full

In [None]:
!parlai eval_model -mf zoo:pretrained_transformers/model_poly/model -t gutenbergbookchars:Spectrum \
 --eval-candidates inline --batchsize 20  --text-truncate 360 --dynamic-batching full

In [None]:
!parlai eval_model -mf zoo:pretrained_transformers/model_bi/model -t gutenbergbookchars:Spectrum \
 --eval-candidates inline --batchsize 20 --text-truncate 360 --dynamic-batching full

# Fine-tuning pretrained ConvAI2 models
Fine-tuning the model on the original dataset **without human level attributes**

https://parl.ai/projects/polyencoder/

Follow this thread in order to launch fine-tuning:

https://github.com/facebookresearch/ParlAI/issues/2931

In [None]:
!parlai train_model \
    --init-model zoo:pretrained_transformers/model_bi/model \
    --batchsize 32 -t gutenbergbookchars \
    --model transformer/biencoder --eval-batchsize 6 \
    --warmup_updates 100 --lr-scheduler-patience 0 \
    --lr-scheduler-decay 0.4 -lr 5e-05 --data-parallel True \
    --history-size 20 --label-truncate 72 --text-truncate 360 \
    --num-epochs 3.0 --max_train_time 200000 -veps 0.5 -vme 8000 \
    --validation-metric accuracy --validation-metric-mode max \
    --save-after-valid True --log_every_n_secs 20 --candidates batch \
    --dict-tokenizer bpe --dict-lower True --optimizer adamax \
    --output-scaling 0.06 \
     --variant xlm --reduction-type mean --share-encoders False \
     --learn-positional-embeddings True --n-layers 12 --n-heads 12 \
     --ffn-size 3072 --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 \
     --n-positions 1024 --embedding-size 768 --activation gelu \
     --embeddings-scale False --n-segments 2 --learn-embeddings True \
     --share-word-embeddings False --dict-endtoken __start__ --fp16 True \
     --model-file ./parlai/model_bi_finetuned --tensorboard_log True

In [None]:
!parlai train_model \
  --init-model zoo:pretrained_transformers/poly_model_huge_reddit/model \
  -t gutenbergbookchars \
  --model transformer/polyencoder --batchsize 20 --eval-batchsize 10 \
  --warmup_updates 100 --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
  -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
  --text-truncate 360 --num-epochs 8.0 --max_train_time 200000 -veps 0.5 \
  -vme 8000 --validation-metric accuracy --validation-metric-mode max \
  --save-after-valid True --log_every_n_secs 20 --candidates batch --fp16 True \
  --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
  --variant xlm --reduction-type mean --share-encoders False \
  --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
  --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
  --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
  --learn-embeddings True --polyencoder-type codes --poly-n-codes 64 \
  --poly-attention-type basic --dict-endtoken __start__ \
  --model-file ./parlai/model_poly_finetuned --tensorboard_log True

# Evaluate fine-tuned models on the dataset with spectrums

Important: specify the model file with the `.checkpoint` suffix

https://github.com/facebookresearch/ParlAI/issues/2904
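
A small helper can pick the `.checkpoint` file automatically when it exists. This is a minimal sketch, assuming the checkpoint is stored next to the path passed as `--model-file` with `.checkpoint` appended (as in the commands of this notebook); the name `resolve_model_file` is hypothetical, and the demo uses placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def resolve_model_file(model_file: str) -> str:
    """Prefer `<model_file>.checkpoint`; fall back to `<model_file>` itself."""
    checkpoint = Path(model_file + ".checkpoint")
    if checkpoint.exists():
        return str(checkpoint)
    if Path(model_file).exists():
        return model_file
    raise FileNotFoundError(f"neither {checkpoint} nor {model_file} exists")


# Demo with placeholder files instead of real model weights.
with tempfile.TemporaryDirectory() as workdir:
    bi_model = Path(workdir) / "model_bi_finetuned"
    Path(str(bi_model) + ".checkpoint").write_text("placeholder")
    demo_with_checkpoint = Path(resolve_model_file(str(bi_model))).name

    poly_model = Path(workdir) / "model_poly_finetuned"
    poly_model.write_text("placeholder")
    demo_without_checkpoint = Path(resolve_model_file(str(poly_model))).name

print(demo_with_checkpoint, demo_without_checkpoint)
```

The returned path is what would be passed to `-mf` in the evaluation commands.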

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:Spectrum \
  --eval-candidates inline --batchsize 20

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:Spectrum \
  --eval-candidates inline --batchsize 20

In [None]:
!parlai train_model \
    --init-model ./parlai/model_bi_finetuned.checkpoint \
    --batchsize 32 -t gutenbergbookchars:Spectrum \
    --model transformer/biencoder --eval-batchsize 6 \
    --warmup_updates 100 --lr-scheduler-patience 0 \
    --lr-scheduler-decay 0.4 -lr 5e-05 --data-parallel True \
    --history-size 20 --label-truncate 72 --text-truncate 360 \
    --num-epochs 6.0 --max_train_time 200000 -veps 0.5 -vme 8000 \
    --validation-metric accuracy --validation-metric-mode max \
    --save-after-valid True --log_every_n_secs 20 --candidates batch \
    --dict-tokenizer bpe --dict-lower True --optimizer adamax \
    --output-scaling 0.06 \
     --variant xlm --reduction-type mean --share-encoders False \
     --learn-positional-embeddings True --n-layers 12 --n-heads 12 \
     --ffn-size 3072 --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 \
     --n-positions 1024 --embedding-size 768 --activation gelu \
     --embeddings-scale False --n-segments 2 --learn-embeddings True \
     --share-word-embeddings False --dict-endtoken __start__ --fp16 True \
     --model-file ./parlai/model_bi_spectrums_finetuned --tensorboard_log True

In [None]:
!parlai train_model \
  --init-model ./parlai/model_poly_finetuned.checkpoint \
  -t gutenbergbookchars:Spectrum \
  --model transformer/polyencoder --batchsize 20 --eval-batchsize 10 \
  --warmup_updates 100 --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
  -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
  --text-truncate 360 --num-epochs 8.0 --max_train_time 200000 -veps 0.5 \
  -vme 8000 --validation-metric accuracy --validation-metric-mode max \
  --save-after-valid True --log_every_n_secs 20 --candidates batch --fp16 True \
  --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
  --variant xlm --reduction-type mean --share-encoders False \
  --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
  --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
  --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
  --learn-embeddings True --polyencoder-type codes --poly-n-codes 64 \
  --poly-attention-type basic --dict-endtoken __start__ \
  --model-file ./parlai/model_poly_spectrums_finetuned --tensorboard_log True

In [None]:
!parlai eval_model -mf ./parlai/model_bi_spectrums_finetuned.checkpoint -t gutenbergbookchars:Spectrum \
  --eval-candidates inline --batchsize 20

Just in case, check how this model performs on the dataset without traits

In [None]:
!parlai eval_model -mf ./parlai/model_bi_spectrums_finetuned.checkpoint -t gutenbergbookchars --batchsize 20

# Generative Models

### GPT-2 Small

In [None]:
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
--add-start-token True --gpt2-size small -t gutenbergbookchars -bs 24 \
-mf parlai/gpt-2-small-no-hla -veps 0.5 --tensorboard_log True --num-epochs 6 \
 --fp16 True --text-truncate 360 --dynamic-batching full \
 --validation-metric f1

In [None]:
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
--add-start-token True --gpt2-size small -t gutenbergbookchars:Spectrum -bs 8 \
-mf parlai/gpt-2-small-spectrum -veps 0.5 --tensorboard_log True --num-epochs 6 \
 --fp16 True --text-truncate 360 --dynamic-batching full \
--validation-metric f1

### GPT-2 Medium

In [None]:
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
--add-start-token True --gpt2-size medium -t gutenbergbookchars -bs 6 \
-mf parlai/gpt-2-medium-no-hla -veps 0.5 --tensorboard_log True --num-epochs 3 \
 --sval True --fp16 True --text-truncate 360 --dynamic-batching full \
 --validation-metric f1

In [None]:
!parlai train_model -m hugging_face/gpt2 --add-special-tokens True \
--add-start-token True --gpt2-size medium -t gutenbergbookchars:Spectrum -bs 6 \
-mf parlai/gpt-2-medium-spectrum -veps 0.5 --tensorboard_log True --num-epochs 2 \
 --fp16 True --text-truncate 360 --dynamic-batching full \
--validation-metric f1

## Inference

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-no-hla -t gutenbergbookchars:Spectrum -bs 34 \
  --fp16 True --dynamic-batching full \
  --metrics ppl,f1,accuracy,rouge,bleu --report-filename "gpt-2-small-no-hla-report.json"

In [None]:
# act as the pre-trained version on non-HLA.
!parlai eval_model -mf parlai/gpt-2-small-spectrum -t gutenbergbookchars:Spectrum -bs 34 \
  --fp16 True --dynamic-batching full \
  --metrics ppl,f1,accuracy,rouge,bleu --report-filename "gpt-2-small-hla-report.json"

In [None]:
# act as the pre-trained version on non-HLA.
!parlai eval_model -mf parlai/gpt-2-medium-no-hla \
  -t gutenbergbookchars:Spectrum -bs 10 --fp16 True --dynamic-batching full \
  --metrics ppl,f1,accuracy,rouge,bleu --report-filename "gpt-2-medium-no-hla-report.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-medium-spectrum \
  -t gutenbergbookchars:Spectrum -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "gpt-2-medium-hla-report.json"

In [None]:
!parlai eval_model -m hugging_face/gpt2 -t gutenbergbookchars \
  -bs 10 --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "gpt-2-medium-report.json"

In [None]:
!parlai eval_model -m hugging_face/gpt2 --gpt2-size medium \
  -t gutenbergbookchars:Spectrum -bs 10 --fp16 True \
  --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu
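
The files written through `--report-filename` can be read back for comparison. This is a minimal sketch under the assumption that each report is a JSON object with the metrics stored under a top-level `report` key; if that key is absent, the whole object is used. The helper names `load_report` and `select_metrics` are hypothetical, and the sample values in the demo are placeholders, not experimental results.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def load_report(path: str) -> Dict[str, Any]:
    """Load a report file and return its metrics dictionary."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    # Assumption: metrics live under "report"; otherwise use the payload as-is.
    if isinstance(payload, dict) and isinstance(payload.get("report"), dict):
        return payload["report"]
    return payload


def select_metrics(report: Dict[str, Any], names: Iterable[str]) -> Dict[str, Any]:
    """Keep only the requested metrics that are present in the report."""
    return {name: report[name] for name in names if name in report}


# Demo with placeholder numbers written to a temporary file.
with tempfile.TemporaryDirectory() as workdir:
    sample_path = Path(workdir) / "sample-report.json"
    sample_path.write_text(
        json.dumps({"opt": {}, "report": {"f1": 0.5, "ppl": 10.0, "exs": 3}})
    )
    demo_metrics = select_metrics(load_report(str(sample_path)), ["ppl", "f1", "bleu"])

print(demo_metrics)
```

Metrics that are missing from a report (here `bleu`) are skipped rather than raising an error, which keeps the comparison usable when runs were evaluated with different `--metrics` lists.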

# Test Zone

In [None]:
!parlai eval_model --model fixed_response --task dailydialog --fixed-response "how may i help you ?" --metrics rouge

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:S1 \
  --eval-candidates inline --batchsize 20 --report-filename "S1-bi-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:S2 \
  --eval-candidates inline --batchsize 20 --report-filename "S2-bi-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:S3 \
  --eval-candidates inline --batchsize 20 --report-filename "S3-bi-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:S4 \
  --eval-candidates inline --batchsize 20 --report-filename "S4-bi-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_bi_finetuned.checkpoint -t gutenbergbookchars:S5 \
  --eval-candidates inline --batchsize 20 --report-filename "S5-bi-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:S1 \
  --eval-candidates inline --batchsize 20 --report-filename "S1-poly-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:S2 \
  --eval-candidates inline --batchsize 20 --report-filename "S2-poly-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:S3 \
  --eval-candidates inline --batchsize 20 --report-filename "S3-poly-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:S4 \
  --eval-candidates inline --batchsize 20 --report-filename "S4-poly-model.json"

In [None]:
!parlai eval_model -mf ./parlai/model_poly_finetuned.checkpoint -t gutenbergbookchars:S5 \
  --eval-candidates inline --batchsize 20 --report-filename "S5-poly-model.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-spectrum \
  -t gutenbergbookchars:S1 -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "s1-gpt2-small.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-spectrum \
  -t gutenbergbookchars:S2 -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "s2-gpt2-small.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-spectrum \
  -t gutenbergbookchars:S3 -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "s3-gpt2-small.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-spectrum \
  -t gutenbergbookchars:S4 -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "s4-gpt2-small.json"

In [None]:
!parlai eval_model -mf parlai/gpt-2-small-spectrum \
  -t gutenbergbookchars:S5 -bs 10 \
  --fp16 True --dynamic-batching full --metrics ppl,f1,accuracy,rouge,bleu \
  --report-filename "s5-gpt2-small.json"
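
The evaluation cells of this section differ only in the model file, the `S1`..`S5` subtask, and the report name, so the commands can be generated instead of copied. This is a minimal sketch that only builds the command lines from the cells above and prints them; the helper names `ranker_command`, `gpt2_command`, and `all_commands` are hypothetical.

```python
import shlex
from typing import List

SUBTASKS = ["S1", "S2", "S3", "S4", "S5"]

# Model files of the fine-tuned rankers, as used in the cells above.
RANKERS = {
    "bi": "./parlai/model_bi_finetuned.checkpoint",
    "poly": "./parlai/model_poly_finetuned.checkpoint",
}


def ranker_command(name: str, model_file: str, subtask: str) -> List[str]:
    """Build the eval command of a retrieval model for one subtask."""
    return [
        "parlai", "eval_model", "-mf", model_file,
        "-t", f"gutenbergbookchars:{subtask}",
        "--eval-candidates", "inline", "--batchsize", "20",
        "--report-filename", f"{subtask}-{name}-model.json",
    ]


def gpt2_command(subtask: str) -> List[str]:
    """Build the eval command of the GPT-2 small spectrum model."""
    return [
        "parlai", "eval_model", "-mf", "parlai/gpt-2-small-spectrum",
        "-t", f"gutenbergbookchars:{subtask}", "-bs", "10",
        "--fp16", "True", "--dynamic-batching", "full",
        "--metrics", "ppl,f1,accuracy,rouge,bleu",
        "--report-filename", f"{subtask.lower()}-gpt2-small.json",
    ]


def all_commands() -> List[List[str]]:
    """Collect the ranker commands first, then the GPT-2 commands."""
    commands = [
        ranker_command(name, model_file, subtask)
        for name, model_file in RANKERS.items()
        for subtask in SUBTASKS
    ]
    commands.extend(gpt2_command(subtask) for subtask in SUBTASKS)
    return commands


for command in all_commands():
    # shlex.join (Python 3.8+) renders the argument list as a shell line.
    print(shlex.join(command))
```

Each printed line can be pasted into a cell with a leading `!`, or the argument lists can be passed to `subprocess.run` to execute the evaluations in sequence.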