SL4DU

Code for the paper "Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues".

Environment

Option 1: container

frontlibrary/transformers-pytorch-gpu:4.6.1-pyarrow

~$ docker run --runtime=nvidia -it --rm -v $HOME/SL4DU:/workspace frontlibrary/transformers-pytorch-gpu:4.6.1-pyarrow

Option 2: build from scratch
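A minimal sketch, assuming the only requirements are the same stack as the container image above (GPU-enabled PyTorch, transformers 4.6.1, pyarrow); pick the torch build matching your CUDA version:

    ~$ pip install torch transformers==4.6.1 pyarrow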

Reproduction steps: an example

  1. Initialize directories (see the sketch after this list)
    SL4DU
        code
        data
        pretrained
    
  2. Download code and the Ubuntu data
    ~/SL4DU/code$ git clone https://github.com/RayXu14/SL4DU.git
    ~/SL4DU/data$ wget https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu_data.zip
    ~/SL4DU/data$ unzip ubuntu_data.zip
  3. Add the bert-base-uncased pretrained model files to pretrained (see the download sketch after this list)
    • config.json
    • vocab.txt
    • pytorch_model.bin
  4. Preprocess data
    ~/SL4DU/code/SL4DU$ python3 preprocess.py --task=RS --dataset=Ubuntu --raw_data_path=../../data/ubuntu_data --pkl_data_path=../../data/ubuntu_data --pretrained_model=bert-base-uncased
  5. Reproduce BERT result
    ~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=bert-base-uncased --add_EOT --freeze_layers=0 --train_batch_size=8 --eval_batch_size=100 --log_dir=? # --pkl_valid_file=test.pkl
  6. Add post-ubuntu-bert-base-uncased in pretrained
    • Download Whang's Ubuntu checkpoint and use deprecated/whangpth2bin.py to convert it into our format; compared to bert-base-uncased, you only need to increase the vocab size by 1 in config.json and append a new token [EOS] to the end of vocab.txt (see the sketch after this list)
    • Or use our pretrained models (already converted) instead
  7. Reproduce BERT-VFT result
    ~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --train_batch_size=16 --eval_batch_size=100 --log_dir=? #--pkl_valid_file=test.pkl
  8. Reproduce SL4RS result
    ~/SL4DU/code/SL4DU$ python3 -u train.py --save_ckpt --task=RS --dataset=Ubuntu --pkl_data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --train_batch_size=4 --eval_batch_size=100 --log_dir=? --use_NSP --use_UR --use_ID --use_CD --train_view_every=80 #--pkl_valid_file=test.pkl
  9. Evaluation
    ~/SL4DU/code/SL4DU$ python3 -u eval.py --task=Ubuntu --data_path=../../data/ubuntu_data --pretrained_model=post-ubuntu-bert-base-uncased --freeze_layers=8 --eval_batch_size=100 --log_dir ? --load_path=?
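For step 1, the directory layout can be created in one command (assuming everything lives under your home directory, as in the paths above):

    ~$ mkdir -p SL4DU/code SL4DU/data SL4DU/pretrained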
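For step 3, the three files can be downloaded from the Hugging Face Hub; the URLs below assume the standard bert-base-uncased release, and the directory name assumes the model lives in a pretrained/ subdirectory named after the --pretrained_model value:

    ~/SL4DU/pretrained$ mkdir bert-base-uncased
    ~/SL4DU/pretrained$ cd bert-base-uncased
    ~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/config.json
    ~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
    ~/SL4DU/pretrained/bert-base-uncased$ wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin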
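For step 6, the vocabulary patch relative to bert-base-uncased amounts to the following sketch; it assumes the standard 30522-token vocabulary and the same directory convention as above:

    ~/SL4DU/pretrained/post-ubuntu-bert-base-uncased$ echo "[EOS]" >> vocab.txt
    ~/SL4DU/pretrained/post-ubuntu-bert-base-uncased$ sed -i 's/"vocab_size": 30522/"vocab_size": 30523/' config.json  # +1 for the new token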

Pretraining by yourself

Using Whang's repo

Remember to convert the saved model to our format using deprecated/whangpth2bin.py.

Additional information on pretraining settings

Set the number of epochs to 2 for post-training with the data duplicated 10 times, and set the virtual batch size to 384.
