This repository contains the code used for the paper Improving Neural Language Modeling via Adversarial Training (ICML 2019).
This code was originally forked from awd-lstm-lm and MoS-awd-lstm-lm.
In addition to the method in our paper, we also implement a recently proposed regularization technique called PartialShuffle. We find that combining this technique with our method further improves the performance of language models.
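For readers unfamiliar with PartialShuffle, the idea as commonly described is to rotate each batch row of the training stream by a random offset at the start of an epoch, so that the tokens before the split point move to the end of the row. The sketch below is a pure-Python illustration only: `partial_shuffle` and its list-of-lists layout are hypothetical and are not the functions used in this repository, which works on PyTorch tensors.

```python
import random


def partial_shuffle(rows, rng=None):
    """Rotate each row by an independent random offset.

    rows: list of token lists, one per batch row (hypothetical layout).
    rng:  optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    shuffled = []
    for row in rows:
        if not row:
            shuffled.append([])
            continue
        split = rng.randrange(len(row))  # split point in [0, len(row))
        # Move the tokens before the split point to the end of the row.
        shuffled.append(row[split:] + row[:split])
    return shuffled


rows = [list(range(10)), list(range(10, 20))]
out = partial_shuffle(rows, rng=random.Random(0))
# Each row keeps exactly the same tokens; only the starting point changes.
assert all(sorted(a) == sorted(b) for a, b in zip(rows, out))
```

Because every row is only rotated, local word order is preserved almost everywhere, while the batch boundaries differ from epoch to epoch.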
The repository comes with instructions to train word-level language models on the Penn Treebank (PTB), WikiText-2 (WT2), and WikiText-103 (WT103) datasets. (The code and pre-trained model for WikiText-103 will be merged into the branch soon.)
If you use this code or our results in your research, please cite:
@InProceedings{pmlr-v97-wang19f,
title = {Improving Neural Language Modeling via Adversarial Training},
author = {Wang, Dilin and Gong, Chengyue and Liu, Qiang},
booktitle = {Proceedings of the 36th International Conference on Machine Learning},
pages = {6555--6565},
year = {2019},
editor = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
volume = {97},
series = {Proceedings of Machine Learning Research},
address = {Long Beach, California, USA},
month = {09--15 Jun},
publisher = {PMLR},
}
Although the repo is implemented in PyTorch 0.4, we have found that the post-processing step (dynamic evaluation) only works well with PyTorch 0.2. We therefore provide a patch for dynamic evaluation, which should be run under PyTorch 0.2.
We are working on fixing this problem. If you have any ideas, feel free to contact us.
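Since the note above ties dynamic evaluation to a specific PyTorch series, a launcher script could check the installed version before running it. The following is a minimal sketch under that assumption; `major_minor` and `supports_dynamic_eval` are hypothetical helpers and are not part of this repository.

```python
def major_minor(version):
    """Parse a version string such as '0.2.0_3' or '0.4.1' into (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def supports_dynamic_eval(version):
    # Assumption taken from the note above: only the 0.2 series is known to work.
    return major_minor(version) == (0, 2)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if supports_dynamic_eval(torch.__version__):
            print("PyTorch 0.2 detected: dynamic evaluation patch can be used")
        else:
            print("Dynamic evaluation patch expects PyTorch 0.2, found", torch.__version__)
```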
The mos-awd-lstm-lm folder contains the MoS-awd-lstm-lm implementation, which achieves good performance but takes a long time to train.
We first list the results without dynamic evaluation:
| Method | Valid PPL | Test PPL |
|---|---|---|
| MoS | 56.54 | 54.44 |
| MoS + PartialShuffle | 55.89 | 53.92 |
| MoS + Adv | 55.08 | 52.97 |
| MoS + Adv + PartialShuffle | 54.10 | 52.20 |
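
To put the table in perspective, the short script below computes the absolute and relative test-perplexity reduction of each variant against the MoS baseline, using only the numbers listed above. Perplexity is the exponential of the average per-token negative log-likelihood, so the script also reports the corresponding loss in nats. The variable and function names are illustrative.

```python
import math

# (valid PPL, test PPL), copied from the table above.
results = {
    "MoS": (56.54, 54.44),
    "MoS + PartialShuffle": (55.89, 53.92),
    "MoS + Adv": (55.08, 52.97),
    "MoS + Adv + PartialShuffle": (54.10, 52.20),
}

baseline_test = results["MoS"][1]


def summarize(test_ppl, baseline=baseline_test):
    """Return (absolute reduction, relative reduction in %, loss in nats)."""
    absolute = baseline - test_ppl
    relative = 100.0 * absolute / baseline
    nats = math.log(test_ppl)  # PPL = exp(average negative log-likelihood)
    return absolute, relative, nats


for name, (_, test_ppl) in results.items():
    absolute, relative, nats = summarize(test_ppl)
    print(f"{name:28s} -{absolute:.2f} PPL ({relative:.1f}%), {nats:.3f} nats/token")
```
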
To train on PTB with Adv only, run the following commands:

python3 -u main.py --data data/penn --dropouti 0.4 --dropoutl 0.29 --dropouth 0.225 --seed 28 --batch_size 12 --lr 20.0 --epoch 1000 --nhid 960 --nhidlast 620 --emsize 280 --n_experts 15 --save PTB --single_gpu --switch 200

python3 -u finetune.py --data data/penn --dropouti 0.4 --dropoutl 0.29 --dropouth 0.225 --seed 28 --batch_size 12 --lr 25.0 --epoch 1000 --nhid 960 --emsize 280 --n_experts 15 --save PATH_TO_FOLDER --single_gpu --gaussian 0 --epsilon 0.028

cp PATH_TO_FOLDER/finetune_model.pt PATH_TO_FOLDER/model.pt

and run (twice)

python3 -u finetune.py --data data/penn --dropouti 0.4 --dropoutl 0.29 --dropouth 0.225 --seed 28 --batch_size 12 --lr 25.0 --epoch 1000 --nhid 960 --emsize 280 --n_experts 15 --save PATH_TO_FOLDER --single_gpu --gaussian 0 --epsilon 0.028

cp PATH_TO_FOLDER/finetune_model.pt PATH_TO_FOLDER/model.pt

and run

python3 -u finetune.py --data data/penn --dropouti 0.4 --dropoutl 0.5 --dropouth 0.225 --seed 28 --batch_size 12 --lr 25.0 --epoch 1000 --nhid 960 --emsize 280 --n_experts 15 --save PATH_TO_FOLDER --single_gpu --gaussian 0 --epsilon 0.028

Finally, run

source search_dy_hyper.sh

to search the hyper-parameters for dynamic evaluation (lambda, epsilon, learning rate) on the validation set, and then apply them to the test set.
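
The final step above selects the dynamic-evaluation hyper-parameters on the validation set and only then evaluates on the test set. A minimal sketch of that selection logic follows. It is not the content of search_dy_hyper.sh: `evaluate` is a hypothetical callback standing in for one dynamic-evaluation run that returns perplexity, `toy_evaluate` is a synthetic stand-in so the example runs, and the grid values are placeholders.

```python
import itertools


def grid_search(evaluate, lambdas, epsilons, learning_rates):
    """Return (best_ppl, best_config) over the Cartesian product of the grids.

    evaluate(split, lamb, epsilon, lr) -> perplexity is a hypothetical callback.
    """
    best_ppl, best_cfg = float("inf"), None
    for lamb, eps, lr in itertools.product(lambdas, epsilons, learning_rates):
        ppl = evaluate("valid", lamb, eps, lr)  # tune on validation data only
        if ppl < best_ppl:
            best_ppl, best_cfg = ppl, (lamb, eps, lr)
    return best_ppl, best_cfg


def toy_evaluate(split, lamb, eps, lr):
    # Synthetic stand-in for a real run: a smooth bowl with a known minimum.
    offset = 0.0 if split == "valid" else -1.0
    return 50.0 + offset + (lamb - 0.02) ** 2 + (eps - 0.001) ** 2 + (lr - 0.002) ** 2


valid_ppl, (lamb, eps, lr) = grid_search(
    toy_evaluate,
    lambdas=[0.01, 0.02, 0.04],
    epsilons=[0.0005, 0.001],
    learning_rates=[0.001, 0.002],
)
# Apply the selected setting once to the test set; never tune on it.
test_ppl = toy_evaluate("test", lamb, eps, lr)
```

Keeping the test split out of the search loop is the point of the design: the reported test perplexity then reflects a single evaluation with a configuration chosen beforehand.
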
To use PartialShuffle, add the --partial flag. Using PartialShuffle only in the last fine-tuning run, we get 54.92 / 52.78 (validation / test perplexity). You can download the pretrained model along with the log file, or train it from scratch.
To train on WikiText-2 with Adv only, run the following commands:

python3 -u main.py --epochs 1000 --data data/wikitext-2 --save WT2 --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --nhidlast 650 --emsize 300 --batch_size 15 --lr 15.0 --dropoutl 0.29 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu --switch 200

python3 -u finetune.py --epochs 1000 --data data/wikitext-2 --save PATH_TO_FOLDER --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --emsize 300 --batch_size 15 --lr 20.0 --dropoutl 0.29 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu --gaussian 0 --epsilon 0.028

cp PATH_TO_FOLDER/finetune_model.pt PATH_TO_FOLDER/model.pt

and run (twice)

python3 -u finetune.py --epochs 1000 --data data/wikitext-2 --save PATH_TO_FOLDER --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --emsize 300 --batch_size 15 --lr 20.0 --dropoutl 0.29 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu --gaussian 0 --epsilon 0.028

cp PATH_TO_FOLDER/finetune_model.pt PATH_TO_FOLDER/model.pt

and run

python3 -u finetune.py --epochs 1000 --data data/wikitext-2 --save PATH_TO_FOLDER --dropouth 0.2 --seed 1882 --n_experts 15 --nhid 1150 --emsize 300 --batch_size 15 --lr 20.0 --dropoutl 0.5 --small_batch_size 5 --max_seq_len_delta 20 --dropouti 0.55 --single_gpu --gaussian 0 --epsilon 0.028

Finally, run

source search_dy_hyper.sh

to search the hyper-parameters for dynamic evaluation (lambda, epsilon, learning rate) on the validation set, and then apply them to the test set.
To use PartialShuffle, add the --partial flag.
The awd-lstm-lm folder contains the awd-lstm-lm implementation, which achieves good performance and takes less time to train.
To train on PTB, run the following command:

nohup python3 -u main.py --nonmono 5 --batch_size 20 --data data/penn --dropouti 0.3 --dropouth 0.25 --dropout 0.40 --alpha 2 --beta 1 --seed 141 --epoch 4000 --save ptb.pt --switch 200 >> ptb.log 2>&1 &

Then run

source search_dy_hyper.sh

to search the hyper-parameters for dynamic evaluation (lambda, epsilon, learning rate) on the validation set, and then apply them to the test set.
You can download the pretrained model along with the log file, or train it from scratch.
To train on WikiText-2, run the following command:

nohup python3 -u main.py --epochs 4000 --nonmono 5 --emsize 400 --batch_size 80 --dropouti 0.5 --data data/wikitext-2 --dropouth 0.2 --seed 1882 --save wt2.pt --gaussian 0.175 --switch 200 >> wt2.log 2>&1 &

Then run

source search_dy_hyper.sh

to search the hyper-parameters for dynamic evaluation (lambda, epsilon, learning rate) on the validation set, and then apply them to the test set.
You can download the pretrained model along with the log file, or train it from scratch.