TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding (NAACL 2022)

Pytorch Implementation of TreeMix

Abstract

Data augmentation is an effective approach to tackle over-fitting. Many previous works have proposed different data augmentations strategies for NLP, such as noise injection, word replacement, back-translation etc. Though effective, they missed one important characteristic of language–compositionality, meaning of a complex expression is built from its subparts. Motivated by this, we propose a compositional data augmentation approach for natural language understanding called TreeMix. Specifically, TreeMix leverages constituency parsing tree to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them to generate new sentences. Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn compositionality of NLP data. Extensive experiments on text classification and semantic parsing benchmarks demonstrate that TreeMix outperforms current stateof-the-art data augmentation methods.

Code Structure

|__ DATA
	|__ SST2 
		|__ data
			|__ train.csv --> raw train dataset
			|__ test.csv --> raw test dataset
			|__ train_parsing --> consituency parsing results
		|__ generated 
			|__ times2_min0_seed0_0.3_0.1_7k --> augmentation dataset(hugging face dataset format) with 2 times bigger, seed 0, lambda_L=0.1, lambda_U=0.3, total size=7k
		|__ logs --> best results log
		|__ runs --> tensorboard results
			|__ aug --> augmentation only baseline
			|__ raw --> standard baseline
			|__ raw_aug --> TreeMix results
|__ checkpoints 
	|__ best.pt --> Best model checkpoints
|__ process_data / --> Download & Semantic parsing using Stanfordcorenlp tooktiks
	|__ Load_data.py --> Loading raw dataset and augmentation dataset
	|__ get_data.py --> Download all dataset from huggingface dataset and perform constituency parsing to obtain processed dataset
	|__ settings.py --> Hyperparameter settings & Task settings
|__ online_augmentation 
	|__ __init__.py --> Random Mixup 
|__ Augmentation.py --> Subtree Exchange augmentation method based on consituency parsing results for all dataset(single sentence classification, sentence relation classification, SCAN dataset)
|__ run.py --> Train for one dataset
|__ batch_train.py -> Train with different datasets and different settings, specified by giving specific arguments

Getting Started

pip install -r requirements.txt

Note that to successfully run TreeMix, you must install stanfordcorenlp. Please refer to this

corenlp for more information.

Download & Constituency Parsing

cd process_data
python get_data.py --data {DATA} --corenlp_dir {CORENLP}

DATA indicates the dataset name, CORENLP indicates the directory of stanfordcorenlp . After this process, you could get corresponding data folder in DATA/ and train_parsing.csv.

TreeMix Augmentation

python Augmentation.py --data {DATASET} --times {TIMES} --lam1 {LAM1} --lam2 {LAM2} --seeds {SEEDS}

Augmentation with different arguments, will generate #(TIMES)×#(SEEDS) extra dataset. DATASET could be list of data name such as 'sst2 rte', since 'trec' has two versions, you need to input --label_name {} to specify whether the trec-fine or trec-coarse set. Besides, by typing --low_resource , it will generated partial augmentation dataset as well as partial train set. You can modify the hyperparameter lambda_U and lambda_L by changing LAM1 and LAM2 . TIMES could be a list of intergers such as 2,5 to assign the size of the augmentation dataset.

Model Training

python batch_train.py --mode {MODE} --data {DATASET}

Evaluation one dataset for all its augmenation set with specific mode. MODE can be 'raw', 'aug', 'raw_aug', which indicates train the model with raw dataset only, augmentation dataset only and combination of raw and augmentation set respectively. DATASET should be one specific dataset name. This will report all results of a specifc dataset in /log. If not specified, the hyperparameters will be set as in /process_data/settings.py, please look into this file for more arguments information.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
online_augmentation		online_augmentation
process_data		process_data
transformers_doc/pytorch		transformers_doc/pytorch
ACL_2022_TreeMix.pdf		ACL_2022_TreeMix.pdf
Augmentation.py		Augmentation.py
LICENSE		LICENSE
README.md		README.md
batch_train.py		batch_train.py
requirements.txt		requirements.txt
run.py		run.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

online_augmentation

online_augmentation

process_data

process_data

transformers_doc/pytorch

transformers_doc/pytorch

ACL_2022_TreeMix.pdf

ACL_2022_TreeMix.pdf

Augmentation.py

Augmentation.py

LICENSE

LICENSE

README.md

README.md

batch_train.py

batch_train.py

requirements.txt

requirements.txt

run.py

run.py

Repository files navigation

TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding (NAACL 2022)

Abstract

Code Structure

Getting Started

Download & Constituency Parsing

TreeMix Augmentation

Model Training

About

Releases

Packages

Languages

License

lezhang7/TreeMix

Folders and files

Latest commit

History

Repository files navigation

TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding (NAACL 2022)

Abstract

Code Structure

Getting Started

Download & Constituency Parsing

TreeMix Augmentation

Model Training

About

Resources

License

Stars

Watchers

Forks

Languages