This is the repo for the paper AugCSE: Contrastive Sentence Embedding with Diverse Augmentations (AACL 2022, Link).
We used a diverse set of natural language augmentations and a discriminator loss to improve general-purpose sentence embedding
representations (in English). We show that simple rule-based augmentations can be just as effective at improving sentence
representations as language-model-based augmentations, which can be expensive and biased.
In this repo, you will find:
- 100+ natural language augmentations (from NL-Augmenter)
- Training script for AugCSE with different options (see training arguments):
- loss variations (triplet, ranking (SimCSE), contrastive, etc.) link
- projection layer variations (linear, mixture of experts) link
- discriminator variations (binary, predict label, predict order) link
- positive, negative, or neutral augmentations link
- augmentation automatic labeling and sampling variations link
- supervised vs. unsupervised objective link
- single sentence vs. pair sentence classification link
- Evaluation script for STS and SentEval (adapted from SimCSE)
- Various experimentation and analysis scripts
If you have any questions, please contact zilutang@bu.edu, or submit issues / PRs.
conda create --name augmentation python=3.8
conda activate augmentation
pip install -r requirements.txt
Both the original and augmented data for wiki1m can be downloaded; read more here. Once the augmented data is downloaded and unzipped, the folder should be placed at the root of the project for the training script to work properly:
AugCSE/dump/{AUGMENTATION}/wiki1m_for_simcse_train_100.csv
To reproduce our results (on STS) with the AugCSE framework, run the following training script:
./experiments/run_train_exp.py
The main arguments are explained as follows (for a more detailed explanation, see the actual arguments):
Arg | Type | Description |
---|---|---|
model_name_or_paths | string (e.g. bert-base-uncased) | Lets you switch to any backbone model supported in Hugging Face. |
contrastive_loss_type | string {ranking, ranking_pos_neg, ranking_neg, triplet, alignment, contrastive, hard_contrastive} | We follow the naming convention of the losses in Sentence-BERT. The ranking loss is equivalent to the SimCSE objective; in our model, it is combined with the discriminative loss to achieve SOTA. The ranking_pos_neg loss takes in both positive and negative augmentations, as the next example will show. |
diff_cse_mode | boolean | Changes the loss calculation so that augmentations are used only in the discriminative loss, not the contrastive loss. Turn this off to use augmentations in both losses. |
positive_augmentations | List[string] (comma separated) | For any of positive, negative, or neutral augmentations, you can specify a comma-separated list of augmentation names. These are the class names (not directory names) of the augmentations in the transformations/ directory. You may also build your own augmentations and invoke them here. In the AugCSE setting, positive augmentations are used for the discriminative loss, not the contrastive loss. |
uniform_augmentation_sampling | boolean | Whether, at every datapoint index, to select uniformly from one of the specified augmentations. Must be false when using positives and negatives (as in the second example training script). |
remove_no_augmentations | boolean | Each augmentation has a perturbation rate and may not apply to every sentence, so this argument dictates whether we throw away datapoints that were not augmented. |
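For illustration, these arguments might be combined like so (augmentation names and values here are placeholders; check the actual training scripts for the real defaults):

```shell
./experiments/run_train_exp.py \
  --model_name_or_paths "bert-base-uncased" \
  --contrastive_loss_type "ranking" \
  --positive_augmentations "RandomWordAugmentation" \
  --uniform_augmentation_sampling true \
  --remove_no_augmentations false
```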
In the second example training script, you can train a model with the original sentence as the positive augmentation (the SimCSE objective) and RandomDeletion as the negative augmentation. Or switch in whatever augmentation you want! This is the script we used to obtain the results in the motivational case-study experiments:
./experiments/run_train_pos_neg_simcse.py
The important arguments here are:
Arg | Type | Description |
---|---|---|
negative_augmentations | List[string] (comma separated) | List of augmentations to use as negatives (in addition to in-batch negatives). Since the contrastive loss requires a positive paired with the negatives, if no positive is specified, it defaults to the original sentence (with independent dropout). |
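As a reference point for what the ranking-style losses compute, here is a minimal numpy sketch of an InfoNCE-style objective with in-batch negatives plus one explicit hard negative per anchor (illustrative only; the repo's actual losses may differ in detail, and the function name is ours):

```python
import numpy as np

def ranking_pos_neg_loss(anchor, positive, negative, temperature=0.05):
    """InfoNCE-style ranking loss: each anchor should match its own positive
    against all in-batch positives plus all explicit hard negatives.
    anchor, positive, negative: (batch, dim) raw embeddings."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    # Cosine similarities: anchor vs. every positive (off-diagonal entries act
    # as in-batch negatives) and anchor vs. every hard negative.
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature
    # Numerically stable log-softmax; the target is the diagonal entry,
    # i.e. the matching positive.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_probs[:, : len(a)]))
```

A quick sanity check: aligned anchor/positive pairs should yield a much smaller loss than mismatched ones.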
- If you just want to try different augmentations as positives, and only positives:
--positive_augmentations "aug1,aug2,..."
--contrastive_loss_type "ranking" # or "triplet", "alignment", "contrastive", "hard_contrastive"
- If you want to resample augmentations (since we pick an augmentation method at each index, and all augmented datasets are cached):
--resample_augmentation_per_epoch true
- Not sure whether an augmentation should be positive or negative? We added neutral_augmentations just for this reason:
--neutral_augmentations "aug1,aug2,..."
--augmentation_label_method "lm_uniform"
--augmentation_label_ppl_scalar 1.0
--augmentation_label_sim_scalar 0.2
Now here are a couple of new args that we should explain:
Arg | Type | Description |
---|---|---|
neutral_augmentations | List[string] (comma separated) | List of augmentations to use. Whether each one is treated as positive or negative is determined by augmentation_label_method. |
augmentation_label_method | string {pessimistic, optimistic, lm_uniform} | pessimistic treats all augmentations as negative, optimistic treats all as positive, and lm_uniform uses the perplexity difference or the similarity between the augmented and original sentence as the label for positivity (ppl and sim need to be pre-computed). |
augmentation_label_ppl_scalar | float | Scaling factor multiplied by the perplexity standard deviation to determine negative augmentations. The larger this value, the more positive labels. |
augmentation_label_sim_scalar | float | Scaling factor multiplied by the similarity standard deviation to determine negative augmentations. The larger this value, the more negative labels. |
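One plausible reading of the lm_uniform rule, consistent with the scalar behavior described in the table (this is a sketch of our description, not the repo's actual implementation; the function and variable names are hypothetical):

```python
import numpy as np

def label_augmentations(ppl_diff, sim, ppl_scalar=1.0, sim_scalar=0.2):
    """Return +1 (positive) / -1 (negative) per augmented sentence.
    ppl_diff: pre-computed perplexity difference (augmented vs. original).
    sim: pre-computed similarity between augmented and original sentence."""
    # A larger ppl_scalar raises the threshold -> fewer perplexity-based
    # negatives -> more positive labels (matching the table above).
    ppl_negative = ppl_diff > ppl_scalar * ppl_diff.std()
    # A larger sim_scalar raises the similarity bar -> more negative labels.
    sim_negative = sim < sim.mean() + sim_scalar * sim.std()
    return np.where(ppl_negative | sim_negative, -1, 1)
```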
Now, a warning: while this may seem like a cool idea, it has not worked well for us. Still, dynamic augmentation labeling should theoretically be able to address a core issue in NLP augmentation: the semantic change an augmentation brings to each sentence is not uniform. Through some metric in the future, we hope to quantify that change and find the exact heuristics that can label augmentations automatically.
- Suppose you want to use a new augmentation that is not already cached and downloadable through the drive folder. How does the cache get created in the first place? How do you specify augmentation-specific arguments?
Our code works like this: if the augmentation cache is not found, each augmentation method is run over the whole input
dataset of N datapoints. Once all augmentations are created (N × A, where A is the number of augmentations you want to use),
we go through each datapoint 1:N, pick an augmentation at each index, and combine all samples into another
cached dataset. This is where resample_augmentation_per_epoch or regenerate_augmentation_per_epoch comes into play.
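The two-stage flow just described can be sketched as follows (a hypothetical helper, not the repo's actual API; augmentations here are plain callables returning a list of strings):

```python
import random
from typing import Callable, List

def build_augmented_cache(
    sentences: List[str],
    augmentations: List[Callable[[str], List[str]]],
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)
    # Stage 1: run every augmentation over the whole dataset -> N x A
    # candidates (the expensive part, which is why results are cached).
    candidates = [[aug(s)[0] for aug in augmentations] for s in sentences]
    # Stage 2: at each index, pick one augmentation and combine into the
    # final cached dataset. Re-running only this stage corresponds to
    # resample_augmentation_per_epoch, while regenerate_augmentation_per_epoch
    # would redo stage 1 as well.
    return [rng.choice(row) for row in candidates]
```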
But ideally, you generate augmentations once and you are done (some augmentations, especially language-model-based
ones, are very expensive to run). Now, here are a couple of args related to augmentation generation:
Arg | Type | Description |
---|---|---|
augmentation_batch_size | integer | If an augmentation supports batched transformation (important for GPU-based ones), you may specify the batch size here. An example of a batched transformation is DiverseParaphrase. |
augmentation_init_args | Dict[string, Any] | An augmentation may have parameters of its own (e.g. a random drop rate); in that case, you can pass a JSON-encoded kwargs dict as an input argument to initialize it. An example transformation is RandomWordAugmentation. |
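As a small illustration of augmentation_init_args, the JSON string is decoded into keyword arguments before the augmentation class is constructed (the exact plumbing is an assumption; add_tag matches the example augmentation class shown later in this README):

```python
import json

# A JSON-encoded kwargs string, as it might be passed via
# --augmentation_init_args (illustrative values).
raw = '{"add_tag": true, "seed": 42}'
kwargs = json.loads(raw)
# The training script could then construct the augmentation as, e.g.:
#   myInvertAugmentation(**kwargs)
```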
For the full list of augmentations, you can browse the /transformations directory, check out the NL-Augmenter paper's Appendix
Section B, or go to their most up-to-date repo and see if any interesting new augmentations have been added. If you still
can't find anything, pick a directory within /transformations and copy its structure. We created several
(e.g. RandomWordAugmentation) for our experiments.
The directory would look like:
/transformations
|- /myAugmentation
|----__init__.py
|----README.md
|----requirements.txt
|----transformation.py
|----test.json
Most of the files relate to the NL-Augmenter robustness tests; the most important file is transformation.py.
from typing import List

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


class myInvertAugmentation(SentenceOperation):
    tasks = [TaskType.TEXT_CLASSIFICATION, TaskType.TEXT_TO_TEXT_GENERATION]
    # class variables like these can be used for filtering / search
    languages = ["en"]
    keywords = []

    def __init__(self, add_tag: bool = False, seed=0):
        """
        init methods can contain arguments which can be set in
        training scripts through `augmentation_init_args`
        """
        super().__init__(seed)
        self.add_tag = add_tag

    def generate(self, sentence: str) -> List[str]:
        """
        during generation, this method is called on each example.
        Each sentence is expected to return a list of augmentations
        (could be more than one). In our training framework, we randomly
        select one if more than one is returned.
        """
        augmented_text = " ".join(sentence.split()[::-1])
        if self.add_tag:
            augmented_text += " very nice."
        return [augmented_text]

    def generate_batch(self, sentences: List[str]) -> List[List[str]]:
        """
        batched generation is very useful for GPU-intensive augmentations
        """
        augmented_sentences = [self.generate(s) for s in sentences]
        return augmented_sentences
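To sanity-check the behavior of the class above without the repo's SentenceOperation base class, the core of generate can be exercised standalone (a minimal stand-in for illustration, not the actual interface):

```python
from typing import List

class StandaloneInvert:
    """Minimal stand-in for myInvertAugmentation, without the
    SentenceOperation base class, to show what generate returns."""

    def __init__(self, add_tag: bool = False):
        self.add_tag = add_tag

    def generate(self, sentence: str) -> List[str]:
        # reverse the word order, optionally appending a tag
        augmented_text = " ".join(sentence.split()[::-1])
        if self.add_tag:
            augmented_text += " very nice."
        return [augmented_text]

print(StandaloneInvert().generate("the quick brown fox"))
# ['fox brown quick the']
```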
And just like that, your augmentation can be used in your training script (after it has been cached, of course). For example, you can train with it as a positive augmentation using the SimCSE objective:
--positive_augmentations "myInvertAugmentation"
--contrastive_loss_type "ranking"
For evaluation on STS or SentEval, first download the SentEval datasets:
cd SentEval/data/downstream && ./download_dataset.sh
Then, run evaluation script:
./run_eval.sh
For the best hyperparameters, please refer to Appendix A.7 of the paper.
This paper only shows the beginning of the usefulness of rule-based augmentations (especially in low-resource settings). A robust, domain-agnostic sentence embedding can be useful for many downstream tasks, from natural language understanding to text generation. Specific rule-based augmentations can also be extremely valuable for instilling invariances in the model, which becomes especially valuable for AI fairness (the discriminative loss vs. instance contrastive loss also relates nicely to improving group and individual fairness). If you want to collaborate on any of the above areas, feel free to shoot us an email! We are excited to hear how our repo can be made useful for your experimentation!
To cite our paper, please use:
@inproceedings{tang-etal-2022-augcse,
title = "{A}ug{CSE}: Contrastive Sentence Embedding with Diverse Augmentations",
author = "Tang, Zilu and
Kocyigit, Muhammed Yusuf and
Wijaya, Derry Tanti",
booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = nov,
year = "2022",
address = "Online only",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.aacl-main.30",
pages = "375--398",
abstract = "Data augmentation techniques have been proven useful in many applications in NLP fields. Most augmentations are task-specific, and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework to utilize diverse sets of data augmentations to achieve a better, general-purpose, sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With the finetuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.",
}
This repo is adapted from NL-Augmenter and SimCSE.