
gcunhase/EmbraceBERT


Code for the paper "Attentively Embracing Noise for Robust Latent Representation in BERT" (to appear at COLING 2020, December 8-13, 2020).

About

EmbraceBERT adds an attentive embracement layer over BERT-encoded tokens to improve robustness in noisy text classification tasks.

Proposed model (figure in repository)

Embracement layers (figure in repository)

  • Evaluated in 3 settings:
    1. Trained and tested with complete data
    2. Trained with complete data and tested with incomplete data
    3. Trained and tested with incomplete data

Contents

  • Requirements
  • How to Use
  • Results
  • How to Cite

Requirements

Tested with Python 3.6.8, PyTorch 1.0.1.post2, CUDA 10.1

pip install --upgrade pip
pip install --default-timeout=1000 torch==1.0.1.post2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python -m spacy download en
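
To confirm the environment matches the versions above, a quick sanity check (illustrative, not part of the repository):

# check_env.py -- illustrative environment check, not part of the repository
import torch

print("PyTorch version:", torch.__version__)         # expect 1.0.1.post2
print("CUDA available:", torch.cuda.is_available())  # expect True with CUDA 10.1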

How to Use

1. Dataset

Ongoing: Ask Ubuntu, Web Applications Corpora

Data with STT errors: see the companion repository.

2. Train and evaluate the model

  • Setting 1: clean data

    • Proposed: ./scripts/settings_1run/run_proposed_setting1_clean.sh
    • Baseline: ./scripts/settings_1run/run_baseline_setting1_clean.sh
  • Setting 3: noisy data

    • Proposed: ./scripts/settings_1run/run_proposed_setting3_noisy.sh
    • Baseline: ./scripts/settings_1run/run_baseline_setting3_noisy.sh

Results

Results on the Chatbot Corpus (figure in repository)

Ablation study on: Chatbot, Snips

Ongoing: Ask Ubuntu, Web Applications

Korean: Chatbot

Reproducing the paper's ablation study

The scripts mentioned here run the model 10 times with different seeds. If you wish to run it only once, change the SEED parameter in the script.
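
For reference, a minimal Python sketch of such a seed loop, with flags mirroring the run_classifier.py commands shown under "Calculate the number of parameters" below; --do_train is an assumption, so consult the *_seeds.sh scripts for the exact training invocation:

# run_seeds.py -- hedged sketch of the seed loop in the *_seeds.sh scripts.
# Flags mirror the run_classifier.py commands shown later in this README;
# --do_train is assumed -- check the shell scripts for the exact flags.
import subprocess

for seed in range(1, 11):  # 10 runs, as in the paper's ablation study
    subprocess.run([
        "python", "run_classifier.py",
        "--seed", str(seed),
        "--task_name", "chatbot_intent",
        "--model_type", "bert",
        "--model_name_or_path", "bert-base-uncased",
        "--do_train",  # assumed flag
        "--do_lower_case",
        "--data_dir", "data/intent_processed/nlu_eval/chatbotcorpus/",
        "--max_seq_length", "128",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "3.0",
        "--output_dir", "./results/seed_%d/" % seed,
        "--overwrite_output_dir",
    ], check=True)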

A. Train Model (settings 1 and 3)

  • Proposed: all tokens (BERT, EBERT, EBERTkvq)

    # BERT with tokens
    ./scripts/[DIR_SETTING_1_OR_3]/run_bertWithTokens_classifier_seeds.sh
    # EBERT
    ./scripts/[DIR_SETTING_1_OR_3]/run_embracebert_classifier_seeds.sh
    # EBERTkvq
    ./scripts/[DIR_SETTING_1_OR_3]/run_embracebert_multiheadattention_bertkvq_classifier_seeds.sh
    
  • Baseline (BERT)

    ./scripts/[DIR_SETTING_1_OR_3]/run_bert_classifier_seeds.sh
    

B. Test model with noisy data (setting 2)

./scripts/[DIR_SETTING_2]/run_eval_with_incomplete_data.sh

Modify the script with the path to and the type of your model.

Additional information

Get mean and std from N runs

Run the Python script in the get_results directory.
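
A minimal sketch of the aggregation, assuming one eval_results.json per run and an accuracy field named "acc" (both the directory layout and the key name are assumptions; the get_results scripts are authoritative):

# aggregate_results.py -- illustrative only; see the get_results directory.
import glob
import json
import statistics

accs = []
for path in sorted(glob.glob("./results/seed_*/eval_results.json")):  # assumed layout
    with open(path) as f:
        accs.append(json.load(f)["acc"])  # the "acc" key is an assumption

print("N = %d, mean = %.4f, std = %.4f"
      % (len(accs), statistics.mean(accs), statistics.stdev(accs)))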

Calculate the number of parameters

Pass the --do_calculate_num_params flag to run_classifier.py:

# BERT
python run_classifier.py --seed 1 --task_name chatbot_intent --model_type $MODEL_NAME \
  --model_name_or_path bert-base-uncased --logging_steps 1 --do_calculate_num_params \
  --do_lower_case --data_dir data/intent_processed/nlu_eval/chatbotcorpus/ \
  --max_seq_length 128 --per_gpu_eval_batch_size=1 --per_gpu_train_batch_size=8 \
  --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir ./results/debug_num_params/ \
  --overwrite_output_dir --overwrite_cache --save_best --log_dir ./runs/debug_num_params
# EBERT
python run_classifier.py --seed 1 --task_name chatbot_intent --model_type $MODEL_NAME2 \
  --p $P_TYPE --dimension_reduction_method $DIM_REDUCTION_METHOD \
  --model_name_or_path bert-base-uncased --logging_steps 1 --do_calculate_num_params \
  --do_lower_case --data_dir data/intent_processed/nlu_eval/chatbotcorpus/ \
  --max_seq_length 128 --per_gpu_eval_batch_size=1 --per_gpu_train_batch_size=8 \
  --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir ./results/debug_num_params/ \
  --overwrite_output_dir --overwrite_cache --save_best --log_dir ./runs/debug_num_params
Parameter             Options
MODEL_NAME            [bert, bertwithatt, bertwithattclsprojection, bertwithprojection, bertwithprojectionatt]
MODEL_NAME2           [embracebert, embracebertconcatatt, embracebertwithkeyvaluequery, embracebertwithkeyvaluequeryconcatatt]
DIM_REDUCTION_METHOD  [attention, projection]
P_TYPE                [multinomial, attention_clsquery_weights]
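
For reference, the standard PyTorch idiom for counting trainable parameters, which --do_calculate_num_params presumably wraps (a generic sketch, not the repository's exact code):

# count_params.py -- generic PyTorch parameter count, shown on a toy module.
import torch

def count_parameters(model):
    # Sum the element counts of all trainable tensors in the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = torch.nn.Linear(768, 2)  # stand-in for a BERT classification head
print(count_parameters(model))   # 768 * 2 weights + 2 biases = 1538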

Generated files

File                             Description
checkpoint-best-${EPOCH_NUMBER}  Directory with the saved model
eval_results.json                Train/eval information in JSON format
eval_results.txt                 Train/eval information: eval accuracy and loss, global_step, and train loss
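
To inspect a run programmatically, eval_results.json can be loaded directly; the exact field names should be verified against an actual run:

# read_results.py -- prints every field of a run's eval_results.json.
import json

with open("./results/seed_1/eval_results.json") as f:  # path is an assumption
    results = json.load(f)

for key, value in sorted(results.items()):
    print("%s = %s" % (key, value))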

Acknowledgement

If you use this code, please cite (reference to be updated with the final version):

@inproceedings{sergio2020ebert_coling,
  author       = {Sergio, G. C. and Moirangthem, D. S. and Lee, M.},
  title        = {Attentively Embracing Noise for Robust Latent Representation in BERT},
  booktitle    = {The 28th International Conference on Computational Linguistics (COLING 2020)},
  year         = {2020},
  organization = {ACL},
  doi          = {},
}

Please email me at gwena.cs@gmail.com with any requests or questions.

Code based on HuggingFace's repository; this work builds on BERT and EmbraceNet.
