ATPapers

Worth-reading papers and related resources on Attention Mechanism, Transformer and Pretrained Language Model (PLM) such as BERT.

Suggestions for fixing errors or adding papers, repositories, and other resources are welcome!

Since I am Chinese, I mainly focus on Chinese resources. Recommendations of excellent resources in English or other languages are welcome!

值得一读的注意力机制、Transformer和预训练语言模型论文与相关资源集合。

欢迎修正错误以及新增论文、代码仓库与其他资源等建议!

Attention

Papers

  • Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. (ICML 2015) [paper] - Hard & Soft Attention
  • Effective Approaches to Attention-based Neural Machine Translation. Minh-Thang Luong, Hieu Pham, Christopher D. Manning. (EMNLP 2015) [paper] - Global & Local Attention
  • Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. (ICLR 2015) [paper]
  • Non-local Neural Networks. Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He. (CVPR 2018) [paper][code]
  • Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. Gongbo Tang, Mathias Müller, Annette Rios, Rico Sennrich. (EMNLP 2018) [paper]
  • Phrase-level Self-Attention Networks for Universal Sentence Encoding. Wei Wu, Houfeng Wang, Tianyu Liu, Shuming Ma. (EMNLP 2018) [paper]
  • Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling. Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang. (ICLR 2018) [paper][code] - Bi-BloSAN
  • Efficient Attention: Attention with Linear Complexities. Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li. (CoRR 2018) [paper][code]
  • Leveraging Local and Global Patterns for Self-Attention Networks. Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, Lidia S. Chao. (ACL 2019) [paper] [tf code][pt code]
  • Attention over Heads: A Multi-Hop Attention for Neural Machine Translation. Shohei Iida, Ryuichiro Kimura, Hongyi Cui, Po-Hsuan Hung, Takehito Utsuro, Masaaki Nagata. (ACL 2019) [paper]
  • Are Sixteen Heads Really Better than One?. Paul Michel, Omer Levy, Graham Neubig. (NeurIPS 2019) [paper]
  • Attention is not Explanation. Sarthak Jain, Byron C. Wallace. (NAACL 2019) [paper]
  • Attention is not not Explanation. Sarah Wiegreffe, Yuval Pinter. (EMNLP 2019) [paper]
  • Is Attention Interpretable?. Sofia Serrano, Noah A. Smith. (ACL 2019) [paper]
  • Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words?. Cansu Sen, Thomas Hartvigsen, Biao Yin, Xiangnan Kong, Elke Rundensteiner. (ACL 2020) [paper] - YELP-HAT
  • The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?. Jasmijn Bastings, Katja Filippova. (BlackboxNLP 2020) [paper]
  • Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui. (EMNLP 2020) [paper][code]
  • Approximating How Single Head Attention Learns. Charlie Snell, Ruiqi Zhong, Dan Klein, Jacob Steinhardt. (CoRR 2021) [paper]
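
For orientation, here is a minimal sketch (not code from any of the papers above) of the soft attention popularized by the Bahdanau and Luong papers, shown in its simplest dot-product form: a query is scored against a set of keys, the scores are softmax-normalized, and the output is the resulting weighted sum of values. Function and variable names are illustrative only.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Soft attention: weight each value by how well its key matches the query.

    query:  (d,)      a single query vector
    keys:   (n, d)    one key per source position
    values: (n, d_v)  one value per source position
    """
    scores = keys @ query                    # (n,) dot-product similarity
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # (d_v,) convex combination of values

# toy check: 4 source positions, 8-dimensional keys and values
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(soft_attention(q, K, V).shape)  # (8,)
```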

Survey & Review

  • An Attentive Survey of Attention Models. Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, Varun Mithal. (IJCAI 2019) [paper]

English Blog

Chinese Blog

Repositories

Transformer

Papers

  • Attention is All you Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (NIPS 2017) [paper][code] - Transformer
  • Weighted Transformer Network for Machine Translation. Karim Ahmed, Nitish Shirish Keskar, Richard Socher. (CoRR 2017) [paper][code]
  • Accelerating Neural Transformer via an Average Attention Network. Biao Zhang, Deyi Xiong, Jinsong Su. (ACL 2018) [paper][code] - AAN
  • Self-Attention with Relative Position Representations. Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. (NAACL 2018) [paper] [unofficial code]
  • Universal Transformers. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser. (ICLR 2019) [paper][code]
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, Ruslan Salakhutdinov. (ACL 2019) [paper]
  • Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov. (ACL 2019) [paper]
  • Star-Transformer. Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. (NAACL 2019) [paper]
  • Generating Long Sequences with Sparse Transformers. Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever. (CoRR 2019) [paper][code]
  • Memory Transformer Networks. Jonas Metzger. (CS224n Winter 2019 Reports) [paper]
  • Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel. Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov. (EMNLP 2019) [paper][code]
  • Transformers without Tears: Improving the Normalization of Self-Attention. Toan Q. Nguyen, Julian Salazar. (IWSLT 2019) [paper][code]
  • TENER: Adapting Transformer Encoder for Named Entity Recognition. Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. (CoRR 2019) [paper]
  • Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun. (CoRR 2019) [paper][code]
  • Compressive Transformers for Long-Range Sequence Modelling. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap. (ICLR 2020) [paper][code]
  • Reformer: The Efficient Transformer. Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. (ICLR 2020) [paper] [code 1][code 2][code 3]
  • On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. (ICML 2020) [paper]
  • Lite Transformer with Long-Short Range Attention. Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han. (ICLR 2020) [paper][code]
  • ReZero is All You Need: Fast Convergence at Large Depth. Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, Julian McAuley. (CoRR 2020) [paper] [code] [related Chinese post]
  • Improving Transformer Models by Reordering their Sublayers. Ofir Press, Noah A. Smith, Omer Levy. (ACL 2020) [paper]
  • Highway Transformer: Self-Gating Enhanced Self-Attentive Networks. Yekun Chai, Jin Shuo, Xinwen Hou. (ACL 2020) [paper][code]
  • Efficient Content-Based Sparse Attention with Routing Transformers. Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier. (TACL 2020) [paper][code]
  • HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, Song Han. (ACL 2020) [paper][code]
  • Longformer: The Long-Document Transformer. Iz Beltagy, Matthew E. Peters, Arman Cohan. (CoRR 2020) [paper][code]
  • Talking-Heads Attention. Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou. (CoRR 2020) [paper]
  • Synthesizer: Rethinking Self-Attention in Transformer Models. Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng. (CoRR 2020) [paper]
  • Linformer: Self-Attention with Linear Complexity. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma. (CoRR 2020) [paper]
  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. (ICML 2020) [paper][code][project]
  • Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. (CoRR 2020) [paper][code]
  • Fast Transformers with Clustered Attention. Apoorv Vyas, Angelos Katharopoulos, François Fleuret. (CoRR 2020) [paper][code]
  • Memory Transformer. Mikhail S. Burtsev, Grigory V. Sapunov. (CoRR 2020) [paper]
  • Multi-Head Attention: Collaborate Instead of Concatenate. Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi. (CoRR 2020) [paper][code]
  • Big Bird: Transformers for Longer Sequences. Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. (CoRR 2020) [paper]
  • Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler. (CoRR 2020) [paper]
  • Rethinking Attention with Performers. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller. (CoRR 2020) [paper][code][pytorch version][blog]
  • Long Range Arena: A Benchmark for Efficient Transformers. Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler. (CoRR 2020) [paper][code]
  • Very Deep Transformers for Neural Machine Translation. Xiaodong Liu, Kevin Duh, Liyuan Liu, Jianfeng Gao. (CoRR 2020) [paper][code]
  • DeLighT: Very Deep and Light-weight Transformer. Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi. (CoRR 2020) [paper][code]
  • FastFormers: Highly Efficient Transformer Models for Natural Language Understanding. Young Jin Kim, Hany Hassan Awadalla. (SustaiNLP 2020 at EMNLP 2020) [paper][code]
  • RealFormer: Transformer Likes Residual Attention. Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie. (CoRR 2020) [paper]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. William Fedus, Barret Zoph, Noam Shazeer. (CoRR 2021) [paper]
  • Mask Attention Networks: Rethinking and Strengthen Transformer. Zhihao Fan, Yeyun Gong, Dayiheng Liu, Zhongyu Wei, Siyuan Wang, Jian Jiao, Nan Duan, Ruofei Zhang, Xuanjing Huang. (NAACL 2021) [paper] - MAN
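
Several entries above (e.g., Linformer, Performer, Routing Transformers, and "Transformers are RNNs") target the quadratic cost of full softmax attention over a length-n sequence. As a rough illustration of the kernelized linear-attention idea in the spirit of Katharopoulos et al., the sketch below replaces softmax(QKᵀ)V with a factorization φ(Q)(φ(K)ᵀV), so the n×n attention matrix is never materialized. It is a simplification for intuition, not a faithful reimplementation of any listed paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Full attention: materializes an (n, n) score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """Kernelized attention: with a positive feature map phi (here elu(x) + 1),
    the output is phi(Q) @ (phi(K).T @ V), normalized per row, at O(n * d * d_v)
    cost instead of O(n^2)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                      # (d, d_v), summed over all positions once
    z = Qp @ Kp.sum(axis=0)            # (n,) per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (6, 8) (6, 8)
```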

Chinese Blog

English Blog

Repositories

Pretrained Language Model

Models

  • Deep Contextualized Word Representations (NAACL 2018) [paper] - ELMo
  • Universal Language Model Fine-tuning for Text Classification (ACL 2018) [paper] - ULMFiT
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (NAACL 2019) [paper][code][official PyTorch code] - BERT
  • Improving Language Understanding by Generative Pre-Training (CoRR 2018) [paper] - GPT
  • Language Models are Unsupervised Multitask Learners (CoRR 2019) [paper][code] - GPT-2
  • MASS: Masked Sequence to Sequence Pre-training for Language Generation (ICML 2019) [paper][code] - MASS
  • Unified Language Model Pre-training for Natural Language Understanding and Generation (CoRR 2019) [paper][code] - UNILM
  • Multi-Task Deep Neural Networks for Natural Language Understanding (ACL 2019) [paper][code] - MT-DNN
  • 75 Languages, 1 Model: Parsing Universal Dependencies Universally (EMNLP 2019) [paper][code] - UDify
  • Defending Against Neural Fake News (CoRR 2019) [paper][code] - Grover
  • ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (CoRR 2019) [paper] - ERNIE 2.0 (Baidu)
  • Pre-Training with Whole Word Masking for Chinese BERT (CoRR 2019) [paper] - Chinese-BERT-wwm
  • SpanBERT: Improving Pre-training by Representing and Predicting Spans (CoRR 2019) [paper] - SpanBERT
  • XLNet: Generalized Autoregressive Pretraining for Language Understanding (CoRR 2019) [paper][code] - XLNet
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach (CoRR 2019) [paper] - RoBERTa
  • NEZHA: Neural Contextualized Representation for Chinese Language Understanding (CoRR 2019) [paper][code] - NEZHA
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (CoRR 2019) [paper][code] - Megatron-LM
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (CoRR 2019) [paper][code] - T5
  • BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (CoRR 2019) [paper] - BART
  • ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations (CoRR 2019) [paper][code] - ZEN
  • The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service (CoRR 2019) [paper][code] - BAAI-JDAI-BERT
  • UER: An Open-Source Toolkit for Pre-training Models (EMNLP 2019) [paper][code] - UER
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (ICLR 2020) [paper] - ELECTRA
  • StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding (ICLR 2020) [paper] - StructBERT
  • FreeLB: Enhanced Adversarial Training for Language Understanding (ICLR 2020) [paper][code] - FreeLB
  • HUBERT Untangles BERT to Improve Transfer across NLP Tasks (CoRR 2019) [paper] - HUBERT
  • CodeBERT: A Pre-Trained Model for Programming and Natural Languages (CoRR 2020) [paper] - CodeBERT
  • ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training (CoRR 2020) [paper] - ProphetNet
  • ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation (CoRR 2020) [paper][code] - ERNIE-GEN
  • Efficient Training of BERT by Progressively Stacking (ICML 2019) [paper][code] - StackingBERT
  • PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination (CoRR 2020) [paper][code]
  • Towards a Human-like Open-Domain Chatbot (CoRR 2020) [paper] - Meena
  • UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training (CoRR 2020) [paper][code] - UNILMv2
  • Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space (CoRR 2020) [paper][code] - Optimus
  • SegaBERT: Pre-training of Segment-aware BERT for Language Understanding. He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Ming Li. (CoRR 2020) [paper]
  • Adversarial Training for Large Neural Language Models. Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, Jianfeng Gao. (CoRR 2020) [paper][code] - ALUM
  • MPNet: Masked and Permuted Pre-training for Language Understanding (CoRR 2020) [paper][code] - MPNet
  • Language Models are Few-Shot Learners (CoRR 2020) [paper][code] - GPT-3
  • SPECTER: Document-level Representation Learning using Citation-informed Transformers (ACL 2020) [paper] - SPECTER
  • Recipes for building an open-domain chatbot (CoRR 2020) [paper][post][code] - Blender
  • Revisiting Pre-Trained Models for Chinese Natural Language Processing. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, Guoping Hu. (Findings of EMNLP 2020) [paper][code][blog] - MacBERT
  • PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning (CoRR 2020) [paper][code] - PLATO-2
  • DeBERTa: Decoding-enhanced BERT with Disentangled Attention (CoRR 2020) [paper][code] - DeBERTa
  • ConvBERT: Improving BERT with Span-based Dynamic Convolution. Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. (CoRR 2020) [paper][code]
  • AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization. Xinsong Zhang, Hang Li. (CoRR 2020) [paper]
  • CharBERT: Character-aware Pre-trained Language Model. Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, Shijin Wang, Guoping Hu. (COLING 2020) [paper][code][blog]
  • MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining. Wei Zhu. (CoRR 2020) [paper]
  • Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees. Jiangang Bai, Yujing Wang, Yiren Chen, Yaming Yang, Jing Bai, Jing Yu, Yunhai Tong. (EACL 2021) [paper]
  • All NLP Tasks Are Generation Tasks: A General Pretraining Framework. Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang. (CoRR 2021) [paper][code] - GLM
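
Most of the encoder-style models above train with some variant of BERT's masked language modeling (MLM) objective: about 15% of tokens are selected, and each selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and kept unchanged 10% of the time. Below is a minimal, framework-free sketch of that corruption step; real implementations operate on subword IDs and may use whole-word or span masking (e.g., BERT-wwm, SpanBERT), and all names here are illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", select_prob=0.15, seed=None):
    """BERT-style MLM corruption: returns (corrupted_tokens, labels), where
    labels[i] holds the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                       # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)                      # not part of the MLM loss
    return corrupted, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
print(mask_tokens(sentence, vocab, seed=0))
```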

Multi-Modal

  • VideoBERT: A Joint Model for Video and Language Representation Learning. Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid. (ICCV 2019) [paper]
  • Learning Video Representations using Contrastive Bidirectional Transformer. Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid. (CoRR 2019) [paper] - CBT
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. (NeurIPS 2019) [paper][code]
  • VisualBERT: A Simple and Performant Baseline for Vision and Language. Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. (CoRR 2019) [paper][code]
  • Fusion of Detected Objects in Text for Visual Question Answering. Chris Alberti, Jeffrey Ling, Michael Collins, David Reitter. (EMNLP 2019) [paper][code] - B2T2
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers. Hao Tan, Mohit Bansal. (EMNLP 2019) [paper][code]
  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou. (AAAI 2020) [paper]
  • FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval. Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Peng Li, Yi Wei, Yi Hu, Hao Wang. (SIGIR 2020) [paper]
  • UNITER: Learning UNiversal Image-TExt Representations. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu. (ECCV 2020) [paper][code]
  • Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao. (ECCV 2020) [paper][code]
  • VD-BERT: A Unified Vision and Dialog Transformer with BERT. Yue Wang, Shafiq Joty, Michael R. Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi. (EMNLP 2020) [paper][code]
  • CodeBERT: A Pre-Trained Model for Programming and Natural Languages. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou. (EMNLP 2020) [paper][code]
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations. Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai. (ICLR 2020) [paper][code]
  • ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. (CoRR 2020) [paper]
  • Large-Scale Adversarial Training for Vision-and-Language Representation Learning. Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu. (NeurIPS 2020) [paper] - VILLA

Multilingual

  • Cross-lingual Language Model Pretraining. Guillaume Lample, Alexis Conneau. (NeurIPS 2019) [paper][code] - XLM
  • Unsupervised Cross-lingual Representation Learning at Scale. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. (ACL 2020) [paper] - XLM-R
  • Multilingual Denoising Pre-training for Neural Machine Translation. Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. (CoRR 2020) [paper] - mBART
  • MultiFiT: Efficient Multi-lingual Language Model Fine-tuning. Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kardas, Sylvain Gugger, Jeremy Howard. (EMNLP 2019) [paper][code]
  • XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson. (CoRR 2020) [paper][code]
  • Pre-training via Paraphrasing. Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer. (CoRR 2020) [paper] - MARGE
  • WikiBERT Models: Deep Transfer Learning for Many Languages. Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, Filip Ginter. (CoRR 2020) [paper][code]
  • Language-agnostic BERT Sentence Embedding. Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang. (CoRR 2020) [paper] - LaBSE
  • Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information. Zehui Lin, Xiao Pan, Mingxuan Wang, Xipeng Qiu, Jiangtao Feng, Hao Zhou, Lei Li. (EMNLP 2020) [paper][code] - mRASP
  • mT5: A massively multilingual pre-trained text-to-text transformer. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. (CoRR 2020) [paper][code]
  • InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, Ming Zhou (CoRR 2020) [paper][code]

Knowledge

  • ERNIE: Enhanced Language Representation with Informative Entities. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu. (ACL 2019) [paper][code]
  • ERNIE: Enhanced Representation through Knowledge Integration. Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. (CoRR 2019) [paper]
  • Knowledge Enhanced Contextual Word Representations. Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, Noah A. Smith. (EMNLP 2019) [paper] - KnowBert
  • K-BERT: Enabling Language Representation with Knowledge Graph. Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, Ping Wang. (AAAI 2020) [paper][code]
  • KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, Jian Tang. (TACL 2020) [paper][code]
  • CoLAKE: Contextualized Language and Knowledge Embedding. Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang. (COLING 2020) [paper][code]
  • Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning. Tao Shen, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, Weizhu Chen. (EMNLP 2020) [paper]
  • K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu ji, Guihong Cao, Daxin Jiang, Ming Zhou. (CoRR 2020) [paper]
  • BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models. Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, Tong Xu. (EMNLP 2020) [paper]
  • JAKET: Joint Pre-training of Knowledge Graph and Language Understanding. Donghan Yu, Chenguang Zhu, Yiming Yang, Michael Zeng. (CoRR 2020) [paper]

Compression & Accelerating

  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin. (CoRR 2019) [paper]
  • Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System. Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, Daxin Jiang. (CoRR 2019) [paper] - MKDM
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao. (CoRR 2019) [paper]
  • Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (CoRR 2019) [paper]
  • Small and Practical BERT Models for Sequence Labeling. Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, Amelia Archer. (EMNLP 2019) [paper]
  • Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer. (AAAI 2020) [paper]
  • Patient Knowledge Distillation for BERT Model Compression. Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu. (EMNLP 2019) [paper] - BERT-PKD
  • Extreme Language Model Compression with Optimal Subwords and Shared Projections. Sanqiang Zhao, Raghav Gupta, Yang Song, Denny Zhou. (ICLR 2019) [paper]
  • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. [paper][code]
  • TinyBERT: Distilling BERT for Natural Language Understanding. Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. (ICLR 2019) [paper][code]
  • Q8BERT: Quantized 8Bit BERT. Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat. (NeurIPS 2019 Workshop) [paper]
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. (ICLR 2020) [paper][code]
  • Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. Mitchell A. Gordon, Kevin Duh, Nicholas Andrews. (ICLR 2020) [paper][PyTorch code]
  • Reducing Transformer Depth on Demand with Structured Dropout. Angela Fan, Edouard Grave, Armand Joulin. (ICLR 2020) [paper] - LayerDrop
  • Multilingual Alignment of Contextual Word Representations (ICLR 2020) [paper]
  • AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search. Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, Jingren Zhou. (IJCAI 2020) [paper] - AdaBERT
  • BERT-of-Theseus: Compressing BERT by Progressive Module Replacing. Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou. (CoRR 2020) [paper][pt code][tf code][keras code]
  • MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou. (CoRR 2020) [paper][code]
  • FastBERT: a Self-distilling BERT with Adaptive Inference Time. Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, Qi Ju. (ACL 2020) [paper][code]
  • MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou. (ACL 2020) [paper][code]
  • Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation. Bowen Wu, Huan Zhang, Mengyuan Li, Zongsheng Wang, Qihang Feng, Junhong Huang, Baoxun Wang. (CoRR 2020) [paper] - BiLSTM-SRA & LTD-BERT
  • Poor Man's BERT: Smaller and Faster Transformer Models. Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov. (CoRR 2020) [paper]
  • DynaBERT: Dynamic BERT with Adaptive Width and Depth. Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu. (CoRR 2020) [paper]
  • SqueezeBERT: What can computer vision teach NLP about efficient neural networks?. Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, Kurt W. Keutzer. (CoRR 2020) [paper]
  • Optimal Subarchitecture Extraction For BERT. Adrian de Wynter, Daniel J. Perry. (CoRR 2020) [paper][code] - Bort
  • TernaryBERT: Distillation-aware Ultra-low Bit BERT. Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu. (EMNLP 2020) [paper][code][blog]
  • BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance. Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin. (EMNLP 2020) [paper][code][blog]
  • EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP. Thierry Tambe, Coleman Hooper, Lillian Pentecost, En-Yu Yang, Marco Donato, Victor Sanh, Alexander M. Rush, David Brooks, Gu-Yeon Wei. (CoRR 2020) [paper]
  • LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding. Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui Liu, Xiaolong Li. (AAAI 2021) [paper][Chinese blog]
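
Many of the compression papers above (e.g., DistilBERT, BERT-PKD, TinyBERT, MiniLM) rely on knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label loss. The sketch below shows a generic Hinton-style combined loss only; each listed paper adds its own terms (hidden-state, embedding, or attention-map losses), and all names here are assumptions of this sketch.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation plus hard-label cross-entropy:
    alpha * T^2 * CE(teacher_T, student_T) + (1 - alpha) * CE(labels, student)."""
    T = temperature
    log_p_student = log_softmax(student_logits / T)
    p_teacher = np.exp(log_softmax(teacher_logits / T))
    # soft-target term (equals KL divergence up to a teacher-entropy constant)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # hard-label cross-entropy at T = 1
    log_p = log_softmax(student_logits)
    hard = -log_p[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha * soft + (1.0 - alpha) * hard

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 3))   # batch of 4 examples, 3 classes
teacher = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))
```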

Application

  • BERT for Joint Intent Classification and Slot Filling (CoRR 2019) [paper]
  • GPT-based Generation for Classical Chinese Poetry (CoRR 2019) [paper]
  • BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, Peng Jiang. (CIKM 2019) [paper][code]
  • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019) [paper][code]
  • Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring (ICLR 2020) [paper]
  • Pre-training Tasks for Embedding-based Large-scale Retrieval (ICLR 2020) [paper]
  • Keyword-Attentive Deep Semantic Matching (CoRR 2020) [paper & code] [post] - Keyword BERT
  • Unified Multi-Criteria Chinese Word Segmentation with BERT (CoRR 2020) [paper]
  • ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues (CoRR 2020) [paper][code]
  • Spelling Error Correction with Soft-Masked BERT (ACL 2020) [paper] - Soft-Masked BERT
  • DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering (ACL 2020) [paper][code] - DeFormer
  • BLEURT: Learning Robust Metrics for Text Generation (ACL 2020) [paper][code] - BLEURT
  • Context-Aware Document Term Weighting for Ad-Hoc Search (WWW 2020) [paper][code] - HDCT
  • E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce. Denghui Zhang, Zixuan Yuan, Yanchi Liu, Zuohui Fu, Fuzhen Zhuang, Pengyang Wang, Haifeng Chen, Hui Xiong. (CoRR 2020) [paper]
  • Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. Nandan Thakur, Nils Reimers, Johannes Daxenberger, Iryna Gurevych. (CoRR 2020) [paper]
  • CogLTX: Applying BERT to Long Texts. Ming Ding, Chang Zhou, Hongxia Yang, Jie Tang. (NeurIPS 2020) [paper][code]
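
Sentence-BERT above derives fixed-size sentence embeddings by mean-pooling an encoder's token vectors (respecting the attention mask) and comparing them with cosine similarity. Here is a small sketch of just that pooling and scoring step, with random vectors standing in for real encoder outputs; it is illustrative only, not the paper's implementation.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions (SBERT-style pooling).

    token_embeddings: (n_tokens, d) contextual vectors from an encoder
    attention_mask:   (n_tokens,)   1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# random vectors stand in for encoder outputs of two sentences
rng = np.random.default_rng(0)
emb_a, mask_a = rng.normal(size=(6, 16)), np.array([1, 1, 1, 1, 0, 0])
emb_b, mask_b = rng.normal(size=(5, 16)), np.array([1, 1, 1, 0, 0])
print(cosine(mean_pool(emb_a, mask_a), mean_pool(emb_b, mask_b)))
```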

Analysis & Tools

  • Probing Neural Network Comprehension of Natural Language Arguments (ACL 2019) [paper][code]
  • Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference (ACL 2019) [paper] [code]
  • To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks (RepL4NLP@ACL 2019) [paper]
  • BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. Alex Wang, Kyunghyun Cho. (NeuralGen 2019) [paper]
  • Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection (CICLing 2019) [paper]
  • Understanding the Behaviors of BERT in Ranking (CoRR 2019) [paper]
  • How to Fine-Tune BERT for Text Classification? (CoRR 2019) [paper]
  • What Does BERT Look At? An Analysis of BERT's Attention (BlackBoxNLP 2019) [paper][code]
  • Visualizing and Understanding the Effectiveness of BERT (EMNLP 2019) [paper]
  • exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models (CoRR 2019) [paper] [code]
  • Transformers: State-of-the-art Natural Language Processing [paper][code][code]
  • Do Attention Heads in BERT Track Syntactic Dependencies? [paper]
  • Fine-tune BERT with Sparse Self-Attention Mechanism (EMNLP 2019) [paper]
  • How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings (EMNLP 2019) [paper]
  • oLMpics -- On what Language Model Pre-training Captures (CoRR 2019) [paper]
  • Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment (AAAI 2020) [paper][code] - TextFooler
  • A Mutual Information Maximization Perspective of Language Representation Learning (ICLR 2020) [paper]
  • Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping (CoRR 2020) [paper]
  • How Much Knowledge Can You Pack Into the Parameters of a Language Model? (CoRR 2020) [paper]
  • A Primer in BERTology: What we know about how BERT works. Anna Rogers, Olga Kovaleva, Anna Rumshisky. (CoRR 2020) [paper]
  • BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations (CoRR 2020) [paper]
  • Contextual Embeddings: When Are They Worth It? (ACL 2020) [paper]
  • Weight Poisoning Attacks on Pre-trained Models (ACL 2020) [paper][code] - RIPPLe
  • Roles and Utilization of Attention Heads in Transformer-based Neural Language Models (ACL 2020) [paper][code] - Transformer Anatomy
  • Adversarial Training for Large Neural Language Models (CoRR 2020) [paper][code]
  • Cross-Lingual Ability of Multilingual BERT: An Empirical Study (ICLR 2020) [paper][code]
  • DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference (ACL 2020) [paper][code][huggingface implementation]
  • Beyond Accuracy: Behavioral Testing of NLP models with CheckList. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh. (ACL 2020 Best Paper) [paper][code]
  • Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith. (ACL 2020) [paper][code]
  • TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing. Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, Guoping Hu. (ACL 2020) [paper][code]
  • Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT. Zhiyong Wu, Yun Chen, Ben Kao, Qun Liu. (ACL 2020) [paper][pt code][keras code]
  • Rethinking Positional Encoding in Language Pre-training. Guolin Ke, Di He, Tie-Yan Liu. (CoRR 2020) [paper][code] - TUPE
  • Variance-reduced Language Pretraining via a Mask Proposal Network. Liang Chen. (CoRR 2020) [paper]
  • Does BERT Solve Commonsense Task via Commonsense Knowledge?. Leyang Cui, Sijie Cheng, Yu Wu, Yue Zhang. (CoRR 2020) [paper]
  • Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. Timo Schick, Hinrich Schütze. (CoRR 2020) [paper][code] - PET
  • It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. Timo Schick, Hinrich Schütze. (CoRR 2020) [paper][code]
  • Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification. Timo Schick, Helmut Schmid, Hinrich Schütze. (COLING 2020) [paper][code]
  • InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective. Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, Jingjing Liu. (CoRR 2020) [paper]
  • Self-training Improves Pre-training for Natural Language Understanding. Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau. (CoRR 2020) [paper][related blog]
  • Commonsense knowledge adversarial dataset that challenges ELECTRA. Gongqi Lin, Yuan Miao, Xiaoyong Yang, Wenwu Ou, Lizhen Cui, Wei Guo, Chunyan Miao. (ICARCV 2020) [paper]
  • Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining. Zijun Sun, Chun Fan, Xiaofei Sun, Yuxian Meng, Fei Wu, Jiwei Li. (CoRR 2020) [paper][code][blog]
  • Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, Xiangzhan Yu. (EMNLP 2020) [paper][code][blog]
  • To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging. Kasturi Bhattacharjee, Miguel Ballesteros, Rishita Anubhai, Smaranda Muresan, Jie Ma, Faisal Ladhak, Yaser Al-Onaizan. (EMNLP 2020) [paper]
  • Investigating Novel Verb Learning in BERT: Selectional Preference Classes and Alternation-Based Syntactic Generalization. Tristan Thrush, Ethan Wilcox, Roger Levy. (BlackboxNLP 2020) [paper][code]
  • On the Sentence Embeddings from Pre-trained Language Models. Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, Lei Li. (EMNLP 2020) [paper][code]
  • Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks. Tingyu Xia, Yue Wang, Yuan Tian, Yi Chang. (WWW 2021) [paper][code]
  • Muppet: Massive Multi-task Representations with Pre-Finetuning. Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, Sonal Gupta. (CoRR 2021) [paper]

Tutorial & Survey

  • Transfer Learning in Natural Language Processing. Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, Thomas Wolf. (NAACL 2019) [paper]
  • Evolution of Transfer Learning in Natural Language Processing. Aditya Malte, Pratik Ratadiya. (CoRR 2019) [paper]
  • Transferring NLP Models Across Languages and Domains. Barbara Plank. (DeepLo 2019) [slides]
  • Recent Breakthroughs in Natural Language Processing. Christopher Manning (BAAI 2019) [slides]
  • Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang. (Invited Review of Science China Technological Sciences 2020) [paper]
  • Embeddings in Natural Language Processing. Mohammad Taher Pilehvar, Jose Camacho-Collados. (2020) [book]

Repository

Chinese Blog

English Blog
