A curated list of pretrained sentence (and word) embedding models
Latest commit bce56e4 Jan 15, 2019

awesome-sentence-embedding

About This Repo

  • There are already some awesome-lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
  • This repo will also be incomplete, but I try my best to find and include all the papers with pretrained models.
  • This is not a typical awesome list because it has tables, but I guess that's OK and much better than just a huge list.
  • If you find any mistakes, another paper, or anything else, please send a pull request and help me keep this list up to date.
  • To be honest, I'm not 100% sure how to represent this data. If you think there is a better way (for example, by changing the table headers), please send a pull request and let us discuss it.
  • Enjoy!

General Framework

  • Almost all sentence embeddings work like this:
  • Given some sort of word embeddings and an optional encoder (for example, an LSTM), they obtain contextualized word embeddings.
  • Then they define some sort of pooling (it can be as simple as last pooling, that is, keeping only the vector of the last token).
  • Based on that, they either use the pooled vector directly for a supervised classification task (like InferSent) or generate a target sequence (like Skip-Thought).
  • So, in general, there are many sentence embeddings you have never heard of: simply mean-pooling over any word embedding already gives you a sentence embedding!
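
The mean-pooling case from the last bullet can be sketched in a few lines. This is only an illustration: the toy embedding table, its 3-dimensional vectors, and the `mean_pool` helper are made up for the example and do not come from any of the models listed here.

```python
import numpy as np

# Toy embedding table; the words and 3-dimensional vectors are invented for illustration.
EMBEDDINGS = {
    "the": np.array([0.1, 0.0, 0.2]),
    "cat": np.array([0.7, 0.3, 0.1]),
    "sat": np.array([0.2, 0.9, 0.4]),
}
DIM = 3


def mean_pool(tokens, table, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Whitespace tokenization is an assumption; real pipelines usually tokenize more carefully.
sentence_vec = mean_pool("the cat sat".split(), EMBEDDINGS, DIM)
```

Swapping `np.mean` for `np.max` over the same axis would give max-pooling instead, and taking `vectors[-1]` would give last pooling.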

Word Embeddings

  • Note: don't worry about the language of the code. You can almost always (except for the subword models) just load the pretrained embedding table in the framework of your choice and ignore the training code.
| paper | training code | pretrained models |
| --- | --- | --- |
| GloVe: Global Vectors for Word Representation | C(official) | GloVe |
| Efficient Estimation of Word Representations in Vector Space | C(official) | Word2Vec |
| Enriching Word Vectors with Subword Information | C++(official) | fastText |
| BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages | Python(official) | bpemb |
| ConceptNet 5.5: An Open Multilingual Graph of General Knowledge | Python(official) | Numberbatch |
| Non-distributional Word Vector Representations | Python(official) | WordFeat |
| Sparse Overcomplete Word Vector Representations | C++(official) | - |
| A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks | | charNgram2vec |
| Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations | GO(official) | lexvec |
| Hash Embeddings for Efficient Word Representations | | - |
| Dependency-Based Word Embeddings | | word2vecf |
| Learning Word Meta-Embeddings | - | Meta-Emb(broken) |
| Dict2vec : Learning Word Embeddings using Lexical Dictionaries | C++(official) | Dict2vec |
| Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints | TF(official) | Attract-Repel |
| Siamese CBOW: Optimizing Word Embeddings for Sentence Representations | | Siamese CBOW |
| Offline bilingual word vectors, orthogonal transformations and the inverted softmax | Python(official) | - |
| From Paraphrase Database to Compositional Paraphrase Model and Back | Theano(official) | PARAGRAM |
| Poincaré Embeddings for Learning Hierarchical Representations | Pytorch(official) | - |
| Dynamic Meta-Embeddings for Improved Sentence Representations | Pytorch(official) | DME/CDME |
| WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models | - | RusVectōrēs |
| Swivel: Improving Embeddings by Noticing What's Missing | TF(official) | - |
| Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings | - | ChineseEmbedding |
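
As the note at the top of this section says, a pretrained embedding table can usually be used without touching the training code. Here is a minimal sketch of loading one, assuming the plain-text layout that GloVe distributes (one word per line followed by its space-separated vector components, with no header line). The `load_text_embeddings` helper and the two sample lines are hypothetical. Other releases in the table may use binary formats or add a header line, so check each model's own documentation.

```python
import io

import numpy as np


def load_text_embeddings(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a word index and a matrix."""
    words, rows = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    matrix = np.asarray(rows, dtype=np.float32)
    index = {w: i for i, w in enumerate(words)}
    return index, matrix


# In-memory stand-in for a real file; with a downloaded file, pass an open file handle instead.
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
word_to_row, table = load_text_embeddings(sample)
```

The resulting `table` can be copied into the embedding layer of whichever framework you prefer, with `word_to_row` serving as the vocabulary.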

OOV Handling

Contextualized Word Embeddings

  • Note: all the unofficial models can load the official pretrained models
| paper | code | pretrained models |
| --- | --- | --- |
| Learned in Translation: Contextualized Word Vectors | | CoVe |
| Universal Language Model Fine-tuning for Text Classification | Pytorch(official) | ULMFit(English, Zoo) |
| Deep contextualized word representations | | ELMO(AllenNLP, TF-Hub) |
| Improving Language Understanding by Generative Pre-Training | | Transformer |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | | BERT |
| Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation | Pytorch(official) | ELMo |
| Contextual String Embeddings for Sequence Labeling | Pytorch(official) | Flair |
| Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context | | Transformer-XL |

Pooling Methods

Encoders

| paper | code | name |
| --- | --- | --- |
| An efficient framework for learning sentence representations | TF(official, pretrained) | Quick-Thought |
| Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm | | DeepMoji |
| Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | Pytorch(official, pretrained) | InferSent |
| Learning Joint Multilingual Sentence Representations with Neural Machine Translation | Pytorch(official, pretrained) | LASER |
| Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | - | LASER++ |
| Learning general purpose distributed sentence representations via large scale multi-task learning | Pytorch(official, pretrained) | GenSen |
| Distributed Representations of Sentences and Documents | | Doc2Vec |
| Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features | C++(official, pretrained) | Sent2Vec |
| Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books | | SkipThought |
| Learning to Generate Reviews and Discovering Sentiment | | SentimentNeuron |
| From Word Embeddings to Document Distances | C,Python(official) | Word Mover's Distance |
| Word Mover's Embedding: From Word2Vec to Document Embedding | C,Python(official) | WordMoversEmbeddings |
| Convolutional Neural Network for Universal Sentence Embeddings | Theano(official, pretrained) | CSE |
| Towards Universal Paraphrastic Sentence Embeddings | Theano(official, pretrained) | ParagramPhrase |
| Charagram: Embedding Words and Sentences via Character n-grams | Theano(official, pretrained) | Charagram |
| Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings | Theano(official, pretrained) | GRAN |
| Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations | Theano(official, pretrained) | para-nmt |
| Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models | | VSE |
| VSE++: Improving Visual-Semantic Embeddings with Hard Negatives | Pytorch(official, pretrained) | VSE++ |
| End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions | Theano(official, pretrained) | DEISTE |
| Learning Universal Sentence Representations with Mean-Max Attention Autoencoder | TF(official, pretrained) | Mean-MaxAAE |
| BioSentVec: creating sentence embeddings for biomedical texts | Python(official, pretrained) | BioSentVec |
| DisSent: Learning Sentence Representations from Explicit Discourse Relations | Pytorch(official, email_for_pretrained) | DisSent |
| Universal Sentence Encoder | TF-Hub(official, pretrained) | USE |
| Learning Distributed Representations of Sentences from Unlabelled Data | Python(official) | FastSent |
| Embedding Text in Hyperbolic Spaces | TF(official) | HyperText |
| StarSpace: Embed All The Things! | C++(official) | StarSpace |
| A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks | Pytorch(official, pretrained) | HMTL |

Evaluation

Misc

Vector Mapping

Articles
