utility class for building/evaluating document representations

text_embedding

This repository contains a fast, scalable, highly parallel Python implementation of the GloVe [1] algorithm for word embeddings (found in solvers.py), as well as code and scripts to recreate the downstream-task results of the unsupervised DisC embedding paper. An overview of the latter is provided in this blog post at OffConvex.

If you find this code useful, please cite the following:

@inproceedings{arora2018sensing,
  title={A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs},
  author={Arora, Sanjeev and Khodak, Mikhail and Saunshi, Nikunj and Vodrahalli, Kiran},
  booktitle={Proceedings of the 6th International Conference on Learning Representations (ICLR)},
  year={2018}
}

GloVe implementation

An implementation of the GloVe optimization algorithm (as well as code to build the vocab and cooccurrence files, optimize the related SN objective [2], and optimize a source-regularized objective for domain adaptation) can be found in solvers.py. The code scales to an arbitrary number of processors with virtually no memory/communication overhead. In terms of problem size, the code's time and memory complexity scale linearly with the number of nonzero entries in the (sparse) cooccurrence matrix.
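For reference, the objective being optimized is the weighted least-squares fit of log-cooccurrence counts from the GloVe paper [1]. The sketch below is a minimal NumPy illustration of that loss over the nonzero entries, not the parallel implementation in solvers.py; the weighting parameters x_max=100 and alpha=0.75 are the defaults from [1] and need not match the defaults used here.

```python
import numpy as np

def glove_loss(W, C, b_w, b_c, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over nonzero cooccurrences.

    W, C     : (V, d) word and context embedding matrices
    b_w, b_c : (V,) word and context biases
    X        : iterable of (i, j, count) nonzero cooccurrence entries
    """
    loss = 0.0
    for i, j, x_ij in X:
        weight = min((x_ij / x_max) ** alpha, 1.0)           # f(x_ij) from [1]
        err = W[i] @ C[j] + b_w[i] + b_c[j] - np.log(x_ij)   # model vs. log count
        loss += weight * err ** 2
    return loss
```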

On a 32-core computer, 25 epochs of AdaGrad run in 3.8 hours on Wikipedia cooccurrences with vocab size ~80K. The original C implementation runs in 2.8 hours on 32 cores. We also implement the option to use regular SGD, which requires about twice as many iterations to reach the same loss; however, the per-iteration complexity is much lower, and on the same 32-core computer 50 epochs finish in 2.0 hours.

Note that our code takes as input an upper-triangular, zero-indexed cooccurrence matrix rather than the full, one-indexed cooccurrence matrix used by the original GloVe code. To convert to our (more disk-efficient) format, you can use the method reformat_coocfile in solvers.py. We also support direct, parallel computation of the vocab and cooccurrence files.
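To illustrate what the format difference amounts to, the toy snippet below keeps only the upper triangle (i <= j) of a symmetric, one-indexed list of (i, j, count) entries and shifts the indices to zero-based. It is a hypothetical in-memory sketch for small examples, not a substitute for reformat_coocfile, whose actual signature and on-disk layout are not reproduced here.

```python
def to_upper_triangular_zero_indexed(entries):
    """Keep only upper-triangular (i <= j), zero-indexed entries of a
    symmetric, one-indexed cooccurrence list (the lower triangle is redundant)."""
    out = []
    for i, j, count in entries:
        i, j = i - 1, j - 1        # one-indexed -> zero-indexed
        if i <= j:                 # drop the redundant lower triangle
            out.append((i, j, count))
    return out

# toy example: (2, 1) is dropped because it duplicates (1, 2)
print(to_upper_triangular_zero_indexed([(1, 1, 5.0), (1, 2, 3.0), (2, 1, 3.0)]))
# [(0, 0, 5.0), (0, 1, 3.0)]
```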

Dependencies: numpy, numba, SharedArray

Optional: h5py, mpi4py*, scipy, scikit-learn

* required for parallelism; MPI can be easily installed on Linux, Mac, and Windows Subsystem for Linux

DisC embeddings

Scripts to recreate the results in the paper are provided in the directory scripts-AKSV2018. 1600-dimensional GloVe embeddings trained on the Amazon Product Corpus [3] are provided here.
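If you want to use the pretrained vectors in your own code, they follow the standard GloVe text format (one token followed by its vector per line). The loader below is a minimal sketch under that assumption; the file name is a placeholder for the downloaded file, and this is not the loading code used by the scripts in scripts-AKSV2018.

```python
import numpy as np

def load_glove_text(path):
    """Load GloVe-style text embeddings: one 'token v1 v2 ... vd' per line."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# example usage (path is a placeholder for the downloaded 1600-d vectors)
# vecs = load_glove_text('amazon_glove1600.txt')
# print(vecs['good'].shape)  # (1600,)
```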

Dependencies: nltk, numpy, scipy, scikit-learn

Optional: tensorflow

References:

[1] Pennington et al., "GloVe: Global Vectors for Word Representation," EMNLP, 2014.

[2] Arora et al., "A Latent Variable Model Approach to PMI-based Word Embeddings," TACL, 2016.

[3] McAuley et al., "Inferring networks of substitutable and complementary products," KDD, 2015.