

This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

The code is written in Python and requires numpy, scipy, pickle, sklearn, theano, and the lasagne library. Some functions/classes are based on John Wieting's code for the paper "Towards Universal Paraphrastic Sentence Embeddings" (Thanks John!). The example data sets were also preprocessed using that code.


To install all dependencies, using virtualenv is suggested:

$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt 

Get started

To get started, cd into the directory examples/ and run the script provided there. It downloads the pretrained GloVe word embeddings, and then runs the following scripts:

  • a demo of how to generate sentence embeddings using the SIF weighting scheme,
  • two scripts for the textual similarity tasks in the paper,
  • a script for the supervised tasks in the paper.

Check these files to see the options.
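
The SIF weighting scheme that the demo illustrates can be sketched in a few lines of numpy. This is a minimal illustration based on the description in the paper, not the code shipped in this repository: the function name sif_embeddings, its arguments, and the default a=1e-3 are assumptions made for this example. Each word is weighted by a / (a + p(w)), where p(w) is its estimated unigram probability, the weighted vectors are averaged per sentence, and the projection onto the first singular vector of the resulting matrix is removed.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Hypothetical helper, not part of the repository.

    sentences:    list of token lists
    word_vectors: dict mapping token -> 1-D numpy array
    word_prob:    dict mapping token -> estimated unigram probability p(w)
    a:            smoothing parameter of the SIF weight a / (a + p(w))
    """
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        known = [t for t in tokens if t in word_vectors]
        if not known:
            continue  # sentences without known words keep an all-zero row
        # Frequent words get small weights, rare words get weights close to 1.
        weights = np.array([a / (a + word_prob.get(t, 0.0)) for t in known])
        vectors = np.stack([word_vectors[t] for t in known])
        emb[i] = weights @ vectors / len(known)
    # Common component removal: subtract the projection of every sentence
    # vector onto the first right singular vector of the (uncentered) matrix.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)
```

In practice word_vectors would be loaded from the GloVe file and word_prob estimated from word counts in a large corpus; both are passed in as plain dictionaries here to keep the sketch self-contained.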

Source code

The code is separated into the following parts:

  • SIF embedding: the SIF weighting scheme is very simple and is implemented in a few lines.
  • textual similarity tasks: data_io provides the code for reading the data, eval is for evaluating the performance, and sim_algo provides the code for our sentence embedding algorithm.
  • supervised tasks: train provides the entry point for training the models (proj_model_sim is for the similarity and entailment tasks, and proj_model_sentiment is for the sentiment task). Check the training entry point to see the options.
  • utilities: these provide utility functions/classes for the above two parts.
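
For the textual similarity tasks, a sentence pair is typically scored by the cosine similarity of its two embeddings, and the scores are compared with human ratings via Pearson's correlation. The sketch below shows that evaluation step under these assumptions; cosine_scores and pearson_r are hypothetical names chosen for illustration and are not the interface of the eval module.

```python
import numpy as np

def cosine_scores(emb_a, emb_b, eps=1e-12):
    """Row-wise cosine similarity between two (n_pairs, dim) matrices."""
    num = np.sum(emb_a * emb_b, axis=1)
    den = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    # eps guards against division by zero for all-zero embeddings.
    return num / np.maximum(den, eps)

def pearson_r(predicted, gold):
    """Pearson correlation between predicted scores and gold ratings."""
    return float(np.corrcoef(predicted, gold)[0, 1])
```

Here emb_a[i] and emb_b[i] are the embeddings of the two sentences in pair i, and gold holds the corresponding human similarity ratings.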


For technical details and full experimental results, see the paper.

	author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma}, 
	title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings}, 
	booktitle = {International Conference on Learning Representations},
	year = {2017}


sentence embedding by Smooth Inverse Frequency weighting scheme



