DevSinghSachan/investigating-text-classifiers

This repository contains the data and code for the paper "Investigating the Working of Text Classifiers" (COLING 2018).

Datasets

  • ACL-IMDB: This is a benchmark dataset (Maas et al., 2011) of IMDB movie reviews for coarse-grained sentiment analysis in which the task is to determine if a review belongs to the positive or to the negative class. The original random split version of this dataset contains an equal number of positive and negative reviews. To construct its lexicon-based version, we apply our approach to the combined training and test splits of this dataset. This dataset is included with this repo.

  • IMDB reviews: This is a much larger version of the IMDB movie reviews dataset in which the task is fine-grained sentiment analysis. We collected more than 2.5 million reviews from the IMDB website and partitioned them into five classes based on their ratings out of 10. These classes are most-negative, negative, neutral, positive, and most-positive. This dataset can be obtained here: URL

  • Arxiv abstracts: This is a new multiclass topic classification dataset. It was constructed by collecting more than 1 million abstracts of scientific papers from the website “arxiv.org”. Each paper has one primary category such as cs.AI, stat.ML, etc. that we use as its class label. We selected those primary categories that had at least 500 papers. To extract text data, we use the title and abstract of each paper. This dataset can be obtained here: URL
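The category filter described for the Arxiv abstracts can be sketched as below. This is a minimal illustration, not the code used to build the released dataset: the record fields `title`, `abstract`, and `primary_category`, the function name `filter_by_category`, and joining title and abstract with a single space are all assumptions.

```python
from collections import Counter


def filter_by_category(papers, min_papers=500):
    """Keep papers whose primary category occurs at least `min_papers` times.

    `papers` is a list of dicts with the (hypothetical) keys
    "title", "abstract", and "primary_category".
    Returns a list of (text, label) pairs.
    """
    # Count how many papers fall under each primary category.
    counts = Counter(p["primary_category"] for p in papers)

    examples = []
    for p in papers:
        label = p["primary_category"]
        if counts[label] >= min_papers:
            # Assumption: the text is the title followed by the abstract.
            text = p["title"].strip() + " " + p["abstract"].strip()
            examples.append((text, label))
    return examples
```

Calling `filter_by_category(papers)` with the default threshold keeps only categories with at least 500 papers, which matches the selection rule stated above.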

In both the Arxiv and IMDB datasets, the ratio of the test set size to the training set size is 0.6.
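A test-to-train ratio of 0.6 means the test set holds 0.6 / 1.6 = 37.5% of all examples. The sketch below shows that arithmetic with a plain shuffled split. The function name `split_by_ratio` and the seeded random shuffle are assumptions made for illustration, and the sketch does not reproduce the lexicon-based construction mentioned above.

```python
import random


def split_by_ratio(examples, test_to_train=0.6, seed=0):
    """Split examples so that len(test) / len(train) is about `test_to_train`."""
    items = list(examples)
    # Assumption: a seeded random shuffle; the released splits may differ.
    random.Random(seed).shuffle(items)

    # test / train = r  implies  test / total = r / (1 + r).
    test_fraction = test_to_train / (1.0 + test_to_train)
    n_test = round(len(items) * test_fraction)

    test = items[:n_test]
    train = items[n_test:]
    return train, test
```

For example, 1,600 examples split this way give 1,000 training and 600 test examples.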

Dataset Statistics

[Image: dataset_stat]

Citation

If you find the data or code useful, please consider citing our paper as:

@InProceedings{sachan2018investigating,
  author = 	"Sachan, Devendra
		and Zaheer, Manzil
		and Salakhutdinov, Ruslan",
  title = 	"Investigating the Working of Text Classifiers",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"2120--2131",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1180"
}