GitHub - DevSinghSachan/investigating-text-classifiers: Investigating the Working of Text Classifiers

This repository contains the data and code of the paper: Investigating the Working of Text Classifiers, COLING 2018

Datasets

ACL-IMDB: This is a benchmark dataset (Maas et al., 2011) of IMDB movie reviews for coarse-grained sentiment analysis in which the task is to determine if a review belongs to the positive or to the negative class. The original random split version of this dataset contains an equal number of positive and negative reviews. To construct its lexicon-based version, we apply our approach to the combined training and test splits of this dataset. This dataset is included with this repo.
IMDB reviews: This is a much larger version of the IMDB movie reviews dataset, intended for fine-grained sentiment analysis. We collected more than 2.5 million reviews from the IMDB website and partitioned them into five classes based on their ratings out of 10. These classes are most-negative, negative, neutral, positive, and most-positive. This dataset can be obtained here: URL
Arxiv abstracts: This is a new multiclass topic classification dataset. It was constructed by collecting more than 1 million abstracts of scientific papers from the website “arxiv.org”. Each paper has one primary category such as cs.AI, stat.ML, etc. that we use as its class label. We selected those primary categories that had at least 500 papers. To extract text data, we use the title and abstract of each paper. This dataset can be obtained here: URL
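
The Arxiv construction described above (keep primary categories with at least 500 papers, use title plus abstract as the text) could be sketched roughly as follows. This is an illustration only, not the code in `dataset_construction`: the record keys `title`, `abstract`, and `primary_category`, the function name `build_examples`, and joining title and abstract with a single space are all assumptions.

```python
from collections import Counter

# Threshold stated in the dataset description above.
MIN_PAPERS_PER_CATEGORY = 500


def build_examples(papers, min_papers=MIN_PAPERS_PER_CATEGORY):
    """Return (text, label) pairs for papers in sufficiently large categories.

    `papers` is an iterable of dicts with the hypothetical keys
    'title', 'abstract', and 'primary_category'.
    """
    papers = list(papers)
    # Count how many papers each primary category has.
    counts = Counter(p["primary_category"] for p in papers)
    kept = {cat for cat, n in counts.items() if n >= min_papers}
    # Text is title + abstract; the single-space separator is an assumption.
    return [
        (p["title"].strip() + " " + p["abstract"].strip(), p["primary_category"])
        for p in papers
        if p["primary_category"] in kept
    ]
```

Passing a smaller `min_papers` makes the function easy to try on a handful of toy records.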

In both the Arxiv and IMDB datasets, the ratio of the test set size to the training set size is 0.6.
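
A test-to-train ratio of 0.6 means the test set holds 0.6 / (1 + 0.6) = 37.5% of all examples. A minimal sketch of such a split is shown below. Shuffling with a fixed seed, the function name `split_by_ratio`, and the lack of per-class stratification are assumptions; the README does not say how the actual split was drawn.

```python
import random

# Ratio of test set size to training set size, as stated above.
TEST_TO_TRAIN_RATIO = 0.6


def split_by_ratio(examples, ratio=TEST_TO_TRAIN_RATIO, seed=0):
    """Shuffle `examples` and return (train, test) with len(test)/len(train) ~= ratio."""
    items = list(examples)
    # Seeded shuffle so the split is reproducible (the seed is hypothetical).
    random.Random(seed).shuffle(items)
    # Convert the test:train ratio into a fraction of the whole dataset.
    test_fraction = ratio / (1.0 + ratio)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]
```

For example, `split_by_ratio(range(1600))` yields 1000 training and 600 test examples.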

Dataset Statistics

Citation

If you find the data or code useful, please consider citing our paper as:

@InProceedings{sachan2018investigating,
  author = 	"Sachan, Devendra
		and Zaheer, Manzil
		and Salakhutdinov, Ruslan",
  title = 	"Investigating the Working of Text Classifiers",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"2120--2131",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1180"
}
