Lotem Peled, Roi Reichart (pdf)
This repository contains the Sarcasm SIGN dataset, a parallel corpus of sarcastic tweets and their non-sarcastic interpretations, written by human experts. The corpus was created as part of our paper, "Sarcasm SIGN: Interpreting Sarcasm with Sentiment Based Monolingual Machine Translation", which will be presented at ACL 2017. The repository contains two folders: "corpus", which contains the data files and the instructions given to our human experts; and "preprocess", which contains code for preprocessing the data and preparing it for an MT system (see the ReadMe in the "preprocess" folder).
The Sarcasm SIGN dataset comprises 3000 sarcastic tweets (tweets marked with #sarcasm) that are written in English, are not retweets, and do not contain URLs or images. Each sarcastic tweet has five different non-sarcastic interpretations. The average sarcastic tweet length is 13.87 words, the average interpretation length is 12.10 words, and the vocabulary size is 8788 unique words. Following are two examples from our dataset:
Further information regarding the dataset and the instructions given to the human experts can be found in the "corpus" folder.
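Corpus statistics of the kind quoted above (average length in words, vocabulary size) can be recomputed with a few lines of Python. The following is only a minimal sketch: the function name `corpus_stats`, the in-memory list of (tweet, interpretation) pairs, the whitespace tokenization, and the lowercasing are all assumptions made for illustration, and the two toy pairs are made-up placeholders that are not taken from the dataset. The actual file format is described in the "corpus" folder, and the tokenization used in the paper may differ.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute simple statistics over (sarcastic_tweet, interpretation) pairs.

    Assumption: words are obtained by whitespace splitting and the vocabulary
    is case-insensitive. The paper's own tokenization may differ.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (tweet, interpretation) pair")

    tweet_lengths = [len(tweet.split()) for tweet, _ in pairs]
    interp_lengths = [len(interp.split()) for _, interp in pairs]

    # Vocabulary over both sides of the parallel corpus.
    vocabulary = set()
    for tweet, interp in pairs:
        vocabulary.update(word.lower() for word in tweet.split())
        vocabulary.update(word.lower() for word in interp.split())

    return {
        "avg_tweet_len": sum(tweet_lengths) / len(tweet_lengths),
        "avg_interpretation_len": sum(interp_lengths) / len(interp_lengths),
        "vocab_size": len(vocabulary),
    }


# Hypothetical toy pairs for illustration only (not from the dataset).
toy_pairs = [
    ("what a great day #sarcasm", "what a bad day"),
    ("i love mondays #sarcasm", "i hate mondays"),
]
stats = corpus_stats(toy_pairs)
print(stats)
```

To apply this to the real data, one would replace `toy_pairs` with pairs read from the files in the "corpus" folder, repeating each sarcastic tweet once per interpretation if every interpretation should be counted.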
We encourage researchers to send us their algorithms and results, and we will present them here.
If you use the Sarcasm SIGN dataset and/or algorithm, please cite the following:
Peled, Lotem, and Roi Reichart. "Sarcasm SIGN: Interpreting Sarcasm with Sentiment Based Monolingual Machine Translation." (ACL 2017).
For questions, inquiries, or interesting ideas, feel free to contact us.
Lotem: lotemi.peled@gmail.com || https://sites.google.com/view/lotempeled/
Roi: roiri@ie.technion.ac.il || https://ie.technion.ac.il/~roiri/