
Dataset and code of our EMNLP 2019 Paper (Multilingual and Multi-Aspect Hate Speech Analysis)

For more details about our dataset, please check our paper:

	title = "Multilingual and Multi-Aspect Hate Speech Analysis",
	author = "Ousidhoum, Nedjma
		and Lin, Zizheng
		and Zhang, Hongming
		and Song, Yangqiu
		and Yeung, Dit-Yan",
	booktitle = "Proceedings of EMNLP",
	year = "2019",
	publisher = "Association for Computational Linguistics",

(You can preview our paper on


The multi-labeled tasks are the hostility type of the tweet and the annotator's sentiment. (We kept the labels on which at least two annotators agreed.)


In further experiments that involved binary classification tasks of the hostility/hate/abuse type, we considered single-labeled normal instances to be non-hate/non-toxic and all the other instances to be toxic.
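The binary mapping described above can be sketched as follows. This is a minimal illustration rather than the code used for the paper: the function name `to_binary_label`, the label string `normal`, and the underscore separator for multi-labeled values are all assumptions, so check them against the csv files before relying on it.

```python
def to_binary_label(hostility_label, separator="_", normal_label="normal"):
    """Map a (possibly multi-labeled) hostility value to 'non-toxic' or 'toxic'.

    Only instances whose single label is `normal_label` count as non-toxic;
    everything else, including 'normal' combined with another label, is toxic.
    The separator and the label spelling are assumptions, not taken from the data.
    """
    labels = [part.strip().lower() for part in hostility_label.split(separator) if part.strip()]
    if labels == [normal_label]:
        return "non-toxic"
    return "toxic"


# Made-up label values used only to illustrate the rule.
examples = ["normal", "offensive", "offensive_hateful", "normal_offensive"]
binary = [to_binary_label(value) for value in examples]
```

Note that a value combining `normal` with another label is treated as toxic here, because only single-labeled normal instances are considered non-toxic.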


Our dataset is composed of three csv files sorted by language. They contain the tweets and the annotations described in our paper:

the hostility type (column: tweet sentiment)

hostility directness (column: directness)

target attribute (column: target)

target group (column: group)

annotator's sentiment (column: annotator sentiment).
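As a quick way to inspect these annotations, the sketch below counts the label values per column. It assumes the csv headers are spelled exactly as the column names listed above, which may not match the files; `count_labels` and the in-memory sample rows are hypothetical and only stand in for one of the csv files.

```python
import csv
import io
from collections import Counter

# Column names as listed above; the actual csv headers may be spelled differently.
ANNOTATION_COLUMNS = ["tweet sentiment", "directness", "target", "group", "annotator sentiment"]


def count_labels(csv_file, columns=ANNOTATION_COLUMNS):
    """Return {column: Counter of label values} for an open csv file object."""
    counts = {column: Counter() for column in columns}
    for row in csv.DictReader(csv_file):
        for column in columns:
            if column in row:  # skip columns missing from this file
                counts[column][row[column]] += 1
    return counts


# Tiny in-memory stand-in for one of the csv files (made-up rows, not real data).
sample = io.StringIO(
    "tweet,tweet sentiment,directness\n"
    "some text,normal,indirect\n"
    "other text,offensive,direct\n"
    "more text,normal,indirect\n"
)
counts = count_labels(sample)
```

For a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.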


To replicate our experiments, please follow the guidelines below.


Python 3.6 onwards,

DyNet 0.0.0 and its dependencies (follow the instructions on

[On a side note, when you install DyNet make sure that you are using CUDA 9 and CUDNN for CUDA 9. I used the following command:

	CUDNN_ROOT=/path/to/conda/pkgs/cudnn-7.3.1-cuda10.0_0 \
	BACKEND=/path/to/conda/pkgs/cudatoolkit-10.0.130-0 \
	pip install git+ 

Using CUDA 10 will generate an error when calling DyNet for GPUs.]

Cross-lingual word embeddings (Babylon or MUSE; the reported results were obtained using Babylon).

Python files

  • contains a normalization function that cleans the content of the tweets.

  • defines constants used across all the files.

  • utility functions for data processing.

  • allows you to run majority voting and logistic regression by calling:

	run_majority_voting(train_filename, dev_filename, test_filename, attribute)

	run_logistic_regression(train_filename, dev_filename, test_filename, attribute)

on csv files in the same format as the dataset.

  • contains classes for sequence predictors and layers.

  • script to train, load, and evaluate SluiceNetwork.

  • the main logic for the SluiceNetwork (Ruder et al. 2017). More details on the implementation of Sluice networks can be found here.
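To illustrate the idea behind the majority-voting baseline mentioned above, here is a minimal sketch. It is not the repository's `run_majority_voting`: the helper names are hypothetical, it works on plain label lists instead of csv files, and it reports accuracy only for simplicity.

```python
from collections import Counter


def majority_label(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def majority_accuracy(train_labels, test_labels):
    """Accuracy obtained by predicting the majority training label for every test item."""
    prediction = majority_label(train_labels)
    correct = sum(1 for label in test_labels if label == prediction)
    return correct / len(test_labels)


# Made-up labels for a single attribute (e.g. directness).
train = ["direct", "indirect", "indirect", "indirect"]
test = ["indirect", "direct", "indirect", "indirect"]
accuracy = majority_accuracy(train, test)
```

A baseline like this gives a floor against which the logistic regression and Sluice network results can be compared.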

How to run the program

To save and load the trained model, create a directory (e.g., model/) and pass its name to the --model-dir argument on the command line.

To save the training and evaluation log files, create a directory (e.g., log/) and pass its name to the --log-dir argument on the command line.


	python --dynet-autobatch 1 --dynet-gpus 3 --dynet-seed 123 \
		--h-layers 1 \
		--num-subspaces 2 --constraint-weight 0.1 \
		--constrain-matrices 1 2 --patience 3 \
		--languages ar en fr \
		--test-languages ar en fr \
		--model-dir model/ --log-dir log/ \
		--task-names annotator_sentiment sentiment directness group target \
		--train-dir '/path/to/train' \
		--dev-dir '/path/to/dev' \
		--test-dir '/path/to/test' \
		--embeds babylon --h-dim 200 \
		--cross-stitch-init-scheme imbalanced \
		--threshold 0.1


  • The meaning of each argument can be found in

  • '--task-names' takes a list of task names (e.g., annotator_sentiment).

  • '--languages' lists the languages whose datasets are used for training.

  • '--test-languages' must be a subset of '--languages'.
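Before launching a run, the two requirements above (existing output directories, and test languages being a subset of the training languages) can be checked with a few lines of Python. This is a sketch under assumptions: `prepare_run` is a hypothetical helper and is not part of the repository.

```python
import os
import tempfile


def prepare_run(model_dir, log_dir, languages, test_languages):
    """Create the output directories and check the language arguments."""
    unknown = sorted(set(test_languages) - set(languages))
    if unknown:
        raise ValueError("test languages not in --languages: " + ", ".join(unknown))
    for directory in (model_dir, log_dir):
        os.makedirs(directory, exist_ok=True)  # no error if it already exists


# Example with a temporary location so nothing is written to the working tree.
base = tempfile.mkdtemp()
prepare_run(os.path.join(base, "model"), os.path.join(base, "log"),
            languages=["ar", "en", "fr"], test_languages=["en", "fr"])
```

The directory names passed here would then be the values given to --model-dir and --log-dir.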

