This repository provides the data and code related to the following ACL 2022 publication:
Nejadgholi, I., Fraser, K. C., and Kiritchenko, S. (2022). Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/pdf/2204.02261.pdf
As described in the paper, we annotated the Hostile class of the dev set of the East-Asian Prejudice (EA) dataset and the Anti-Asian Hate class of the COVID-HATE (CH) dataset for implicit/explicit abuse. Our annotations are available in the Data folder:
CH_Anti_Asian_hate_implicit_indexes.csv
and CH_Anti_Asianhate_explicit_indexes.csv
include the indexes of the implicitly and explicitly hateful samples in the Anti-Asian Hate class of the CH dataset, respectively. These indexes refer to rows of the annotations.csv
file from the original dataset.
EA_dev_hostile_implicit_ids.csv
and EA_dev_hostile_explicit_ids.csv
include the tweet IDs of the implicitly and explicitly hostile samples of the EA dev set, respectively.
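As a rough illustration of how these annotation files might be used, the sketch below filters the original COVID-HATE annotations down to the annotated implicit and explicit subsets with pandas. The file paths and the column name "index" are assumptions; check the actual CSV headers before use.

```python
import pandas as pd

# Hypothetical paths and column names -- adjust to the actual files.
ch_annotations = pd.read_csv("annotations.csv")  # original COVID-HATE annotation file
implicit_idx = pd.read_csv("Data/CH_Anti_Asian_hate_implicit_indexes.csv")
explicit_idx = pd.read_csv("Data/CH_Anti_Asianhate_explicit_indexes.csv")

# Select the implicitly / explicitly hateful rows of the Anti-Asian Hate class
# by the row indexes stored in the annotation files (assumed column name: "index").
implicit_samples = ch_annotations.loc[implicit_idx["index"]]
explicit_samples = ch_annotations.loc[explicit_idx["index"]]

print(len(implicit_samples), "implicit /", len(explicit_samples), "explicit samples")
```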
Roberta_model_data.py
: the RoBERTa model and functions to compute gradients and logits of a RoBERTa-based classifier
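The exact interface of Roberta_model_data.py is not reproduced here; the following is a minimal sketch of how the logits and input-embedding gradients of a RoBERTa-based binary classifier can be computed with Hugging Face transformers. The model name (roberta-base) and the choice of taking gradients at the word-embedding layer are assumptions.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Assumption: a generic RoBERTa-based binary classifier; in the paper this would be
# the toxicity classifier trained on the Wiki dataset.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

def logits_and_gradients(text: str):
    """Return the classifier logits and the gradient of the positive-class
    logit with respect to the input token embeddings."""
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens ourselves so gradients can be taken w.r.t. the embeddings.
    embeddings = model.roberta.embeddings.word_embeddings(enc["input_ids"]).detach()
    embeddings.requires_grad_(True)
    out = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"])
    out.logits[0, 1].backward()  # gradient of the positive ("toxic") logit
    return out.logits.detach(), embeddings.grad.detach()

logits, grads = logits_and_gradients("an example tweet")
```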
TCAV.py
: functions to calculate the sensitivity of a trained classifier to a human-defined concept (the TCAV scores described in Section 4 of the paper)
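For orientation, a TCAV score is typically computed in two steps: a linear probe trained to separate concept activations from random activations gives the concept activation vector (CAV), and the score is the fraction of examples from a class whose directional derivative along that CAV is positive. The sketch below, using NumPy and scikit-learn, assumes the activations and gradients have already been extracted from the classifier (e.g. with functions like those above); it is illustrative and not the repository's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Train a linear probe (concept vs. random activations) and return its
    unit-normalized weight vector as the concept activation vector (CAV)."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_.ravel()
    return cav / np.linalg.norm(cav)

def tcav_score(gradients: np.ndarray, cav: np.ndarray) -> float:
    """Fraction of examples whose directional derivative along the CAV is
    positive, i.e. whose prediction is positively sensitive to the concept."""
    directional_derivatives = gradients @ cav
    return float((directional_derivatives > 0).mean())
```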
DoE.py
: functions to calculate the Degree of Explicitness (the DoE scores described in Sections 5 and 6 of the paper)
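The precise definition of the DoE score is given in Section 5 of the paper; purely as an illustration, the sketch below scores each example by the directional derivative of the classifier's output along an "explicitness" CAV, which is one way such a per-example score could be computed. Treat it as an assumption, not the method implemented in DoE.py.

```python
import numpy as np

def doe_scores(gradients: np.ndarray, explicitness_cav: np.ndarray) -> np.ndarray:
    """Per-example Degree-of-Explicitness scores, sketched here as directional
    derivatives of the classifier's output along an 'explicitness' CAV.
    This is an illustrative assumption; see Section 5 of the paper and
    DoE.py for the actual definition."""
    return gradients @ explicitness_cav

# Example use: rank unlabelled abusive posts from most implicit (low score)
# to most explicit (high score).
# order = np.argsort(doe_scores(grads_matrix, cav))
```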
The following notebooks illustrate how to use the above functionality. In all of the notebooks, the Toxicity classifier refers to a RoBERTa-based binary classifier trained on the Wiki dataset.
TCAV_Example.ipynb
: This notebook shows how to calculate the sensitivity of a trained classifier to a human-defined concept (similar to the results in Table 5 of the paper).
DoE_example.ipynb
: This Colab notebook calculates the Degree of Explicitness (the DoE scores introduced in Section 5 of the paper).