This repository serves as a comprehensive collection of resources for the automated identification of online Gender-Based Violence (GBV) and related phenomena.
For further details, see our systematic review paper:
Gavin Abercrombie, Aiqi Jiang, Poppy Gerrard-Abbott, Ioannis Konstas, and Verena Rieser. 2023. Resources for Automated Identification of Online Gender-Based Violence: A Systematic Review. Proceedings of the 7th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics.
BibTeX:
@inproceedings{abercrombie-etal-2023-resources,
title = {Resources for Automated Identification of Online {G}ender-{B}ased {V}iolence: A Systematic Review},
author = {Abercrombie, Gavin and Jiang, Aiqi and Gerrard-Abbott, Poppy and Konstas, Ioannis and Rieser, Verena},
booktitle = {Proceedings of the 7th Workshop on Online Abuse and Harms (WOAH)},
month = {July},
year = {2023},
address = {Toronto},
publisher = {Association for Computational Linguistics},
}
Something missing?
You can contribute to this list by editing this file and making a pull request.
Please follow this template and add details at the bottom of the list.
Template:
| Reference | Title | Dataset URL | GBV characterisation | Platform | Language | Modality | Sampling | Date of data | Annotators | IRB | Non-aggregated labels | Data Statement |
- Reference: Link to publication or description of the resource
- Title: Publication or dataset name
- Dataset URL: Link to the dataset
- GBV characterisation: How GBV is described (e.g. 'misogyny', or 'gender' as a hate speech target)
- Platform: e.g. Twitter, TikTok etc.
- Language: e.g. Basque, Scottish Gaelic, Mi’kmaq etc.
- Modality: e.g. Text, Video, Meme etc.
- Sampling: How the source data was sampled e.g. keywords, targeted accounts, random etc.
- Date of data: The dates between which the source data was produced/published
- Annotators: The number and type of annotators e.g. 3 students, 10000 crowdworkers, 5 per item etc.
- IRB: Whether the study passed ethical review by an Institutional Review Board or similar
- Non-aggregated labels: Whether the non-aggregated labels have been released
- Data Statement: Whether there is a data statement (Bender & Friedman, TACL 2018) describing the resource
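As a rough aid for contributors, the template above can be sketched as a small Python helper that builds one table row and checks that all thirteen fields are present. The function name `format_row` and the placeholder entry are hypothetical and not part of this repository; the column names are taken from the template.

```python
# Column names copied from the template, in table order.
COLUMNS = [
    "Reference", "Title", "Dataset URL", "GBV characterisation", "Platform",
    "Language", "Modality", "Sampling", "Date of data", "Annotators", "IRB",
    "Non-aggregated labels", "Data Statement",
]


def format_row(entry: dict) -> str:
    """Return one Markdown table row with cells in template order.

    Hypothetical helper: raises ValueError if a template field is missing
    or an unknown field is supplied, so rows cannot silently shift columns.
    """
    missing = [c for c in COLUMNS if c not in entry]
    unexpected = [k for k in entry if k not in COLUMNS]
    if missing or unexpected:
        raise ValueError(f"missing: {missing}; unexpected: {unexpected}")
    # Escape literal pipes so they are not read as cell separators.
    cells = [str(entry[c]).strip().replace("|", "\\|") for c in COLUMNS]
    return "| " + " | ".join(cells) + " |"


if __name__ == "__main__":
    # Placeholder values only; replace them with the details of the resource.
    example = {
        "Reference": "Author and Author, YYYY",
        "Title": "Example dataset title",
        "Dataset URL": "N/A",
        "GBV characterisation": "Sexism",
        "Platform": "Unknown",
        "Language": "English",
        "Modality": "Text",
        "Sampling": "Keywords",
        "Date of data": "Unknown",
        "Annotators": "Unknown",
        "IRB": "No",
        "Non-aggregated labels": "No",
        "Data Statement": "No",
    }
    print(format_row(example))
```

The printed line can be pasted at the bottom of the table. Requiring every field, even when the value is 'Unknown' or 'N/A', keeps later cells from sliding into the wrong column.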
Reference | Title | Dataset URL | GBV characterisation | Platform | Language | Modality | Sampling | Date of data | Annotators | IRB | Non-aggregated labels | Data Statement |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Al-Hassan and Al-Dossari, 2022 | Detection of hate speech in Arabic tweets using deep learning | N/A | Sexism | | Arabic | Text | Keywords | Unknown | 2 volunteers | No | No | No |
Almanea and Poesio, 2022 | ArMIS - The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements | https://codalab.lisn.upsaclay.fr/competitions/6146#learn_the_details-get_starting_kit | Misogyny, Sexism | | Arabic | Text | Keywords | October 2020 | 3 main annotators and 32 others. Self-defined beliefs and gender | No | Yes | No |
Alsafari et al., 2020 | Hate and offensive speech detection on Arabic social media | https://github.com/sbalsefri/ArabicHateSpeechDataset | Gender as category | | Arabic (Gulf) | Text | Keywords, hashtags, profiles | April - September 2019 | 3: 2 women, 1 man | No | No | No |
Anzovino et al., 2018 | Automatic Identification and Classification of Misogynistic Language on Twitter | https://amievalita2018.wordpress.com/data/ | Misogyny | | English | Text | Keywords, hashtags, mentions of potentially harassed users, self-declared misogynist profiles | 2017 | 3 experts + crowdworkers | No | No | Yes |
Assenmacher et al., 2021 | RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets | https://zenodo.org/record/5291339#.Y6RfyOLP3S6 | Sexism | Rheinische Post | German | Text | Comments blocked by community managers | Nov. 2018 - June 2020 | 5 per item | No | No | No |
Basile et al., 2019 | SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter | https://competitions.codalab.org/competitions/19935 | Women as target | | English, Spanish | Text | Victims of hate accounts; identified haters; keywords | July 2018 - Sept. 2018 + from earlier datasets | Crowdworkers | No | No | No |
Bhattacharya et al., 2020 | Developing a Multilingual Annotated Corpus of Misogyny and Aggression | https://sites.google.com/view/trac2/shared-task?pli=1 | Misogyny | Facebook, Twitter, YouTube | Bangla, English, Hindi, code-mixed | Text | Topics | Unknown | 4 linguists 'expected to have a centrist or left-leaning political orientation' | No | No | No |
Borkan et al., 2019 | Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification | https://www.tensorflow.org/datasets/catalog/civil_comments | Gender as category, subgroups: Male, Female, Transgender, Other gender | Comment forums | English | Text | Unknown | Unknown | Crowdworkers | Unknown | No | No |
Bosco et al., 2018 | Overview of the EVALITA 2018 Hate Speech Detection Task | http://www.di.unito.it/~tutreeb/haspeede-evalita18/data.html | 'Gender issues'-based hate | Facebook, Twitter | Italian | Text | Facebook: targeted pages and groups; Twitter: keywords | Facebook: 2016; Twitter: 2017-2018 | Facebook: bachelor students; Twitter: experts and crowdworkers | No | No | No |
Cercas Curry et al., 2021 | ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI | https://github.com/amandacurry/convabuse | Sexism, sexual harassment | Dialogue systems: ELIZA, CarbonBot (Facebook) | English | Text | Stratified keyword | CarbonBot: Oct. 2019 - Dec. 2020; ELIZA: Dec. 2002 - Nov. 2007 | 6 female and 2 non-binary Gender Studies students | Yes | Yes | Yes |
Chiril et al., 2021 | "Be nice to your wife! The restaurants are closed": Can Gender Stereotype Detection Improve Sexism Classification? | https://bit.ly/FrenchGenderStereotypes | Sexism | | French | Text | Keywords, personal names, hashtags | Unknown | 1 male, 1 female students in Linguistics and Communication and Gender | No | No | No |
Chiril et al., 2019 | Multilingual and Multitarget Hate Speech Detection in Tweets | N/A | Sexism | | French | Text | Keywords | Oct. 2017 - May 2018 | 2 female and 1 male students in Communication and Gender | No | No | No |
Chiril et al., 2020 | An Annotated Corpus for Sexism Detection in French Tweets | https://github.com/patriChiril/An-Annotated-Corpus-for-Sexism-Detection-in-French-Tweets | Sexism | | French | Text | Keywords, hashtags, personal names | Oct. 2017 - May 2018 | 3 female and 2 male Communication and Gender students | No | No | No |
Chung and Lin, 2021 | TOCAB: A Dataset for Chinese Abusive Language Processing | http://nlp.cse.ntou.edu.tw/resources/TOCAB/ | Sex (gender, sexual orientation, or gender identity) as abuse category | PTT (Taiwanese bulletin board) | Chinese | Text | Popular posts | Mar. 2019 - June 2019 | 12 students | No | No | No |
Das et al., 2022 | Hate Speech and Offensive Language Detection in Bengali | https://github.com/hate-alert/Bengali_Hate | Gender as target | | Bengali | Text | Keywords | Unknown | 4 Computer Science students | No | No | No |
El Ansari et al., 2020 | A Dataset to Support Sexist Content Detection in Arabic Text | N/A | Sexism, discrimination and Violence Against Women | | Arabic | Text | Keywords | 2018 | Volunteers | No | No | No |
Fanton et al., 2021 | Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech | https://github.com/marcoguerini/CONAN | Women as target | Semi-synthetic text | English | Text | Unknown | Unknown | 3 student interns | No | No | No |
Fersini et al., 2018 | Overview of the Task on Automatic Misogyny Identification at IberEval 2018 | https://amiibereval2018.wordpress.com/important-dates/data/ | Misogyny | | English, Spanish | Text | Keywords | July 2017 - Nov. 2017 | Unknown + crowdworkers | No | No | No |
Fersini et al., 2020 | AMI @ EVALITA2020: Automatic Misogyny Identification | https://github.com/dnozza/ami2020 | Misogyny | | Italian | Text | Unknown | 2018 + 2020 | Unknown | No | No | No |
Fersini et al., 2022 | SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification | https://competitions.codalab.org/competitions/34175 | Misogyny | Twitter, Reddit; meme sites e.g. 9GAG, Know Your Meme, Imgur | English | Memes | (1) Threads with women as the subject; (2) anti-women accounts; (3) target victim accounts; (4) keywords and hashtags | Unknown | Unknown | No | No | No |
García-Díaz et al., 2021 | Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings | https://pln.inf.um.es/corpora/misogyny/misocorpus-spanish-2020.rar | Misogyny | | Spanish | Text | Target accounts; geographical locations; keywords | Unknown | 5 women, 2 men: authors, 2 colleagues, 1 student | No | No | No |
Gomez et al., 2021 | Exploring hate speech detection in multimodal publications | https://drive.google.com/file/d/1S9mMhZFkntNnYdO-1dZXwF_8XIiFcmlF | Sexism | | English | Image + text | Keywords | Sept. 2018 - Feb. 2019 | 3 crowdworkers per item | No | No | No |
Gong et al., 2021 | Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention | N/A | Gender and sexuality as target | YouTube | English | Text | Keywords | 2017 | 17 psychology students, incl. 3 graduate students studying bullying and related phenomena | No | No | No |
Grosz and Conde-Cespedes, 2020 | Automatic Detection of Sexist Statements Commonly Used at the Workplace | https://github.com/dylangrosz/Automatic_Detection_of_Sexist_Statements_Commonly_Used_at_the_Workplace | Sexism | Twitter, work-related quotes, press quotes, faculty/student submissions | English | Text | Keywords | Unknown | Authors: 1 male, 1 female | No | No | No |
Guellil et al., 2021 | Sexism detection: The first corpus in Algerian dialect with a code-switching in Arabic/ French and English | N/A | Sexism | YouTube | Arabic (Algerian) | Text | Keywords and manually selected video IDs | Feb. - Mar. 2019 | 3 Algerian Arabic speakers | No | No | No |
Guest et al., 2021 | An Expert Annotated Dataset for the Detection of Online Misogyny | https://github.com/ellamguest/online-misogyny-eacl2021 | Misogyny | | English | Text | Targeted subreddits | Feb. - May 2020 | 6 annotators trained (by the authors) to identify misogynistic content | No | No | Yes |
Hewitt et al., 2016 | The problem of identifying misogynist language on Twitter (and other online social spaces) | N/A | Misogyny | | English | Text | Keywords | Unknown | 1 researcher | No | No | No |
Höfels et al., 2022 | CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets | https://aclanthology.org/2022.lrec-1.243/ | Sexism | | Romanian | Text | Keywords | May - Sept. 2021 | 10: 7 female, 3 male students in Languages and Literature and Modern Applied Languages 'with an interest/knowledge in gender studies' | No | Yes | Partial |
Ibrohim et al., 2019 | Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter | https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection | Hate based on gender as category | | Indonesian | Text | Previous datasets + keywords | Mar. - Oct. 2018 + old data | 30 from different backgrounds. 3 per item. | No | No | No |
Jha and Mamidi, 2017 | When does a compliment become sexist? Analysis and classification of ambivalent sexism using Twitter data | https://github.com/AkshitaJha/NLP_CSS_2017 | Sexism | | English | Text | Keywords | Unknown | Authors + three 23-year-old non-activist feminists | No | No | No |
Jiang et al., 2022 | SWSR: A Chinese dataset and lexicon for online sexism detection | https://zenodo.org/record/4773875#.Y5DTMYLP3ao | Sexism | Sina Weibo | Chinese | Text | Keywords | June 2015 - June 2020 | 3: 2 female and 1 male PhD students | No | No | No |
Jeong et al., 2022 | KOLD: Korean Offensive Language Dataset | https://github.com/boychaboy/KOLD | Gender and sexual orientation as target | NAVER news, YouTube | Korean | Text | Keywords | Mar. 2020 - Mar. 2022 | 3,124 crowdworkers | Yes | No | No |
Kennedy et al., 2020 | Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application | https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech | Gender identity as target, sexist speech | Twitter, Reddit, YouTube | English | Text | Stratified sampling with identity relevance and hate speech hypothesis scores | Mar. - Aug. 2019 | 7,912 crowdworkers | No | Yes | No |
Kennedy et al., 2022 | Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale | https://osf.io/edua3/ | Gender identity as target | Gab | English | Text | Targeted data source (Gab) | Jan. - Oct. 2018 | Min. 3 per item: undergraduate research assistants | No | No | No |
Kirk et al., 2023 | SemEval-2023 Task 10: Explainable Detection of Online Sexism | https://github.com/rewire-online/edos | Sexism | Gab, Reddit | English | Text | Targeted data sources | Aug. 2016 - Oct. 2018 | 19 women | No | Yes | Yes |
Kumar et al., 2018 | Aggression-annotated Corpus of Hindi-English Code-mixed Data | https://github.com/kraiyani/Facebook-Post-Aggression-Identification | Gendered aggression | Facebook, Twitter | Hindi-English | Text | Targeted pages, hashtags | Unknown | 4 PhD Linguistics students | No | No | No |
Kwarteng et al., 2022 | Misogynoir: challenges in detecting intersectional hate | https://github.com/kwartengj/Snam2022 | Misogynoir (misogyny aimed at Black women) | | English | Text | Unknown | Unknown | 3 | No | No | No |
Lee et al., 2022 | K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment | https://github.com/adlnlp/K-MHaS | Gender or sexual orientation as category | Korean entertainment news aggregation platform, Korean News Comments | Korean | Text | Random | Jan. 2018 - June 2020 | 5 Korean speakers | No | No | Partial |
Leite et al., 2020 | Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis | https://github.com/JAugusto97/ToLD-Br | Misogyny | | Portuguese (Brazilian) | Text | Keywords, hashtags, targeted users | July - Aug. 2019 | 42 volunteers at a university; 3 per item | No | Yes | Yes |
Lynn et al., 2019 | Urban Dictionary definitions dataset for misogyny speech detection | https://ieee-dataport.org/documents/urban-dictionary-definitions-dataset-misogyny-speech-detection | Misogyny | Urban Dictionary | English | Text | Keywords | 1999 - 2006 | 3 independent researchers | No | No | No |
Mathew et al., 2021 | HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection | https://github.com/hate-alert/HateXplain | Women as target | Gab, Twitter | English | Text | Keywords | Gab: unknown; Twitter: Jan. 2019 - June 2020 | 253 crowdworkers | No | No | No |
Mulki and Ghanem, 2021 | Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language | https://github.com/bilalghanem/let-mi | Misogyny | | Arabic (Levantine) | Text | Targeted journalists' accounts | Oct. - Nov. 2019 | 3 Levantine speakers: 1 male, 2 female | No | No | No |
Mollas et al., 2022 | ETHOS: a multi-label hate speech detection dataset | https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset | Gender as category | Reddit, YouTube | English | Text | Automated classification, targeted subreddits | Unknown - Oct. 2017 | Crowdworkers - 5 per item | No | No | No |
Moon et al., 2020 | BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection | https://github.com/kocohub/korean-hate-speech | 'Gender bias' as category | NAVER news | Korean | Text | Article views, stratified sampling by Wilson score; random | Jan. 2018 - Feb. 2020 | 32: 29 crowdworkers, 3 NLP researchers; 3 per item | No | No | No |
Ousidhoum et al., 2019 | Multilingual and Multi-Aspect Hate Speech Analysis | https://github.com/HKUST-KnowComp/MLMA_hate_speech | Gender as target | | Arabic, English, French | Text | Keywords | Unknown | Crowdworkers; 3 per item | No | No | No |
Petrak and Krenn, 2022 | Misogyny classification of German newspaper forum comments | N/A | Misogyny, sexism | Austrian newspaper | German | Text | Unknown | Unknown | 8: 7 experienced moderators, 3 male, 5 female | No | No | No |
Plaza et al., 2023 | Overview of EXIST 2023: sEXism Identification in Social neTworks | http://nlp.uned.es/exist2023/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords, random | Sept. 2021 - Sept. 2022 | 6 crowdworkers: 2 social/demographic parameters: gender (male/female), age (18-22/23-45/46+) | No | No | No |
de Pelle and Moreira, 2017 | Offensive Comments in the Brazilian Web: a dataset and baseline results | https://github.com/rogersdepelle/OffComBR | Sexism as category | Globo news | Portuguese (Brazilian) | Text | Targeted website sections | Unknown | 5 volunteers; 3 per item | No | No | No |
Rizwan et al., 2020 | Hate-Speech and Offensive Language Detection in Roman Urdu | https://github.com/haroonshakeel/roman_urdu_hate_speech | Sexism as category | | Urdu | Text | Keywords | Unknown | 3 | No | No | No |
Rodríguez-Sánchez et al., 2020 | Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data | https://github.com/franciscorodriguez92/MeTwo | Sexism | | Spanish | Text | Keywords, random | July - Dec. 2018 | 4 | No | No | No |
Rodríguez-Sánchez et al., 2021 | Overview of EXIST 2021: sEXism Identification in Social neTworks | http://nlp.uned.es/exist2021/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords, hashtags | Twitter: Dec. 2020 - Feb. 2021; Gab: Sept. 2016 - Aug. 2019 (Spanish), Aug. 2016 - Aug. 2019 (English) | 7: 5 crowdworkers, 2 experts in gender issues (1 man, 1 woman) | No | No | No |
Rodríguez-Sánchez et al., 2022 | Overview of EXIST 2022: sEXism Identification in Social neTworks | http://nlp.uned.es/exist2022/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords | Jan. 2022 | 6 experts in gender issues (3 men, 3 women) | No | No | No |
Romim et al., 2022 | BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts | https://github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media | Gender as category, male/female as targets | Facebook, TikTok, YouTube | Bangla | Text | Keywords, topics | 2017 - unknown | 50 students (32 male, 18 female) | No | No | No |
Samory et al., 2021 | “Call me sexist, but...” : Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples | https://search.gesis.org/research_data/SDN-10.7802-2251?doi=10.7802/2251 | Sexism | | English | Text | Keywords, hashtags | 2008 - 2019 | 5 crowdworkers | No | No | No |
Sharifirad and Jacovi, 2019 | Learning and Understanding Different Categories of Sexism Using Convolutional Neural Network's Filters | https://github.com/simarad1525/Dataset-to-detect-different-types-of-sexist-language | Sexism | | English | Text | Hashtags | Unknown | 13: 1 male and 12 female non-activists | No | No | No |
Sharifirad and Matwin, 2019 | When a Tweet is Actually Sexist. A more Comprehensive Classification of Different Online Harassment Categories and The Challenges in NLP | https://github.com/simarad1525/ECML_SIMAH_dataset_competition | Sexism | | English | Text | Hashtags | Unknown | Crowdworkers | No | No | No |
Strathern and Pfeffer, 2022 | Identifying Different Layers of Online Misogyny | Forthcoming | Misogyny | | English | Text | Specific user handle: @realamberheard | 2019-2021 | 2: 1 author, 1 student | No | No | No |
Talat, 2016 | Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter | https://github.com/zeeraktalat/hatespeech | Sexism as category | | English | Text | Hashtags | Unknown | 1 expert (feminist and anti-racism activists) + 3 others per item | No | Yes | No |
Talat and Hovy, 2016 | Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter | https://github.com/zeeraktalat/hatespeech | Sexism as category | | English | Text | Hashtags | Unknown | 2: author + woman studying gender studies, non-activist feminist | No | No | No |
Toosi, 2019 | Twitter sentiment analysis | https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech | Sexism | | English | Text | Unknown | Unknown | Unknown | No | No | No |
Vidgen et al., 2021 | Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection | https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset | Gender as target | Synthetically generated social media | English | Text | N/A | Ongoing | Recruited | No | No | Yes |
Yadav et al., 2023 | LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification | N/A | Sexism as category | | Arabic, English, French, German, Hindi, Spanish | Text | Keywords | Unknown | Unknown | No | No | No |
Zeinert et al., 2021 | Annotating Online Misogyny | https://huggingface.co/datasets/strombergnlp/bajer_danish_misogyny | Misogyny | Facebook, Twitter, Reddit | Danish | Text | Keywords | Unknown | 8 recruited: 6 female, 2 male | No | No | Yes |