HWU-NLP/GBV-Resources

GBV-Resources

This repository serves as a comprehensive collection of resources for the automated identification of online Gender-Based Violence (GBV) and related phenomena.

For further details, see our systematic review paper:

Gavin Abercrombie, Aiqi Jiang, Poppy Gerrard-Abbott, Ioannis Konstas, and Verena Rieser. 2023. Resources for Automated Identification of Online Gender-Based Violence: A Systematic Review. Proceedings of the 7th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics.

BibTeX:

@inproceedings{abercrombie-etal-2023-resources,
    title = {Resources for Automated Identification of Online {G}ender-{B}ased {V}iolence: A Systematic Review},
    author = {Abercrombie, Gavin and Jiang, Aiqi and Gerrard-Abbott, Poppy and Konstas, Ioannis and Rieser, Verena},
    booktitle = {Proceedings of the 7th Workshop on Online Abuse and Harms (WOAH)},
    month = {July},
    year = {2023},
    address = {Toronto},
    publisher = {Association for Computational Linguistics},
}

Contribute to this list

Something missing?

You can contribute to this list by editing this file and making a pull request.

Please follow this template and add details at the bottom of the list.

Template:

| Reference | Title | Dataset URL | GBV characterisation | Platform | Language | Modality | Sampling | Date of data | Annotators | IRB | Non-aggregated labels | Data Statement |
  • Reference: Link to publication or description of the resource
  • Title: Publication or dataset name
  • Dataset URL: Link to the dataset
  • GBV characterisation: How GBV is described (e.g. 'misogyny', 'gender' as a hate speech target)
  • Platform: e.g. Twitter, TikTok etc.
  • Language: e.g. Basque, Scottish Gaelic, Mi’kmaq etc.
  • Modality: e.g. Text, Video, Meme etc.
  • Sampling: How the source data was sampled e.g. keywords, targeted accounts, random etc.
  • Date of data: The dates between which the source data was produced/published
  • Annotators: The number and type of annotators e.g. 3 students, 10000 crowdworkers, 5 per item etc.
  • IRB: Whether the study passed ethical review by an Institutional Review Board or similar
  • Non-aggregated labels: Whether the non-aggregated labels have been released
  • Data Statement: Whether there is a data statement (Bender & Friedman, TACL 2018) describing the resource
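
As a convenience when preparing a pull request, the template above can be filled in programmatically. The following is a minimal sketch, not part of this repository: the helper name `format_row` and the example entry are hypothetical, and only the column names and their order come from the template.

```python
# Column names and order are copied from the template above.
COLUMNS = [
    "Reference", "Title", "Dataset URL", "GBV characterisation", "Platform",
    "Language", "Modality", "Sampling", "Date of data", "Annotators", "IRB",
    "Non-aggregated labels", "Data Statement",
]


def format_row(entry: dict) -> str:
    """Return one Markdown table row with the fields in template order.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    missing = [c for c in COLUMNS if c not in entry]
    extra = [k for k in entry if k not in COLUMNS]
    if missing or extra:
        raise ValueError(f"missing fields: {missing}; unexpected fields: {extra}")
    # Escape literal pipes so a value cannot split into extra cells.
    cells = [str(entry[c]).strip().replace("|", "\\|") for c in COLUMNS]
    return "| " + " | ".join(cells) + " |"


# Made-up entry for illustration only; this is not a real resource.
example = {
    "Reference": "Doe and Roe, 2024",
    "Title": "An Example Dataset",
    "Dataset URL": "https://example.org/dataset",
    "GBV characterisation": "Sexism",
    "Platform": "Twitter",
    "Language": "English",
    "Modality": "Text",
    "Sampling": "Keywords",
    "Date of data": "Jan. - Mar. 2024",
    "Annotators": "3 students",
    "IRB": "Yes",
    "Non-aggregated labels": "No",
    "Data Statement": "No",
}

print(format_row(example))
```

The printed line can be pasted at the bottom of the Datasets table. Raising an error on missing or unexpected fields is a design choice of this sketch, so that a misspelled column name fails instead of shifting cells into the wrong column.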

Datasets

| Reference | Title | Dataset URL | GBV characterisation | Platform | Language | Modality | Sampling | Date of data | Annotators | IRB | Non-aggregated labels | Data Statement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Al-Hassan and Al-Dossari, 2022 | Detection of hate speech in Arabic tweets using deep learning | N/A | Sexism | Twitter | Arabic | Text | Keywords | Unknown | 2 volunteers | No | No | No |
| Almanea and Poesio, 2022 | ArMIS - The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements | https://codalab.lisn.upsaclay.fr/competitions/6146#learn_the_details-get_starting_kit | Misogyny, Sexism | Twitter | Arabic | Text | Keywords | October 2020 | 3 main annotators and 32 others. Self-defined beliefs and gender | No | Yes | No |
| Alsafari et al., 2020 | Hate and offensive speech detection on Arabic social media | https://github.com/sbalsefri/ArabicHateSpeechDataset | Gender as category | Twitter | Arabic (Gulf) | Text | Keywords, hashtags, profiles | April - September 2019 | 3: 2 women, 1 man | No | No | No |
| Anzovino et al., 2018 | Automatic Identification and Classification of Misogynistic Language on Twitter | https://amievalita2018.wordpress.com/data/ | Misogyny | Twitter | English | Text | Keywords, hashtags, mentions of potentially harassed users, self-declared misogynist profiles | 2017 | 3 experts + crowdworkers | No | No | Yes |
| Assenmacher et al., 2021 | RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets | https://zenodo.org/record/5291339#.Y6RfyOLP3S6 | Sexism | Rheinische Post | German | Text | Comments blocked by community managers | Nov. 2018 - June 2020 | 5 per item | No | No | No |
| Basile et al., 2019 | SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter | https://competitions.codalab.org/competitions/19935 | Women as target | Twitter | English, Spanish | Text | Victims of hate accounts; identified haters; keywords | July 2018 - Sept. 2018 + from earlier datasets | Crowdworkers | No | No | No |
| Bhattacharya et al., 2020 | Developing a Multilingual Annotated Corpus of Misogyny and Aggression | https://sites.google.com/view/trac2/shared-task?pli=1 | Misogyny | Facebook, Twitter, YouTube | Bangla, English, Hindi, code-mixed | Text | Topics | Unknown | 4 linguists 'expected to have a centrist or left-leaning political orientation' | No | No | No |
| Borkan et al., 2019 | Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification | https://www.tensorflow.org/datasets/catalog/civil_comments | Gender as category, subgroups: Male, Female, Transgender, Other gender | Comment forums | English | Text | Unknown | Unknown | Crowdworkers | Unknown | No | No |
| Bosco et al., 2018 | Overview of the EVALITA 2018 Hate Speech Detection Task | http://www.di.unito.it/~tutreeb/haspeede-evalita18/data.html | 'Gender issues'-based hate | Facebook, Twitter | Italian | Text | Facebook: targeted pages and groups; Twitter: keywords | Facebook: 2016; Twitter: 2017-2018 | Facebook: bachelor students; Twitter: experts and crowdworkers | No | No | No |
| Cercas Curry et al., 2021 | ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI | https://github.com/amandacurry/convabuse | Sexism, sexual harassment | Dialogue systems: ELIZA, CarbonBot (Facebook) | English | Text | Stratified keyword | CarbonBot: Oct. 2019 - Dec. 2020; ELIZA: Dec. 2002 - Nov. 2007 | 6 female and 2 non-binary Gender Studies students | Yes | Yes | Yes |
| Chiril et al., 2021 | "Be nice to your wife! The restaurants are closed": Can Gender Stereotype Detection Improve Sexism Classification? | https://bit.ly/FrenchGenderStereotypes | Sexism | Twitter | French | Text | Keywords, personal names, hashtags | Unknown | 1 male, 1 female students in Linguistics and Communication and Gender | No | No | No |
| Chiril et al., 2019 | Multilingual and Multitarget Hate Speech Detection in Tweets | N/A | Sexism | Twitter | French | Text | Keywords | Oct. 2017 - May 2018 | 2 female and 1 male students in Communication and Gender | No | No | No |
| Chiril et al., 2020 | An Annotated Corpus for Sexism Detection in French Tweets | https://github.com/patriChiril/An-Annotated-Corpus-for-Sexism-Detection-in-French-Tweets | Sexism | Twitter | French | Text | Keywords, hashtags, personal names | Oct. 2017 - May 2018 | 3 female and 2 male Communication and Gender students | No | No | No |
| Chung and Lin, 2021 | TOCAB: A Dataset for Chinese Abusive Language Processing | http://nlp.cse.ntou.edu.tw/resources/TOCAB/ | Sex (gender, sexual orientation, or gender identity) as abuse category | PTT (Taiwanese bulletin board) | Chinese | Text | Popular posts | Mar. 2019 - June 2019 | 12 students | No | No | No |
| Das et al., 2022 | Hate Speech and Offensive Language Detection in Bengali | https://github.com/hate-alert/Bengali_Hate | Gender as target | Twitter | Bengali | Text | Keywords | Unknown | 4 Computer Science students | No | No | No |
| El Ansari et al., 2020 | A Dataset to Support Sexist Content Detection in Arabic Text | N/A | Sexism, discrimination and Violence Against Women | Twitter | Arabic | Text | Keywords | 2018 | Volunteers | No | No | No |
| Fanton et al., 2021 | Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech | https://github.com/marcoguerini/CONAN | Women as target | Semi-synthetic text | English | Text | Unknown | Unknown | 3 student interns | No | No | No |
| Fersini et al., 2018 | Overview of the Task on Automatic Misogyny Identification at IberEval 2018 | https://amiibereval2018.wordpress.com/important-dates/data/ | Misogyny | Twitter | English, Spanish | Text | Keywords | July 2017 - Nov. 2017 | Unknown + crowdworkers | No | No | No |
| Fersini et al., 2020 | AMI @ EVALITA2020: Automatic Misogyny Identification | https://github.com/dnozza/ami2020 | Misogyny | Twitter | Italian | Text | Unknown | 2018 + 2020 | Unknown | No | No | No |
| Fersini et al., 2022 | SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification | https://competitions.codalab.org/competitions/34175 | Misogyny | Twitter, Reddit; meme sites e.g. 9GAG, Know Your Meme, Imgur | English | Memes | (1) Threads with women as the subject; (2) anti-women accounts; (3) target victim accounts; (4) keywords and hashtags | Unknown | Unknown | No | No | No |
| García-Díaz et al., 2021 | Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings | https://pln.inf.um.es/corpora/misogyny/misocorpus-spanish-2020.rar | Misogyny | Twitter | Spanish | Text | Target accounts; geographical locations; keywords | Unknown | 5 women, 2 men: authors, 2 colleagues, 1 student | No | No | No |
| Gomez et al., 2021 | Exploring hate speech detection in multimodal publications | https://drive.google.com/file/d/1S9mMhZFkntNnYdO-1dZXwF_8XIiFcmlF | Sexism | Twitter | English | Image + text | Keywords | Sept. 2018 - Feb. 2019 | 3 crowdworkers per item | No | No | No |
| Gong et al., 2021 | Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention | N/A | Gender and sexuality as target | YouTube | English | Text | Keywords | 2017 | 17 psychology students, incl. 3 graduate students studying bullying and related phenomena | No | No | No |
| Grosz and Conde-Cespedes, 2020 | Automatic Detection of Sexist Statements Commonly Used at the Workplace | https://github.com/dylangrosz/Automatic_Detection_of_Sexist_Statements_Commonly_Used_at_the_Workplace | Sexism | Twitter, work-related quotes, press quotes, faculty/student submissions | English | Text | Keywords | Unknown | Authors: 1 male, 1 female | No | No | No |
| Guellil et al., 2021 | Sexism detection: The first corpus in Algerian dialect with a code-switching in Arabic/ French and English | N/A | Sexism | YouTube | Arabic (Algerian) | Text | Keywords and manually selected video IDs | Feb. - Mar. 2019 | 3 Algerian Arabic speakers | No | No | No |
| Guest et al., 2021 | An Expert Annotated Dataset for the Detection of Online Misogyny | https://github.com/ellamguest/online-misogyny-eacl2021 | Misogyny | Reddit | English | Text | Targeted subreddits | Feb. - May 2020 | 6 annotators trained (by the authors) to identify misogynistic content | No | No | Yes |
| Hewitt et al., 2016 | The problem of identifying misogynist language on Twitter (and other online social spaces) | N/A | Misogyny | Twitter | English | Text | Keywords | Unknown | 1 researcher | No | No | No |
| Höfels et al., 2022 | CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets | https://aclanthology.org/2022.lrec-1.243/ | Sexism | Twitter | Romanian | Text | Keywords | May - Sept. 2021 | 10: 7 female, 3 male students in Languages and Literature and Modern Applied Languages 'with an interest/knowledge in gender studies' | No | Yes | Partial |
| Ibrohim et al., 2019 | Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter | https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection | Hate based on gender as category | Twitter | Indonesian | Text | Previous datasets + keywords | Mar. - Oct. 2018 + old data | 30 from different backgrounds; 3 per item | No | No | No |
| Jha and Mamidi, 2017 | When does a compliment become sexist? Analysis and classification of ambivalent sexism using Twitter data | https://github.com/AkshitaJha/NLP_CSS_2017 | Sexism | Twitter | English | Text | Keywords | Unknown | Authors + three 23-year-old non-activist feminists | No | No | No |
| Jiang et al., 2022 | SWSR: A Chinese dataset and lexicon for online sexism detection | https://zenodo.org/record/4773875#.Y5DTMYLP3ao | Sexism | Sina Weibo | Chinese | Text | Keywords | June 2015 - June 2020 | 3: 2 female and 1 male PhD students | No | No | No |
| Jeong et al., 2022 | KOLD: Korean Offensive Language Dataset | https://github.com/boychaboy/KOLD | Gender and sexual orientation as target | NAVER news, YouTube | Korean | Text | Keywords | Mar. 2020 - Mar. 2022 | 3,124 crowdworkers | Yes | No | No |
| Kennedy et al., 2020 | Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application | https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech | Gender identity as target, sexist speech | Twitter, Reddit, YouTube | English | Text | Stratified sampling with identity relevance and hate speech hypothesis scores | Mar. - Aug. 2019 | 7,912 crowdworkers | No | Yes | No |
| Kennedy et al., 2022 | Introducing the Gab Hate Corpus: defining and applying hate-based rhetoric to social media posts at scale | https://osf.io/edua3/ | Gender identity as target | Gab | English | Text | Targeted data source (Gab) | Jan. - Oct. 2018 | Min. 3 per item: undergraduate research assistants | No | No | No |
| Kirk et al., 2023 | SemEval-2023 Task 10: Explainable Detection of Online Sexism | https://github.com/rewire-online/edos | Sexism | Gab, Reddit | English | Text | Targeted data sources | Aug. 2016 - Oct. 2018 | 19 women | No | Yes | Yes |
| Kumar et al., 2018 | Aggression-annotated Corpus of Hindi-English Code-mixed Data | https://github.com/kraiyani/Facebook-Post-Aggression-Identification | Gendered aggression | Facebook, Twitter | Hindi-English | Text | Targeted pages, hashtags | Unknown | 4 PhD Linguistics students | No | No | No |
| Kwarteng et al., 2022 | Misogynoir: challenges in detecting intersectional hate | https://github.com/kwartengj/Snam2022 | Misogynoir (misogyny aimed at Black women) | Twitter | English | Text | Unknown | Unknown | 3 | No | No | No |
| Lee et al., 2022 | K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment | https://github.com/adlnlp/K-MHaS | Gender or sexual orientation as category | Korean entertainment news aggregation platform, Korean News Comments | Korean | Text | Random | Jan. 2018 - June 2020 | 5 Korean speakers | No | No | Partial |
| Leite et al., 2020 | Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis | https://github.com/JAugusto97/ToLD-Br | Misogyny | Twitter | Portuguese (Brazilian) | Text | Keywords, hashtags, targeted users | July - Aug. 2019 | 42 volunteers at a university; 3 per item | No | Yes | Yes |
| Lynn et al., 2019 | Urban Dictionary definitions dataset for misogyny speech detection | https://ieee-dataport.org/documents/urban-dictionary-definitions-dataset-misogyny-speech-detection | Misogyny | Urban Dictionary | English | Text | Keywords | 1999 - 2006 | 3 independent researchers | No | No | No |
| Mathew et al., 2021 | HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection | https://github.com/hate-alert/HateXplain | Women as target | Gab, Twitter | English | Text | Keywords | Gab: unknown; Twitter: Jan. 2019 - June 2020 | 253 crowdworkers | No | No | No |
| Mulki and Ghanem, 2021 | Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language | https://github.com/bilalghanem/let-mi | Misogyny | Twitter | Arabic (Levantine) | Text | Targeted journalists' accounts | Oct. - Nov. 2019 | 3: 1 male and 2 female Levantine speakers | No | No | No |
| Mollas et al., 2022 | ETHOS: a multi-label hate speech detection dataset | https://github.com/intelligence-csd-auth-gr/Ethos-Hate-Speech-Dataset | Gender as category | Reddit, YouTube | English | Text | Automated classification, targeted subreddits | Unknown - Oct. 2017 | Crowdworkers; 5 per item | No | No | No |
| Moon et al., 2020 | BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection | https://github.com/kocohub/korean-hate-speech | 'Gender bias' as category | NAVER news | Korean | Text | Article views, stratified sampling by Wilson score; random | Jan. 2018 - Feb. 2020 | 32: 29 crowdworkers, 3 NLP researchers; 3 per item | No | No | No |
| Ousidhoum et al., 2019 | Multilingual and Multi-Aspect Hate Speech Analysis | https://github.com/HKUST-KnowComp/MLMA_hate_speech | Gender as target | Twitter | Arabic, English, French | Text | Keywords | Unknown | Crowdworkers; 3 per item | No | No | No |
| Petrak and Krenn, 2022 | Misogyny classification of German newspaper forum comments | N/A | Misogyny, sexism | Austrian newspaper | German | Text | Unknown | Unknown | 8: 7 experienced moderators, 3 male, 5 female | No | No | No |
| Plaza et al., 2023 | Overview of EXIST 2023: sEXism Identification in Social NeTworks | http://nlp.uned.es/exist2023/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords, random | Sept. 2021 - Sept. 2022 | 6 crowdworkers; 2 social/demographic parameters: gender (male/female), age (18-22/23-45/46+) | No | No | No |
| de Pelle and Moreira, 2017 | Offensive Comments in the Brazilian Web: a dataset and baseline results | https://github.com/rogersdepelle/OffComBR | Sexism as category | Globo news | Portuguese (Brazilian) | Text | Targeted website sections | Unknown | 5 volunteers; 3 per item | No | No | No |
| Rizwan et al., 2020 | Hate-Speech and Offensive Language Detection in Roman Urdu | https://github.com/haroonshakeel/roman_urdu_hate_speech | Sexism as category | Twitter | Urdu | Text | Keywords | Unknown | 3 | No | No | No |
| Rodríguez-Sánchez et al., 2020 | Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data | https://github.com/franciscorodriguez92/MeTwo | Sexism | Twitter | Spanish | Text | Keywords, random | July - Dec. 2018 | 4 | No | No | No |
| Rodríguez-Sánchez et al., 2021 | Overview of EXIST 2021: sEXism Identification in Social neTworks | http://nlp.uned.es/exist2021/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords, hashtags | Twitter: Dec. 2020 - Feb. 2021; Gab: Sept. 2016 - Aug. 2019 (Spanish), Aug. 2016 - Aug. 2019 (English) | 7: 5 crowdworkers, 2 experts in gender issues (1 man, 1 woman) | No | No | No |
| Rodríguez-Sánchez et al., 2022 | Overview of EXIST 2022: sEXism Identification in Social neTworks | http://nlp.uned.es/exist2022/ | Sexism | Gab, Twitter | English, Spanish | Text | Keywords | Jan. 2022 | 6 experts in gender issues (3 men, 3 women) | No | No | No |
| Romim et al., 2022 | BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts | https://github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media | Gender as category, male/female as targets | Facebook, TikTok, YouTube | Bangla | Text | Keywords, topics | 2017 - unknown | 50 students (32 male, 18 female) | No | No | No |
| Samory et al., 2021 | “Call me sexist, but...”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples | https://search.gesis.org/research_data/SDN-10.7802-2251?doi=10.7802/2251 | Sexism | Twitter | English | Text | Keywords, hashtags | 2008 - 2019 | 5 crowdworkers | No | No | No |
| Sharifirad and Jacovi, 2019 | Learning and Understanding Different Categories of Sexism Using Convolutional Neural Network's Filters | https://github.com/simarad1525/Dataset-to-detect-different-types-of-sexist-language | Sexism | Twitter | English | Text | Hashtags | Unknown | 13: 1 male and 12 female non-activists | No | No | No |
| Sharifirad and Matwin, 2019 | When a Tweet is Actually Sexist. A more Comprehensive Classification of Different Online Harassment Categories and The Challenges in NLP | https://github.com/simarad1525/ECML_SIMAH_dataset_competition | Sexism | Twitter | English | Text | Hashtags | Unknown | Crowdworkers | No | No | No |
| Strathern and Pfeffer, 2022 | Identifying Different Layers of Online Misogyny | Forthcoming | Misogyny | Twitter | English | Text | Specific user handle: @realamberheard | 2019-2021 | 2: 1 author, 1 student | No | No | No |
| Talat, 2016 | Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter | https://github.com/zeeraktalat/hatespeech | Sexism as category | Twitter | English | Text | Hashtags | Unknown | 1 expert (feminist and anti-racism activists) + 3 others per item | No | Yes | No |
| Talat and Hovy, 2016 | Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter | https://github.com/zeeraktalat/hatespeech | Sexism as category | Twitter | English | Text | Hashtags | Unknown | 2: author + woman studying gender studies, non-activist feminist | No | No | No |
| Toosi, 2019 | Twitter sentiment analysis | https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech | Sexism | Twitter | English | Text | Unknown | Unknown | Unknown | No | No | No |
| Vidgen et al., 2021 | Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection | https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset | Gender as target | Synthetically generated social media | English | Text | N/A | Ongoing | Recruited | No | No | Yes |
| Yadav et al., 2023 | LAHM: Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification | N/A | Sexism as category | Twitter | Arabic, English, French, German, Hindi, Spanish | Text | Keywords | Unknown | Unknown | No | No | No |
| Zeinert et al., 2021 | Annotating Online Misogyny | https://huggingface.co/datasets/strombergnlp/bajer_danish_misogyny | Misogyny | Facebook, Twitter, Reddit | Danish | Text | Keywords | Unknown | 8 recruited: 6 female, 2 male | No | No | Yes |
