A Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation (Levy et al., Findings of EMNLP 2021).
BUG was collected semi-automatically from different real-world corpora, designed to be challenging in terms of societal gender role assignments for machine translation and coreference resolution.
- Unzip `data.tar.gz`; this should create a `data` folder with the following files:
  - `balanced_BUG.csv`
  - `full_BUG.csv`
  - `gold_BUG.csv`
- Set up a Python 3.x environment and install requirements: `pip install -r requirements.txt`
NOTE: These partitions vary slightly from those reported in the paper due to improvements and bug fixes after submission. For reproducibility's sake, you can access the dataset from the submission here.
- `full_BUG.csv`: 105,687 sentences with a human entity, identified by their profession and a gendered pronoun.
- `gold_BUG.csv`: 1,717 sentences, the gold-quality human-validated samples.
- `balanced_BUG.csv`: 25,504 sentences, randomly sampled from Full BUG to ensure balance between male and female entities and between stereotypical and non-stereotypical gender role assignments.
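Each partition is a plain CSV, so it can be loaded with pandas. A minimal sketch, using an in-memory toy CSV as a stand-in for the real file (in practice you would pass e.g. `data/balanced_BUG.csv` to `pd.read_csv`); the column names follow the schema documented below:

```python
from io import StringIO

import pandas as pd

# Toy stand-in for data/balanced_BUG.csv, with a subset of the real columns.
toy_csv = StringIO(
    "sentence_text,predicted gender,stereotype\n"
    "The doctor said she would arrive soon.,female,-1\n"
    "The nurse said he was on his way.,male,-1\n"
)
df = pd.read_csv(toy_csv)

# Balanced BUG is sampled so male and female entities are balanced;
# this is the kind of sanity check one might run on the real file.
print(df["predicted gender"].value_counts().to_dict())  # {'female': 1, 'male': 1}
```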
Each file in the data folder is a csv file adhering to the following format:
| Column | Header | Description |
|---|---|---|
1 | sentence_text | Text of sentences with a human entity, identified by their profession and a gendered pronoun |
2 | tokens | List of tokens (using spacy tokenizer) |
3 | profession | The entity in the sentence |
4 | g | The pronoun in the sentence |
5 | profession_first_index | Word offset of profession in sentence |
6 | g_first_index | Word offset of pronoun in sentence |
7 | predicted gender | 'male'/'female' determined by the pronoun |
8 | stereotype | -1/0/1 for anti-stereotype, neutral, and stereotype sentences |
9 | distance | The absolute distance in words between pronoun and profession |
10 | num_of_pronouns | Number of pronouns in the sentence |
11 | corpus | The corpus from which the sentence is taken |
12 | data_index | The query index of the pattern of the sentence |
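As a worked example of the schema, here is a hypothetical row (not an actual BUG sentence) showing how `distance` relates to the two offset columns:

```python
# Hypothetical row following the schema above (not taken from the dataset).
row = {
    "sentence_text": "The developer argued with the designer because she did not like the design.",
    "tokens": ["The", "developer", "argued", "with", "the", "designer",
               "because", "she", "did", "not", "like", "the", "design", "."],
    "profession": "developer",
    "g": "she",
    "profession_first_index": 1,  # tokens[1] == "developer"
    "g_first_index": 7,           # tokens[7] == "she"
    "predicted gender": "female",
    "stereotype": -1,             # anti-stereotypical role assignment
}

# Column 9 (distance): absolute word distance between pronoun and profession.
distance = abs(row["g_first_index"] - row["profession_first_index"])
print(distance)  # 6
```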
See below for instructions to reproduce our evaluations on BUG.
- Download the SpanBERT predictions from this link.
- Unzip and put `coref_preds.jsonl` in the `predictions/` folder.
- From `src/evaluations/`, run `python evaluate_coref.py --in=../../predictions/coref_preds.jsonl --out=../../visualizations/delta_s_by_dist.png`.
- This should reproduce the coreference evaluation figure.
To convert each data partition to CoNLL format, run:
`python convert_to_conll.py --in=path/to/input/file --out=path/to/output/file`
For example, try:
`python convert_to_conll.py --in=../../data/gold_BUG.csv --out=./gold_bug.conll`
- Download the wanted SPIKE csv files and save them all in the same directory (`directory_path`).
- Make sure the name of each file ends with `_<corpus><x>.csv`, where `corpus` is the name of the SPIKE dataset and `x` is the number of the query you entered on search (for example, `myspikedata_wikipedia18.csv`).
- From `src/evaluations/`, run `python Analyze.py directory_path`.
- This should reproduce the full dataset and balanced dataset.
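The corpus name and query number can be recovered from file names following that convention. A small sketch; the regex below is an assumption inferred from the example file name `myspikedata_wikipedia18.csv`, not code from this repository:

```python
import re

def parse_spike_filename(filename):
    """Extract (corpus, query_number) from a SPIKE export file name.

    Assumes the convention described above: the name ends with
    _<corpus><x>.csv, where <corpus> is alphabetic and <x> is the
    query number, e.g. myspikedata_wikipedia18.csv.
    """
    match = re.search(r"_([A-Za-z]+)(\d+)\.csv$", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    corpus, query = match.groups()
    return corpus, int(query)

print(parse_spike_filename("myspikedata_wikipedia18.csv"))  # ('wikipedia', 18)
```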
@misc{levy2021collecting,
title={Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation},
author={Shahar Levy and Koren Lazar and Gabriel Stanovsky},
year={2021},
eprint={2109.03858},
archivePrefix={arXiv},
primaryClass={cs.CL}
}