Supporting repo for "Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution" (CogSci 2023).
First, install all packages used in this repository:
pip install -r requirements.txt
Tested on Python 3.7.
Convert a model's raw jsonl results file into lists of correct and incorrect sentence ids:
python analyze_models/generate_result_ids_lists.py --out_path experiment_results/processed/models/eval.json --model_name s2e --model_name SpanBERT
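The easiest way to see what the script produced is to load eval.json and inspect it. The snippet below is a minimal Python sketch; the assumption that the top-level keys are model names is ours, not documented behavior.

import json

# Load the processed evaluation file produced by the command above.
# The layout assumed here (top-level keys = model names) is a guess;
# inspect the file to confirm.
with open("experiment_results/processed/models/eval.json") as f:
    eval_results = json.load(f)

print(list(eval_results.keys()))  # expected: the evaluated model names, e.g. s2e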
Generate the QA results tables from the raw human csv results.
BUG:
python analysis_scripts/QA/generate_human_results_table.py --dataset BUG --out_path experiment_results/processed/humans/QA/BUG
Wino:
python analysis_scripts/QA/generate_human_results_table.py --dataset wino --out_path experiment_results/processed/humans/QA/wino
Generate plots from the raw human results on the MAZE task.
BUG:
python analysis_scripts/MAZE/BUG_plots.py --out_path experiment_results/processed/humans/MAZE/BUG
Wino:
python analysis_scripts/MAZE/wino_plots.py --out_path experiment_results/processed/humans/MAZE/wino
DELTA plots (difference between pro- and anti-stereotype results, for both datasets):
python analysis_scripts/MAZE/delta_plots.py --results_dir experiment_results/processed/humans/MAZE
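For reference, the delta plotted here is presumably the simple accuracy gap between the two conditions; a minimal, illustrative sketch in Python (the numbers are made up, not real results):

def delta(acc_pro: float, acc_anti: float) -> float:
    # Pro-stereotype minus anti-stereotype accuracy: positive values
    # mean better performance on pro-stereotypical sentences.
    return acc_pro - acc_anti

print(delta(0.90, 0.75))  # 0.15; illustrative numbers only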
Generate a combined results table:
python analysis_scripts/QA/generate_combined_results.py --models_path experiment_results/processed/models/eval.json --humans_path experiment_results/processed/humans/QA/ --out experiment_results/processed/final/QA_combined_table.md
Plot the results on a graph:
python analysis_scripts/QA/plot_results.py --in_results_md experiment_results/processed/final/QA_combined_table.md --out experiment_results/processed/QA_plot.png
Qualitative analysis:
python analysis_scripts/QA/qualitative_analysis.py --models_path experiment_results/processed/models/eval.json --humans_path experiment_results/processed/humans/QA --out experiment_results/processed/final
Generate a combined plot for humans and models, one for wino and one for BUG:
python analysis_scripts/MAZE/with_models_plot.py --human_results experiment_results/processed/humans/MAZE --models_path experiment_results/processed/models/eval.json --dataset wino --out experiment_results/processed/final
python analysis_scripts/MAZE/with_models_plot.py --human_results experiment_results/processed/humans/MAZE --models_path experiment_results/processed/models/eval.json --dataset BUG --out experiment_results/processed/final
analysis_scripts: This directory contains all the code for analyzing and comparing human and model results.
experiment_results: Contains all the results. experiment_results/raw: the raw results of the two human evaluation tasks, as well as the models' raw results. experiment_results/processed: the outputs of all the scripts in analysis_scripts.
original_data: The original datasets (winogender, winobias, gold BUG), as well as some metadata.
To use the above scripts to analyze a new coreference model, take the following steps:
- Run your model over the BUG and wino datasets.
- To run the model results processing script, you will need to generate a jsonl file for each dataset's results. See format example here; an illustrative sketch also follows this list.
- Place your jsonl files (one for wino, one for BUG) under experiment_results/raw/models/.
- Run analyze_models/generate_result_ids_lists.py with the option --model_name <your model name>, as well as all the other necessary options mentioned above.
- Run all the scripts from the 'Compare human and model results' section.
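As a reference for the jsonl step above, here is a minimal Python sketch of writing one results file. The field names (sentence_id, is_correct) and the file name are hypothetical placeholders; follow the linked format example rather than this sketch.

import json

# One json object per line; the fields below are illustrative only.
# Match them to the repository's format example.
predictions = [
    {"sentence_id": 0, "is_correct": True},
    {"sentence_id": 1, "is_correct": False},
]

# Hypothetical file name; place one file per dataset under
# experiment_results/raw/models/.
with open("experiment_results/raw/models/my_model_wino.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")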