Supporting repo for "Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution" (CogSci 2023).
First, install all packages used in this repository:
pip install -r requirements.txt
Tested on Python 3.7.
Convert a model's raw jsonl results file into lists of correct and incorrect sentence ids:
python analyze_models/generate_result_ids_lists.py --out_path experiment_results/processed/models/eval.json --model_name s2e --model_name SpanBERT
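The easiest way to see what the script produced is to load eval.json and inspect it. The snippet below is a minimal Python sketch; the assumption that the top-level keys are model names is ours, not documented behavior.

import json

# Load the processed evaluation file produced by the command above.
# The layout assumed here (top-level keys = model names) is a guess;
# inspect the file to confirm.
with open("experiment_results/processed/models/eval.json") as f:
    eval_results = json.load(f)

print(list(eval_results.keys()))  # expected: the evaluated model names, e.g. s2e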
Generate the QA results tables from the raw human csv results.
BUG:
python analysis_scripts/QA/generate_human_results_table.py --dataset BUG --out_path experiment_results/processed/humans/QA/BUG
Wino:
python analysis_scripts/QA/generate_human_results_table.py --dataset wino --out_path experiment_results/processed/humans/QA/wino
Generate plots from the raw human results on the MAZE task.
BUG:
python analysis_scripts/MAZE/BUG_plots.py --out_path experiment_results/processed/humans/MAZE/BUG
Wino:
python analysis_scripts/MAZE/wino_plots.py --out_path experiment_results/processed/humans/MAZE/wino
DELTA plots (difference between pro- and anti-stereotype results, for both datasets):
python analysis_scripts/MAZE/delta_plots.py --results_dir experiment_results/processed/humans/MAZE
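For reference, the delta plotted here is presumably the simple accuracy gap between the two conditions; a minimal, illustrative sketch in Python (the numbers are made up, not real results):

def delta(acc_pro: float, acc_anti: float) -> float:
    # Pro-stereotype minus anti-stereotype accuracy: positive values
    # mean better performance on pro-stereotypical sentences.
    return acc_pro - acc_anti

print(delta(0.90, 0.75))  # 0.15; illustrative numbers only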
Generate a combined results table:
python analysis_scripts/QA/generate_combined_results.py --models_path experiment_results/processed/models/eval.json --humans_path experiment_results/processed/humans/QA/ --out experiment_results/processed/final/QA_combined_table.md
Plot the results on a graph:
python analysis_scripts/QA/plot_results.py --in_results_md experiment_results/processed/final/QA_combined_table.md --out experiment_results/processed/QA_plot.png
Qualitative analysis:
python analysis_scripts/QA/qualitative_analysis.py --models_path experiment_results/processed/models/eval.json --humans_path experiment_results/processed/humans/QA --out experiment_results/processed/final
Generate a combined plot for humans and models, one for wino and one for BUG:
python analysis_scripts/MAZE/with_models_plot.py --human_results experiment_results/processed/humans/MAZE --models_path experiment_results/processed/models/eval.json --dataset wino --out experiment_results/processed/final
python analysis_scripts/MAZE/with_models_plot.py --human_results experiment_results/processed/humans/MAZE --models_path experiment_results/processed/models/eval.json --dataset BUG --out experiment_results/processed/final
analysis_scripts: This directory contains all the code for analyzing and comparing human and model results.
experiment_results: Contains all the results. experiment_results/raw: the raw results of the two human evaluation tasks, as well as the models' raw results. experiment_results/processed: the outputs of all the scripts in analysis_scripts.
original_data: The original datasets (winogender, winobias, gold BUG), as well as some metadata.
To use the above scripts to analyze a new coreference model, take the following steps:
- Run your model over the BUG and wino datasets.
- To run the model results processing script, you will need to generate a jsonl file for each dataset's results. See format example here; an illustrative sketch also follows this list.
- Place your jsonl files (one for wino, one for BUG) under experiment_results/raw/models/.
- Run analyze_models/generate_result_ids_lists.py with the option --model_name <your model name>, as well as all the other necessary options mentioned above.
- Run all the scripts from the 'Compare human and model results' section.
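As a reference for the jsonl step above, here is a minimal Python sketch of writing one results file. The field names (sentence_id, is_correct) and the file name are hypothetical placeholders; follow the linked format example rather than this sketch.

import json

# One json object per line; the fields below are illustrative only.
# Match them to the repository's format example.
predictions = [
    {"sentence_id": 0, "is_correct": True},
    {"sentence_id": 1, "is_correct": False},
]

# Hypothetical file name; place one file per dataset under
# experiment_results/raw/models/.
with open("experiment_results/raw/models/my_model_wino.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")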