<a href="https://colab.research.google.com/github/DCSaunders/gender-debias/blob/master/gender_debias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reducing gender bias in neural machine translation as a domain adaptation problem
This notebook walks through the paper [Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem](https://arxiv.org/abs/2004.04498):



*   What the gender bias problem is in Neural Machine Translation (NMT), and how we can measure it.
*   How to debias an NMT model by *domain adaptation*: fine-tuning on a new, small, debiased dataset
*   How to avoid *catastrophic forgetting*: debiasing the model while keeping its general translation ability
*   How to use the debiased model for *lattice rescoring*: "correcting" words with the wrong gender inflection in machine translations.

This notebook walks through steps to reproduce our primary experiments, including tool installation, our baseline models and adaptation data.
<!-- 
Otherwise, to reproduce locally:

* Download the desired language pair files from the Google Drive https://drive.google.com/drive/folders/1XaWbXQQ8icZZE__fPBOj4dC24oiA3Qtv?usp=sharing  (For example, if you just want to reproduce our English-German experiments, you can ignore the folders ending with "enes" or "enhe" but should take everything else). These folders include our BPE vocabulary mappings and scripts to convert between plaintext and id-maps.
* Download Tensor2Tensor version 1.4.3 or more recent if you wish to use our baseline models for fine-tuning. The main repo is https://github.com/tensorflow/tensor2tensor - if you want to use our EWC implementation, check out https://github.com/DCSaunders/tensor2tensor/tree/ewc.v1.5
* Download OpenFST http://www.openfst.org/twiki/bin/view/FST/WebHome if you wish to generate translation lattices for lattice rescoring
* Download SGNMT https://github.com/ucam-smt/sgnmt if you wish to use our setup for decoding, including implementations for rescoring using OpenFST lattices
* Download sacrebleu https://github.com/mjpost/sacreBLEU/ for reproducible BLEU score evaluation
* Download WinoMT https://github.com/gabrielStanovsky/mt_gender for gender bias challenge set evaluation -->


## First set up all tools used in this notebook

In [0]:
# install miniconda
! wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh && bash Miniconda3-4.5.4-Linux-x86_64.sh -bfp /usr/local


In [0]:
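# Install tensor2tensor and its dependencies.
# Restart the runtime if prompted.
!conda create -y -n tf_env_main pip python=3.6
# Note: each `!` command runs in its own subshell, so this activation does not persist to later lines.
! source activate tf_env_main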
!pip install cloudpickle==1.2.0

!pip install -q -U tensor2tensor
# will require TF version 1.15.2 for tensor2tensor
# restart if prompted 
import sys
if 'google.colab' in sys.modules: # Colab-only TensorFlow version selector
  %tensorflow_version 1.15.2
import tensorflow as tf
import tensor2tensor
!pip install -q openfst-python
! pip install sacrebleu

Set up environment for SGNMT decoding

In [0]:
%%bash
conda install -y -q -c conda-forge openfst
conda create -y -n sgnmt_env pip python=3.6
conda install --name sgnmt_env -y -q -c conda-forge openfst
source activate sgnmt_env
git clone https://github.com/ucam-smt/sgnmt.git
pip install -q openfst-python
conda install --name sgnmt_env -y -q -f numpy pyyaml  scipy"==1.0.0"  tensorflow
conda install --name sgnmt_env -y -q -c conda-forge openfst

pip install tensor2tensor

FSTPATH=/usr/local/lib/python3.6/dist-packages/openfst_python
# export so that the python process started below sees the updated path
export PYTHONPATH=$FSTPATH:$PYTHONPATH
## verify
echo $PYTHONPATH
python sgnmt/decode.py --run_diagnostics


Check that installations behave correctly inside python

Set up gender bias evaluation requirements


In [0]:
%%bash
echo "git clone fast_align"
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir -p build
cd build
echo "cmake"
cmake ..
echo "make"
make
echo "test fast align"
/content/fast_align/build/fast_align


In [0]:
%%bash
git clone https://github.com/gabrielStanovsky/mt_gender.git
cd mt_gender
FAST_ALIGN=/content/fast_align/
./install.sh

The files and scripts needed to run this notebook are available from a shared Google Drive folder, which you will need to link to your own Google Drive as follows:

1. In a separate browser window, go to https://drive.google.com/drive/folders/1XaWbXQQ8icZZE__fPBOj4dC24oiA3Qtv?usp=sharing
1. Select `gender-debias-walkthrough`
1. Click on `Add shortcut to Drive`

You should now see a link to the folder `gender-debias-walkthrough` in `My Drive`.


Mount your Google Drive so that this folder is visible in this notebook:

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
! ls -d /content/drive/My\ Drive
! ls /content/drive/My\ Drive/gender-debias-walkthrough

# The gender bias problem

Let's say we want to train a neural network to translate from English to German. We can do this by showing the model lots of examples of real English sentences and their German translations.

But we have to get those example translations from somewhere. News reports, political speeches, Wikipedia and TED talks are common sources. Society being what it is, these tend to have more examples of men than of women.

In fact we can count the number of sentences in the English side of the dataset which contain masculine words like *he, him, his, man, men*, etc, and do the same for the equivalent feminine words. We find there are about twice as many masculine sentences.
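The count described above can be sketched in a few lines of Python. This is only an illustration: the word lists are short placeholders (the full lists behind the "about twice as many" figure are not reproduced here), and the four sample sentences are made up.

```python
import re

# Illustrative word lists only; the full lists used for the real count are not given here.
MASCULINE = {"he", "him", "his", "man", "men"}
FEMININE = {"she", "her", "hers", "woman", "women"}


def count_gendered_sentences(sentences):
    """Return (masculine, feminine) counts of sentences containing at least one listed word."""
    masc = fem = 0
    for sentence in sentences:
        # Match whole lowercase words so that e.g. "weather" does not count as "he".
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & MASCULINE:
            masc += 1
        if tokens & FEMININE:
            fem += 1
    return masc, fem


# Made-up sample standing in for the English side of a training corpus.
sample = [
    "He said the men were late.",
    "The doctor finished her work.",
    "His report was short.",
    "The weather was fine.",
]
masc_count, fem_count = count_gendered_sentences(sample)
print(masc_count, fem_count)
```

A sentence containing both masculine and feminine words is counted once on each side.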

German, like many languages, has grammatical gender. *The doctor* should be translated into German as *Der Arzt* if the doctor is male, *Die Ärztin* if the doctor is female. But because of data bias, machine translation models often get this wrong.



# Can we measure gender bias in translation?


Yes! [Stanovsky et al (2019)](https://www.aclweb.org/anthology/P19-1164.pdf)  recently developed the [WinoMT framework](https://github.com/gabrielStanovsky/mt_gender) for this purpose. WinoMT comes with a set of 3888 sentences, for example:

*The **physician** told the nurse that **he** had been busy.*

*The **physician** told the nurse that **she** had been busy.*



*   Each sentence has a primary entity -- *the physician* in these examples.
*  Each primary entity is coreferent with a pronoun -- *he* in the first example, *she* in the second.
* WinoMT is gender-balanced: there are the same total number of male and female physicians (and every other kind of primary entity)

If you translate these 3888 sentences from English to one of 8 supported languages with grammatical gender, WinoMT provides an automatic evaluation framework to see how much gender bias is in those translations:

*   It automatically aligns your translations with the English input to find your translated version of the primary entity
*   Using morphological analysis tools, WinoMT extracts the translated primary entity's grammatical gender
*   The objective is to achieve the correct grammatical gender for each sentence. WinoMT reports this **accuracy** as a percentage over all sentences.
*   Because WinoMT has the same number of male and female sentences, it can also report the difference in translation quality between them as **∆G**, the difference in [F1 score](https://en.wikipedia.org/wiki/F1_score) between sentences with male and female entities. If ∆G is above 0, the system gets the correct gender for male entities more frequently than for female entities.
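To make the two metrics concrete, here is a minimal sketch that computes accuracy and ∆G from lists of gold and predicted gender labels. It is not the WinoMT implementation: the function names are hypothetical, the labels are toy data, and reporting ∆G on a 0-100 scale is an assumption.

```python
def f1_for_class(gold, pred, label):
    """F1 score for one gender label, from parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def winomt_style_scores(gold, pred):
    """Return (accuracy, delta_g), both on a 0-100 scale (the scaling is an assumption)."""
    accuracy = 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)
    delta_g = 100.0 * (f1_for_class(gold, pred, "male") - f1_for_class(gold, pred, "female"))
    return accuracy, delta_g


# Toy labels: one female entity was translated with masculine gender.
gold = ["male", "male", "female", "female"]
pred = ["male", "male", "male", "female"]
accuracy, delta_g = winomt_style_scores(gold, pred)
print(accuracy, delta_g)
```

In this toy case the only error is a female entity rendered as male, so ∆G comes out positive.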

Stanovsky et al. report results on several commercial translation systems, like Google and Bing Translate. We can also translate the WinoMT set with our own baseline model and evaluate the output to see the same problem.





In [0]:
%%bash
BASEDIR='/content/drive/My Drive/gender-debias-walkthrough/'
WINOMT_PREP_SCRIPT="$BASEDIR/scripts/prep_for_winomt_eval.sh"
BASELINE_HYP="$BASEDIR/data_ende/winomt.ende.baseline-hyp.detok"
lang=de 
winomt_logs=winomtout
mkdir -p $winomt_logs
bash "$WINOMT_PREP_SCRIPT" "$BASELINE_HYP" $lang /content/mt_gender
cd /content/mt_gender/src
export FAST_ALIGN_BASE=/content/fast_align/
bash ../scripts/evaluate_all_languages.sh ../data/aggregates/en.txt $winomt_logs


# What can we do about gender bias in machine translation?
There's been increasing interest in reducing gender bias in language processing tasks in the last year or two. Proposed solutions include trying to [debias parts of the model before or during training](https://www.aclweb.org/anthology/W19-3821.pdf), or [producing multiple differently gendered translations in ambiguous cases](https://www.blog.google/products/translate/reducing-gender-bias-google-translate/). 

Our approach is based on domain adaptation: we want to adapt the model to a domain which has no gender bias, encouraging it to forget about any previously seen data bias.

# Adapting a model to balanced data

We can do this very quickly with a tiny dataset which we create in English and translate into target languages like German. The dataset contains sentence pairs like:


*The doctor finished his work. | Der Arzt beendete seine Arbeit.*

*The doctor finished her work. | Die Ärztin beendete ihre Arbeit.*

And so on for 194 professions, for a total of just 388 sentences.
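As a rough sketch, the English side of such a dataset could be generated from a template like this. The three professions are placeholders for the full list of 194, and the target-language side still has to be produced by translation, which this sketch does not attempt.

```python
# Template matching the example sentences above.
TEMPLATE = "The {profession} finished {pronoun} work."
PRONOUNS = ("his", "her")


def build_english_side(professions):
    """Return one masculine and one feminine sentence per profession."""
    return [
        TEMPLATE.format(profession=profession, pronoun=pronoun)
        for profession in professions
        for pronoun in PRONOUNS
    ]


# Hypothetical short list; the real dataset covers 194 professions.
professions = ["doctor", "nurse", "engineer"]
sentences = build_english_side(professions)
for sentence in sentences:
    print(sentence)
```

Because every profession appears exactly once with each pronoun, the resulting set is gender-balanced by construction.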

Since the adaptation dataset is very small, we only iterate over it a few times before convergence. 

In [0]:
%%bash
source activate tf_env_main

BASEDIR='/content/drive/My Drive/gender-debias-walkthrough/'
SRC_DIR="$BASEDIR/baseline_ende"
T2T_USR_DIR="$BASEDIR/t2t-usr"
T2T="/tensorflow-1.15.2/python3.6/tensor2tensor/bin/"
TF="/tensorflow-1.15.2/python3.6/bin"
DATA_DIR="$BASEDIR/data_ende/handcrafted_ende"

BASE_STEPS=300000
adapt_steps=4
# total step count: the baseline's steps plus a few adaptation steps
TRAIN_STEPS=$(( BASE_STEPS + adapt_steps ))
batch_size=4096

model_dir=model/ft/ende/
mkdir -p $model_dir

# make a local copy of the baseline model checkpoint
cp "$SRC_DIR"/model* $model_dir/
cp "$SRC_DIR"/checkpoint $model_dir/

export PYTHONPATH="$T2T:$T2T_USR_DIR:$TF:$PYTHONPATH" 
t2t-trainer \
 --data_dir="$DATA_DIR" \
 --problem=translate_generic_existing_vocab --hparams_set=transformer_base \
 --t2t_usr_dir="$T2T_USR_DIR" --output_dir=$model_dir --model=transformer \
 --schedule=train --train_steps=$TRAIN_STEPS --keep_checkpoint_max=1  \
 --hparams="batch_size=$batch_size" 


# Inference with the adapted model
Having fine-tuned the model on the balanced dataset, we can use it to translate the WinoMT test set again, using SGNMT to handle beam search decoding. We first create a config file which specifies what we want to decode and how we want to decode.

To decode the WinoMT test set:

In [0]:
%%bash

BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
SRC_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.src")
TRG_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.trg")
T2T_USR_DIR="$BASEDIR/t2t-usr"
SRC_TEST="$BASEDIR/data_ende/winomt.ende.src.ids"
model_dir=model/ft/ende/
decode_dir=decode_adapted_winomt
mkdir -p $decode_dir

config_file=$decode_dir/decode.ini
echo "verbosity: debug
predictors: t2t
src_test: "$SRC_TEST"
decoder: beam
beam: 4
t2t_model: transformer
t2t_hparams_set: transformer_base
t2t_problem: translate_generic_existing_vocab
pred_src_vocab_size: "$SRC_VOCAB_SIZE"
pred_trg_vocab_size: "$TRG_VOCAB_SIZE"
indexing_scheme: t2t
t2t_usr_dir: "$T2T_USR_DIR"
t2t_unk_id: 3
output_path: "$decode_dir"/output.ids
t2t_checkpoint_dir: "$model_dir"
outputs: text" > $config_file


Then we activate the decoding environment and run decoding.

Warning: the `--range=1:1` option decodes only the first sentence. Removing the range option will default to decoding the entire test set, which could take a long time.


In [0]:
%%bash 
source activate sgnmt_env
BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
T2T_USR_DIR="$BASEDIR/t2t-usr"
PYTHONPATH=$T2T_USR_DIR:$PYTHONPATH
echo $PYTHONPATH
python /content/sgnmt/decode.py --config_file=decode_adapted_winomt/decode.ini --range=1:1


Once decoding is finished and `$decode_dir/output.ids` exists, we convert the output to plaintext and evaluate it using WinoMT as before. (If the entire test set has not been decoded, WinoMT will throw an error about "not enough values to unpack".)

In [0]:
%%bash  
BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
CONVERT_SCRIPT="$BASEDIR"/scripts/convert_id_to_detok.sh
WINOMT_PREP_SCRIPT="$BASEDIR"/scripts/prep_for_winomt_eval.sh
lang=de
winomt_logs=winomtout

bash "$CONVERT_SCRIPT" $lang detok < decode_adapted_winomt/output.ids 1> decode_adapted_winomt/output.detok
bash "$WINOMT_PREP_SCRIPT" decode_adapted_winomt/output.detok $lang /content/mt_gender
cd mt_gender/src
export FAST_ALIGN_BASE=/content/fast_align/
bash ../scripts/evaluate_all_languages.sh ../data/aggregates/en.txt $winomt_logs


# The catastrophic forgetting problem

Unfortunately, a known downside of domain adaptation for neural models is that they tend to experience *catastrophic forgetting* of anything they've seen previously.

In a way, we're relying on this effect! We want the model to quickly forget the gender bias it has seen during baseline training. 

But the NMT model has also seen good, fluent examples of translation which let it translate well in general. We don't want forgetting to impact this ability.

We can assess forgetting by measuring translation performance in [BLEU points](https://en.wikipedia.org/wiki/BLEU) on standard translation test sets. 
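For intuition about what a BLEU score measures, here is a simplified corpus-level BLEU in plain Python: clipped n-gram precision up to 4-grams, combined by geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, a single reference per sentence and no smoothing, so its numbers will not match sacreBLEU, which remains the tool to use for reportable scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on whitespace tokens; one reference each, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


identical = simple_bleu(["Der Arzt beendete seine Arbeit ."], ["Der Arzt beendete seine Arbeit ."])
partial = simple_bleu(["a b c d e"], ["a b c d f"])
print(identical, partial)
```

An exact match scores 100, and the score falls as fewer n-grams of the hypothesis are found in the reference.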

First we have to translate the test set just as we did for WinoMT.

To decode the general test set:

In [0]:
%%bash

BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
SRC_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.src")
TRG_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.trg")
T2T_USR_DIR="$BASEDIR/t2t-usr"
SRC_TEST="$BASEDIR/data_ende/test.ende.src.ids"
model_dir=model/ft/ende/
decode_dir=decode_adapted
mkdir -p $decode_dir

config_file=$decode_dir/decode.ini
echo "verbosity: debug
predictors: t2t
src_test: "$SRC_TEST"
decoder: beam
beam: 4
t2t_model: transformer
t2t_hparams_set: transformer_base
t2t_problem: translate_generic_existing_vocab
pred_src_vocab_size: "$SRC_VOCAB_SIZE"
pred_trg_vocab_size: "$TRG_VOCAB_SIZE"
indexing_scheme: t2t
t2t_usr_dir: "$T2T_USR_DIR"
t2t_unk_id: 3
output_path: "$decode_dir"/output.ids
t2t_checkpoint_dir: "$model_dir"
outputs: text" > $config_file

In [0]:
%%bash 
source activate sgnmt_env
BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
T2T_USR_DIR="$BASEDIR/t2t-usr"
PYTHONPATH=$T2T_USR_DIR:$PYTHONPATH
python /content/sgnmt/decode.py --config_file=decode_adapted/decode.ini --range=1:1


When the test set is translated we evaluate it automatically against reference translations using a tool called sacreBLEU.

In [0]:
%%bash  
BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
REF="$BASEDIR"/data_ende/test.ende.trg.detok 
CONVERT_SCRIPT="$BASEDIR"/scripts/convert_id_to_detok.sh
lang=de
bash "$CONVERT_SCRIPT" $lang detok < decode_adapted/output.ids 1> decode_adapted/output.detok
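## Only the first sentence was decoded above (--range=1:1), so we score against the first reference line only.
## If you have decoded the entire test set, pass "$REF" to sacrebleu instead of ref.line1.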
head -1 "$REF" > ref.line1
sacrebleu ref.line1 < decode_adapted/output.detok


If you decode and evaluate the test set with both the baseline and adapted models, you'll see that there's a noticeable drop in translation performance as measured by BLEU.

One way around this is to constrain the adapted model to only produce gender-inflected versions of the baseline model's translation. This is a form of lattice rescoring. Essentially, it should ensure the adapted model can only deviate from the baseline translation to produce different gender inflections.

# Lattice rescoring: "Correcting" gender bias in translations

Our alternative solution doesn't try to avoid forgetting when adapting the model. Instead, we fully embrace the forgetting, which gives good improvements on gender debiasing. But during decoding, we try to keep the translation close to a previously-generated translation (which may contain gender-bias-related mistakes).

In fact, we limit the translation model to produce only gender-inflected versions of the original translation.

For example our baseline model translates the English sentence in WinoMT:

*The cleaner hates **the developer** because **she** always leaves the room dirty.*

to the German

*Der Reiniger haßt **den Entwickler**, weil **er** den Raum immer schmutzig lässt.*

Both bolded parts in the German sentence are (incorrectly) masculine forms. 

But we can use automatic tools with giant vocabulary lists to find that in German, *Entwickler* has an alternate inflected form *Entwicklerin*, and the masculine pronoun *er* has an equivalent feminine inflection *sie*. 

Of course, we don't know which words in the translation might be wrongly inflected, so we go through this process and find alternate inflections for every word in every translated sentence. We represent this series of alternate inflections as a lattice using OpenFST tools.
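The idea of a lattice of alternate inflections can be sketched at word level as follows. The helper name and the tiny inflection table are hypothetical, and the output only imitates the arc-per-line text layout that `fstcompile` reads (source state, destination state, input label, output label, then a final-state line). The scripts in this notebook work on BPE subword units and build the lattice by FST composition, so treat this as an illustration of the structure rather than a replacement.

```python
def sentence_to_lattice_arcs(tokens, inflections):
    """Return text-format arcs 'src dst ilabel olabel' plus a final-state line."""
    lines = []
    for state, token in enumerate(tokens):
        # The original word is always allowed; known alternate inflections are added.
        options = [token] + [alt for alt in inflections.get(token, []) if alt != token]
        for option in options:
            lines.append(f"{state} {state + 1} {option} {option}")
    lines.append(str(len(tokens)))  # the last state is the only final state
    return lines


# Tiny hypothetical inflection table using the two pairs mentioned above.
inflections = {"Entwickler": ["Entwicklerin"], "er": ["sie"]}
tokens = ["Entwickler", ",", "weil", "er"]
arcs = sentence_to_lattice_arcs(tokens, inflections)
print("\n".join(arcs))
```

Each position with an alternate inflection gets a parallel arc, so this fragment encodes 2 × 2 = 4 possible word sequences.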


In [0]:
%%bash  
BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
INFLECTED_VOCAB_MAPPINGS="$BASEDIR/data_ende/inflections-de"
FST_SYMS="$BASEDIR/data_shared/fst_syms"
APPLY_BPE_SCRIPT="$BASEDIR/scripts/apply_bpe.sh"
SENTENCE_TO_FST="$BASEDIR/scripts/sentence_to_fst.py"
INFLECTIONS_TO_FST="$BASEDIR/scripts/multi_options_to_fst.py"
CONVERT_SCRIPT="$BASEDIR"/scripts/convert_id_to_detok.sh
lang=de
bash "$CONVERT_SCRIPT" $lang < decode_adapted_winomt/output.ids 1> decode_adapted_winomt/output.tok

plaintext_hyps=decode_adapted_winomt/output.tok
sentence_count=$(wc -l < "$plaintext_hyps")

lattice_dir=winomt_lattices.de
mkdir -p $lattice_dir
for i in $(seq $sentence_count); do
    tmpfsttext=$lattice_dir/$i.tmp.fst
    awk "NR==$i" $plaintext_hyps | bash "$APPLY_BPE_SCRIPT" $lang | python "$SENTENCE_TO_FST" \
    | fstcompile --isymbols="$FST_SYMS"  --osymbols="$FST_SYMS" > $tmpfsttext;
    awk "NR==$i" $plaintext_hyps |  tr ' ' '\n' | sed 's/^/ /g' | sort -u \
    | fgrep -f - "$INFLECTED_VOCAB_MAPPINGS" | bash "$APPLY_BPE_SCRIPT" $lang | sed 's/ 1 / | /g' \
    | python "$INFLECTIONS_TO_FST" | fstcompile --isymbols="$FST_SYMS" --osymbols="$FST_SYMS" | fstcompose $tmpfsttext - |  fstminimize --allow_nondet=true \
    > $lattice_dir/$i.fst
    rm $tmpfsttext
done


When creating a config file for SGNMT decoding, we can now instruct the decoder to constrain search to the provided lattices using the "predictors" and "fst_path" options.

In [0]:
%%bash

BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
SRC_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.src")
TRG_VOCAB_SIZE=$(wc -l < "$BASEDIR/data_ende/handcrafted_ende/vocab.trg")
T2T_USR_DIR="$BASEDIR/t2t-usr"
SRC_TEST="$BASEDIR/data_ende/winomt.ende.src.ids"
lattice_dir=winomt_lattices.de
model_dir=model/ft/ende/
decode_dir=decode_adapted_winomt.lattices
mkdir -p $decode_dir

config_file=$decode_dir/decode.ini
echo "verbosity: debug
predictors: t2t,nfst
src_test: "$SRC_TEST"
decoder: beam
beam: 4
t2t_model: transformer
t2t_hparams_set: transformer_base
t2t_problem: translate_generic_existing_vocab
pred_src_vocab_size: "$SRC_VOCAB_SIZE"
pred_trg_vocab_size: "$TRG_VOCAB_SIZE"
indexing_scheme: t2t
t2t_usr_dir: "$T2T_USR_DIR"
t2t_unk_id: 3
output_path: "$decode_dir"/output.ids
fst_path: "$lattice_dir/%d.fst"
t2t_checkpoint_dir: "$model_dir"
outputs: text" > $config_file


We can now decode with the debiased model, constraining it to produce only words in the lattice. Effectively it can only produce alternately-inflected forms of the original translation. Because the original, fluent baseline model gets control of the words in the translation, and the debiased model only changes inflections, we would hope that translation quality doesn't change much.
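Conceptually, constrained decoding picks the best-scoring path through the lattice. The toy sketch below enumerates every path of a tiny word-level lattice and keeps the one a scoring function prefers. Both the exhaustive search and `toy_score` are stand-ins chosen for illustration: SGNMT combines the model's scores with the lattice inside beam search, and the real score comes from the adapted NMT model.

```python
from itertools import product


def rescore_lattice(options_per_position, score_fn):
    """Return the highest-scoring path; exhaustive search, feasible only for tiny lattices."""
    return max(product(*options_per_position), key=score_fn)


def toy_score(path):
    # Hypothetical stand-in for the adapted model's score: it simply prefers
    # the feminine forms, as a debiased model might given the source pronoun "she".
    return sum(1.0 for word in path if word in {"Entwicklerin", "sie"})


# One list of allowed words per position of the baseline translation fragment.
lattice = [["Entwickler", "Entwicklerin"], [","], ["weil"], ["er", "sie"]]
best = rescore_lattice(lattice, toy_score)
print(" ".join(best))
```

Positions without alternate inflections have a single option, so the rescoring model cannot change those words at all.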

In [0]:
%%bash 
source activate sgnmt_env

FSTPATH=/usr/local/lib/python3.6/dist-packages/openfst_python

BASEDIR=/content/drive/My\ Drive/gender-debias-walkthrough
T2T_USR_DIR="$BASEDIR/t2t-usr"
PYTHONPATH="$FSTPATH:$T2T_USR_DIR:$PYTHONPATH"

python /content/sgnmt/decode.py --config_file=decode_adapted_winomt.lattices/decode.ini --range=1:1


If we do this to the general test set, we shouldn't see much change in BLEU. If we do this to the WinoMT test set, we see almost as much reduction in gender bias as for the adapted model.

One particularly good feature of lattice rescoring is that we don't actually need the baseline model at all - just its translation hypotheses. We can therefore also apply this lattice-rescoring method to the output of commercial translation systems as collected by Stanovsky et al, and reduce gender bias in those as well.

# Elastic Weight Consolidation: balancing gender debiasing and translation fluency

An alternative way to deal with the catastrophic forgetting problem is [Elastic Weight Consolidation (EWC)](https://arxiv.org/abs/1612.00796). 

The general idea of EWC is to estimate how important different parameters in the neural network are to a task (here, general translation ability).

Then, when adapting the model to a new task (gender debiasing) we just apply a larger penalty to changing a parameter if the parameter was important for the previous task.
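The penalty can be written down in a few lines. This sketch follows the quadratic form from the EWC paper, with the importance of each parameter given by a Fisher information estimate. The function and argument names are hypothetical, and the exact scaling used in the forked Tensor2Tensor code is not checked here.

```python
def ewc_loss(task_loss, params, old_params, fisher, ewc_loss_weight):
    """New-task loss plus a quadratic penalty for moving important parameters.

    params:     current parameter values during adaptation
    old_params: parameter values after training on the original task
    fisher:     per-parameter importance estimates for the original task
    """
    penalty = sum(
        importance * (new - old) ** 2
        for new, old, importance in zip(params, old_params, fisher)
    )
    return task_loss + 0.5 * ewc_loss_weight * penalty


# Toy numbers: the second parameter is far more important to the original task,
# so moving it by 0.5 costs more than moving the first parameter by 1.0.
loss = ewc_loss(
    task_loss=0.5,
    params=[1.0, 2.0],
    old_params=[0.0, 2.5],
    fisher=[0.1, 4.0],
    ewc_loss_weight=2.0,
)
print(loss)
```

With a weight of zero the penalty vanishes and adaptation reduces to plain fine-tuning, while a very large weight keeps the model close to the baseline.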

EWC is not implemented in main Tensor2Tensor, so we first set up an environment for an older forked version with EWC implemented.

In [0]:
%%bash
# make local copy of env file so conda doesn't attempt to write to drive
cp "/content/drive/My Drive/gender-debias-walkthrough/data_shared/env.yml" /content/env.yml
conda create -y -n tf_env_ewc pip 
source activate tf_env_ewc
conda env update --file /content/env.yml   -n tf_env_ewc
git clone --single-branch -b dsaunders_v1.4.3_modified-ewc  https://github.com/DCSaunders/tensor2tensor.git


Using this version of T2T, we run adaptation to the same handcrafted set, now applying a loss penalty based on the pre-saved EWC Fisher variables:

In [0]:
%%bash 
tmppythonpath=$PYTHONPATH
source activate tf_env_ewc
BASEDIR='/content/drive/My Drive/gender-debias-walkthrough/'
SRC_DIR="$BASEDIR/baseline_ende"
EWC_VARS="$BASEDIR/baseline_ende/ewc_vars"

DATA_DIR="$BASEDIR/data_ende/handcrafted_ende"

BASE_STEPS=300000
adapt_steps=4
# total step count: the baseline's steps plus a few adaptation steps
TRAIN_STEPS=$(( BASE_STEPS + adapt_steps ))
batch_size=4096
ewc_loss_weight=10000000
T2T="/content/tensor2tensor/"
export PYTHONPATH="$T2T:$PYTHONPATH" 

model_dir=model/ft/ende_ewc/
mkdir -p $model_dir

# make a local copy of the model
cp "$SRC_DIR"/model* $model_dir/
cp "$SRC_DIR"/checkpoint $model_dir/
# link the pre-saved EWC variables into the model directory
ln -s "$EWC_VARS" "$model_dir/ewc_vars"

python $T2T/tensor2tensor/bin/t2t_trainer.py \
 --data_dir="$DATA_DIR" \
 --problems=translate_generic_existing_vocab --hparams_set=transformer_base \
 --output_dir=$model_dir --model=transformer \
 --schedule=train --train_steps=$TRAIN_STEPS --keep_checkpoint_max=1  \
 --hparams="batch_size=$batch_size,ewc_load_vars=True,ewc_loss_weight=$ewc_loss_weight"
# restore the original PYTHONPATH
export PYTHONPATH=$tmppythonpath


Although EWC reduces catastrophic forgetting, it also results in some trade-off between reduced gender bias and general translation ability. It also cannot be applied to black-box translations like lattice rescoring can. However, it does involve only a single model and decoding pass.