Mitigating Gender Bias in Word Embeddings Via Corpus Augmentation and Vector Space Debiasing

Project Overview

Multiple methods have been explored with the goal of mitigating gender bias in the domain of natural language processing. Two of the most prominent tackle the problem from different perspectives: debiasing training data and debiasing word embeddings. We took inspiration from Bolukbasi et al.'s 2016 paper as well as from Zhao et al.

The corpus, LitBank, is stored in the novels directory, with each novel saved as a .txt file. The data directory contains the lists of adjectives we use to evaluate the performance of the models, as well as dictionaries and lists from Bolukbasi containing gendered word-pairs that are used when calling their methods in we.py and debias.py. The base models are generated from produce_models.py, and are stored as binary files in the project root. In test_models.py, the binaries are loaded as Word2Vec objects. There is functionality in that file to create the word embedding lists from those models, but those embeddings are also stored as .txt files to save on compute time. Results generated from test_models.py are stored as CSV files in the project root, as well as further analysis files that have been done in Excel.

Running the project

We recommend first setting up a virtual environment through pip or through conda. Clone the project to your local machine, navigate to the project directory, and activate your environment. Once that is done, run:

$ pip install -r requirements.txt

If that doesn't work, try:

$ conda install -r requirements.txt

You should now be able to run any of the files from the project. test_models.py is where a lot of the magic happens.

Limitations

Some of the models do reduce the performance on useful gendered associations, something that can affect other downstream NLP tasks if one chooses to build off of our system. Secondly, two of the models make use of a technique that nearly doubles the corpus size. One of these that does not incorporate vector space debiasing does not show the most sizable improvement over the baseline model. Thus, if one chooses a larger corpus to train on, the gains against training time could be negligible or not worth it.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
__pycache__		__pycache__
data		data
novels		novels
.gitattributes		.gitattributes
README.md		README.md
adjective_analysis.xls		adjective_analysis.xls
adjective_results.csv		adjective_results.csv
adjective_results_v2.csv		adjective_results_v2.csv
debias.py		debias.py
gendered_words_results.csv		gendered_words_results.csv
occupation_results.csv		occupation_results.csv
occupation_results_analysis.xls		occupation_results_analysis.xls
overall_results.xlsx		overall_results.xlsx
produce_models.py		produce_models.py
requirements.txt		requirements.txt
test_models.py		test_models.py
vanilla_embeddings.txt		vanilla_embeddings.txt
vanilla_model.bin		vanilla_model.bin
vanilla_model.bin.trainables.syn1neg.npy		vanilla_model.bin.trainables.syn1neg.npy
vanilla_model.bin.wv.vectors.npy		vanilla_model.bin.wv.vectors.npy
we.py		we.py
wino_embeddings.txt		wino_embeddings.txt
wino_model.bin		wino_model.bin
wino_model.bin.trainables.syn1neg.npy		wino_model.bin.trainables.syn1neg.npy
wino_model.bin.wv.vectors.npy		wino_model.bin.wv.vectors.npy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mitigating Gender Bias in Word Embeddings Via Corpus Augmentation and Vector Space Debiasing

Project Overview

Running the project

Limitations

About

Releases

Packages

Contributors 2

Languages

Joey-Rose/nlp_final_project

Folders and files

Latest commit

History

Repository files navigation

Mitigating Gender Bias in Word Embeddings Via Corpus Augmentation and Vector Space Debiasing

Project Overview

Running the project

Limitations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages