
Mitigating Gender Bias in Word Embeddings Via Corpus Augmentation and Vector Space Debiasing

Project Overview

Multiple methods have been explored with the goal of mitigating gender bias in natural language processing. Two of the most prominent tackle the problem from different perspectives: debiasing the training data, by duplicating the corpus and swapping gendered pronouns with their counterparts, and debiasing the word embeddings themselves, by neutralizing the gender direction within individual embedding vectors. We took inspiration from Bolukbasi et al.'s 2016 paper for the vector-space approach and from Zhao et al. for the corpus-augmentation approach.
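To make the two approaches concrete, here is a minimal Python sketch of each. It is illustrative only: the pair list and the names swap_gendered_words and neutralize are hypothetical simplifications, not the project's actual helpers.

import numpy as np

# Corpus augmentation: emit a swapped copy of each sentence so the corpus
# contains both gendered variants. This pair list is a hypothetical
# simplification; possessives such as "her" -> "him"/"his" are ambiguous
# and need extra handling in practice.
PAIRS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "man": "woman", "woman": "man"}

def swap_gendered_words(tokens):
    return [PAIRS.get(t, t) for t in tokens]

print(swap_gendered_words(["she", "handed", "him", "the", "book"]))
# -> ['he', 'handed', 'her', 'the', 'book']

# Vector-space debiasing: subtract a vector's projection onto a gender
# direction g (here approximated as the difference he - she).
def neutralize(v, g):
    g = g / np.linalg.norm(g)
    return v - np.dot(v, g) * g

he, she, doctor = np.random.rand(50), np.random.rand(50), np.random.rand(50)
debiased = neutralize(doctor, he - she)
print(np.dot(debiased, (he - she) / np.linalg.norm(he - she)))  # ~0.0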

The corpus, LitBank, is stored in the novels directory, with each novel saved as a .txt file. The data directory contains the lists of adjectives we use to evaluate model performance, as well as dictionaries and lists of gendered word pairs from Bolukbasi et al. that are used when calling their methods in we.py and debias.py. The base models are generated by produce_models.py and stored as binary files in the project root. In test_models.py, those binaries are loaded as Word2Vec objects. That file can also create the word-embedding lists from the models, but the embeddings are stored as .txt files as well to save compute time. Results generated by test_models.py are stored as CSV files in the project root, along with further analysis files produced in Excel.
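As a point of reference, here is a minimal sketch of loading one of those binaries with gensim and querying an embedding, as test_models.py does. The filename base_model.bin is a hypothetical stand-in for whichever binary produce_models.py wrote.

from gensim.models import Word2Vec

model = Word2Vec.load("base_model.bin")         # hypothetical filename
vector = model.wv["doctor"]                     # the embedding for one word
print(model.wv.most_similar("doctor", topn=5))  # its nearest neighbours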

Running the project

We recommend first setting up a virtual environment with venv or conda. Clone the project to your local machine, navigate to the project directory, and activate your environment. Then run:

$ pip install -r requirements.txt

If that doesn't work, try:

$ conda install --file requirements.txt

You should now be able to run any of the files in the project. test_models.py is the main entry point: it loads the models, builds the embedding lists, and writes the evaluation results.
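A typical session might then look like this (a sketch assuming the scripts take no command-line arguments):

$ python produce_models.py   # train the base models and write the binaries
$ python test_models.py      # load the binaries, evaluate, and write the CSVs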

Limitations

Some of the models reduce performance on useful gendered associations, which can affect downstream NLP tasks for anyone building on our system. In addition, two of the models rely on a corpus-augmentation technique that nearly doubles the corpus size, and of those two, the one that does not also incorporate vector-space debiasing does not show the most sizable improvement over the baseline model. On a larger training corpus, the gains may therefore be negligible relative to the added training time.
