<img src="../../Img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>  

# Word Embeddings

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Understand word embeddings, and how they are used in information retrieval.
* Use `gensim`'s `word2vec` method to create word vectors for a corpus.
* Use word embeddings to calculate word similarity.
* Use word embeddings to reflect on implicit biases in your data.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

### Sections
1. [Word Embeddings](#we)
2. [Using Comments Data](#comments)
3. [Constructing a Word2Vec Model](#w2v)
4. [Word Similarity](#sim)
5. [Exploring Language Biases With Word Embeddings](#bias)


<a id='we'></a>

# Word Embeddings

The goal of word embedding models is to learn **numerical representations** of text corpora. Like with TF-IDF, it's all about how we construct that numerical representation.

When using word embeddings, we look for a **vector** of numbers to represent each term. The actual numbers themselves won't be meaningful to us as humans. However, if successful, the vectors for each term should encode information about the meaning or concept the term represents, as well as the relationship between it and other terms in the vocabulary.

In this notebook, we'll work with word embeddings using Word2Vec.

## Word2Vec

**Word2Vec** is one example of a word embeddings model. It learns by taking words and their contexts (e.g. sentences) into account, and can then try to predict other words. Given enough data, usage and contexts, Word2Vec can make accurate guesses about a word’s meaning based on its appearances. Those guesses can be used to establish a word’s association with other words (e.g. "Paris" is to "France" as “Berlin” is to “Germany”), or cluster documents and classify them by topic.

<img src="../../img/we.png" alt="Word Embeddings" width="600"/>

Word vector models such as Word2Vec are fully **unsupervised**: they learn all of these meanings and relationships without any advance knowledge. Unsupervised learning requires the specification of a right task. We won't go into detail in this lesson, but you can roughly think of the task as predicting nearby words, given a specific word. Read [this post](https://tomvannuenen.medium.com/analyzing-reddit-communities-with-python-part-6-word-embeddings-f92bba876d60) if you want a deeper introduction.

We will be using the `gensim` package, which offers Word2Vec.

In [1]:
# Package imports
import os
import pandas as pd
import numpy as np

import pickle
from gensim.models import Word2Vec
import multiprocessing

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

<a id='comments'></a>

# Using Comments Data

As we will be considering the language biases in the next notebook, we will use the **comments** of the r/amitheasshole subreddit this time. The thinking behind this is that this data will be derived from more people, and include more evaluative statements. After all, comments on r/amitheasshole generally evaluate the original posts.

💭 **Reflection**: Whether you want to be using submissions or comments for your own data depends on the question you are asking of that data! Make sure to think about this and discuss it with teammates.

To save us some time, we have done the preprocessing for you and saved the trigrams (a list of lists) in a pickle file.

In [2]:
with open('../../data/aita_comment_trigrams.pickle', 'rb') as f:
    # Load the object from the file
    comments = pickle.load(f)

In [3]:
# Split into lists of words for Word2Vec
comment_list = [comment.split() for comment in comments]
comment_list[0]

['like',
 'genuine',
 'reason',
 'upset',
 'question',
 'sound',
 'people',
 'ask',
 'curious',
 'direct',
 'question',
 'superior']

**Use the `3_Preprocessing_Project.ipynb` notebook if you want to preprocess your own comment data.**

**Use the `6_Word_Embeddings_Project.ipynb` notebook to run the Word2Vec operations explained in this notebook on your own data.**

<a id='w2v'></a>

# Constructing a Word2Vec Model

Let's create our word embeddings model. The input to this model is our corpus split up into words. The model's output is a set of "vectors" (one for each word) in $N$ dimensions (we choose $N$ before creating the model). Think of these vectors as "features", capturing latent meaning.

This model allows us to group the vectors of similar words together in vector space. We can then reduce the dimensionality to visualize the results in a way humans can understand (such as in a 2-dimensional space), or to perform linear algebra operations in order to find out to what extent words are related.

We now instantiate and train our Word2Vec model, using the parameters below.

In [4]:
cores = multiprocessing.cpu_count() # Number of cores at your disposal

n_features = 300     # Word vector dimensionality (how many features each word will be given)
min_word_count = 10  # Minimum word count to be taken into account
n_workers = cores    # Number of threads to run in parallel (equal to your amount of cores)
window = 5           # Context window size
downsampling = 1e-2  # Downsample setting for frequent words
seed = 1             # Seed for the random number generator (to create reproducible results)
sg = 1               # Skip-gram = 1, CBOW = 0
epochs = 20          # Number of iterations over the corpus

model = Word2Vec(
    sentences=comment_list,
    workers=n_workers,
    vector_size=n_features,
    min_count=min_word_count,
    window=window,
    sample=downsampling,
    seed=seed,
    sg=sg)

That was it! We have a Word Embeddings model now. Let's save it so that we don't have to train it again. Then, we can reload the embeddings so that we don't have to train it every single time:

In [5]:
model.save('../../data/aita_comments.emb')

In [6]:
model = Word2Vec.load('../../data/aita_comments.emb')

How many terms are in our vocabulary? Whenever interacting with the word vector dictionary, we use the `wv` attribute:

In [7]:
len(model.wv)

14732

Let's take a peek at the word vectors our model has learned. We can take a look at the individual words using the `index_to_key` attribute, and the word vectors themselves can be accessed with the `vectors` attribute:

In [8]:
model.wv.index_to_key[0]

'like'

In [9]:
model.wv.vectors[0]

array([-1.40707716e-01,  1.51243582e-01,  2.25934219e-02,  1.15743585e-01,
       -5.27389348e-02, -2.72344798e-02, -1.37464076e-01,  2.68600911e-01,
        7.35332593e-02,  3.62114375e-03,  6.44749543e-03,  2.05808267e-01,
       -1.41503572e-01, -2.80661583e-02, -1.54956356e-01,  1.07976114e-02,
        1.39130965e-01, -9.44793411e-03,  4.22123149e-02, -3.84143405e-02,
       -1.06682619e-02, -1.45837255e-02,  5.92960902e-02,  9.44894031e-02,
        1.34901136e-01,  1.43331466e-02, -4.92405444e-02,  5.14439233e-02,
        5.42582385e-02, -1.73705202e-02, -2.01079890e-01,  1.61337771e-03,
        1.68650270e-01,  2.37780899e-01,  6.43990263e-02, -9.23566371e-02,
        8.08502436e-02, -2.23057181e-01, -2.40070298e-01, -1.80703383e-02,
       -7.79113919e-02, -9.73377973e-02, -1.46270871e-01, -1.40510942e-03,
       -8.05554688e-02,  4.21688929e-02, -9.49909836e-02, -8.40258896e-02,
       -2.64396876e-01,  4.29840013e-02, -7.12746680e-02,  9.26958919e-02,
       -6.12078272e-02, -

Looking at it won't make a whole lot of sense to us! It's just a bunch of numbers. However, we can do semantic operations on these vectors, such as getting related terms.

<a id='sim'></a>

# Word Similarity

With the information in our word embeddings model, we can try to find similarities between words that interest us (i.e. words that have a similar vector). Let's create a function that retrieves related terms to some input. We're going to use the `most_similar()` function in `gensim` as part of this helper function.

In [10]:
def get_most_similar_terms(model, token, topn=20):
    """Look up the top N most similar terms to the token."""
    for word, similarity in model.wv.most_similar(positive=[token], topn=topn):
        print(f"{word}: {round(similarity, 3)}")

In [11]:
get_most_similar_terms(model, 'asshole')

redflag: 0.638
resounding: 0.619
asshole-: 0.619
assholeishness: 0.616
esh.i: 0.614
tah: 0.613
assholeness: 0.607
justice_boner: 0.601
fuckwad: 0.6
dickwad: 0.598
esh.you're: 0.597
nta.edit: 0.593
nahyou’re: 0.59
arsehole: 0.589
yta.not: 0.586
veer: 0.586
ness: 0.586
irredeemable: 0.583
ytayes: 0.582
ytayou’re: 0.582


Here are some other terms. What else interests you?

In [12]:
get_most_similar_terms(model, 'empathy')

compassion: 0.68
lacking: 0.584
humility: 0.577
rationality: 0.566
self_awareness: 0.543
lack_empathy: 0.537
modicum: 0.533
patience: 0.528
devoid: 0.522
self_reflection: 0.516
tact: 0.507
empathic: 0.506
awareness: 0.506
obliviousness: 0.505
helplessness: 0.503
empathetic: 0.501
cold_hearted: 0.5
critical_thinking: 0.5
coping_skill: 0.498
sympathy: 0.494


In [13]:
get_most_similar_terms(model, 'relationship')

relationships: 0.604
irrevocably: 0.593
transparency: 0.59
exclusivity: 0.583
polyamory: 0.582
friendship: 0.578
marriage: 0.577
amicably: 0.569
openness: 0.568
coparent: 0.568
reciprocal: 0.56
communicative: 0.559
polyamorous: 0.559
sexually_compatible: 0.556
infatuate: 0.554
monogamy: 0.554
rekindle: 0.553
unaddressed: 0.553
forthright: 0.552
ltr: 0.55


In [14]:
get_most_similar_terms(model, 'power')

flex: 0.463
power_imbalance: 0.433
tripping: 0.428
wield: 0.423
intimidation: 0.416
differential: 0.413
seniority: 0.411
dominance: 0.394
tactic: 0.39
institutional: 0.388
ideological: 0.385
authority: 0.382
deity: 0.382
dictatorship: 0.375
inequity: 0.371
enrich: 0.371
perpetual: 0.371
underhanded: 0.37
invaluable: 0.37
submissive: 0.37


In [15]:
get_most_similar_terms(model, 'man')

woman: 0.649
men: 0.62
subservient: 0.554
butch: 0.553
macho: 0.541
cisgender: 0.539
femme: 0.538
misandry: 0.536
transman: 0.535
masculine: 0.528
ogle: 0.526
shouldn't: 0.525
hes: 0.525
patriarchy: 0.52
women: 0.52
dude: 0.519
het: 0.516
androgynous: 0.515
tomboy: 0.515
ntabee: 0.514


In [16]:
get_most_similar_terms(model, 'woman')

man: 0.649
men: 0.599
women: 0.594
transwoman: 0.585
androgynous: 0.584
transman: 0.582
womanhood: 0.567
butch: 0.565
cisgender: 0.562
misandry: 0.55
ogle: 0.545
transpeople: 0.54
monolith: 0.536
femme: 0.535
shouldn't: 0.535
oggle: 0.53
conventionally: 0.528
fetishization: 0.527
middle_aged: 0.522
womens: 0.522


# Visualizing High Dimensional Spaces with $t$-SNE

The word embeddings we created are what's called a **high-dimensional representation** of the text. That is, we take a word in the corpus, and represent it using, in this case, 300 numbers. We can plot 3 numbers at a time - that's in 3 dimensions - but there's no way for humans to visualize something in a 300-dimensional space. 

So, **dimensionality reduction** is a big part of machine learning. How can we take vectors that are 300-dimensional, and visualize them in 2-dimensions, while keeping the structure between vectors the same? How can we reduce the dimensionality?

One of the most popular methods for dimensionality reduction is called $t$-SNE ($t$-Distributed Stochastic Neighbor Embedding). If you want to read more about this algorithm, [here](https://lvdmaaten.github.io/tsne/) is a good starting point. 

Roughly, $t$-SNE tries to keep the relative distances between points as closely as possible in both high-dimensional and low-dimensional space. We can thus visualize our embeddings, which may reveal semantic and syntactic trends in the data.

One thing we can do is to look for relationships between particular word pairs. 

In [17]:
words = ['asshole', 'jerk', 'rude', 'inconsiderate',
         'angry', 'upset', 'sad', 'happy', 'frustrated',
         'friend', 'family', 'partner', 'coworker', 'neighbor',
         'apologize', 'forgive', 'ignore', 'confront', 'compromise',
         'right', 'wrong', 'justified', 'unjustified', 'fair', 'unfair',
         'honesty', 'loyalty', 'respect', 'kindness', 'empathy']

# Extract the word vectors
word_vectors = np.array([model.wv[word] for word in words])

In [18]:
# If you get an ImportError in the line tsne=TSNE(), you might need to install scikit-learn:
# %pip install -U scikit-learn 

In [19]:
# Reduce dimensionality using t-SNE
tsne = TSNE(n_components=2, random_state=2, perplexity=2)
reduced_vectors = tsne.fit_transform(word_vectors)

In [20]:
# Store the t-SNE vectors
words_df = pd.DataFrame(reduced_vectors,
                            index=pd.Index([word for word in words]),
                            columns=['x', 'y'])

In [25]:
from bokeh.plotting import ColumnDataSource, figure, output_file, show
from bokeh.models import HoverTool

# Add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(words_df)

# Create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings')

# Add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index'))

# Draw the words as circles on the plot
tsne_plot.circle('x', 'y',
                 source=plot_data,
                 color='blue',
                 size=10,
                 hover_line_color='black')

# Add labels to the points
labels = LabelSet(x='x', y='y', text='index', level='glyph',
                  x_offset=5, y_offset=5, source=plot_data,
                  render_mode='canvas')
tsne_plot.add_layout(labels)

# Engage!
show(tsne_plot)

Now let's use $t$-SNE to take **all** the word vectors.

In [26]:
tsne = TSNE(init='pca', learning_rate='auto')
X_tsne = tsne.fit_transform(model.wv.vectors)

We have our low dimensional representation. Now, let's store the 2 dimensions in a dataframe, with the word as the index:

In [27]:
# Store the t-SNE vectors
tsne_df = pd.DataFrame(X_tsne,
                            index=pd.Index(model.wv.index_to_key),
                            columns=['x', 'y'])

In [28]:
# Create some filepaths to save our model
tsne_path = '../../data/tsne_model'
tsne_df_path = '../../data/tsne_df.pkl'

In [29]:
# Save to disk
with open(tsne_path, 'wb') as f:
    pickle.dump(X_tsne, f)

tsne_df.to_pickle(tsne_df_path)

Here's a convenient code block to load this data, to start from this point:

In [30]:
with open(tsne_path, 'rb') as f:
    X_tsne = pickle.load(f)
    
tsne_df = pd.read_pickle(tsne_df_path)

We're going to visualize the 2-dimensional space using a package called `bokeh`. This package allows for some degree of interactivity: we can go over each point and dynamically get information about the word denoting that vector.

In [31]:
import bokeh
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()
bokeh.io.output_notebook()

In [32]:
# Add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_df)

# Create the plot and configure the title, dimensions, and tools
tsne_plot = figure(title='t-SNE Word Embeddings')

# Add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@index') )

# Draw the words as circles on the plot
tsne_plot.circle('x', 'y',
                 source=plot_data,
                 color='blue',
                 line_alpha=0.2,
                 fill_alpha=0.1,
                 size=10,
                 hover_line_color='black')

# Engage!
show(tsne_plot)

<a id='bias'></a>

# Discovering Language Bias With Word Embeddings

The vector representations of word Embeddings can be used for all kinds of tasks. One such task is discovering **language bias**: language that is normatively oriented towards certain people, ideas, or things. 

In order to do this, we first need to postulate **concepts** that we suspect a bias towards. For example, people often associate words with masculinity or femininity (think of "fireman" vs "nurse"). This reflects both a normative construction of a gender binary, as well as stereotypes *within* that binary. We can attempt to quantify this directly using notions of distance with word embeddings.

We first define **target concepts** for the terms we aim to identify biases towards (e.g., "male" and "female"). Then, we can then compute relative similarities of other word vectors – particularly, words that act as evaluative attributes such as "strong" and "sensitive". These words can be categorised through clustering algorithms and labeled through a semantic analysis system into more general (conceptual) biases, yielding a broad picture of the biases present in a discourse community.

At the bottom of this notebook you'll find some [existing target sets](#target) that have been used in the literature on language biases.

## Obtaining Biases Towards Target Concepts

We will run our method of finding biased words towards our target sets. We will be using functions in an external file - `utils.py` - which you are free to look through if you're interested. Otherwise, feel free to use the functions as desired.

Given a vocabulary and two sets of target words (such as, in this case, those for *women* and *men*), we rank the words from least to most biased. As such, we obtain two ordered lists of the most biased words towards each target set, obtaining an overall view of the bias distribution in that particular community with respect to those two target sets. 

The approach is as follows:
- We calculate the centroid of a target set by averaging the embedding vectors in our target set (e.g. the vectors for `he, son, his, him, father, male` for our target concept `male`);
- We calculate the cosine similarity between the vectors for all words in our vocabulary as compared to our two centroids (we also apply POS-filtering to only work with parts of speech we expect to be relevant);
- We use a threshold based on standard deviation to determine how severe a bias needs to be before we include it;
- We rank the words in the vocabulary of our word embeddings model based on their bias towards either target concept.

The implicit hypothesis here is that word vectors closer to the centroid of a target concept are used more similarly with respect that concept. This, in effect, is an effort to quantify the bias of words in the discourse of a community.

In [33]:
# Import function to calculate biased words
from utils import calculate_biased_words

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tomvannuenen/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Here are our target concept vectors. This is a critical choice: the centroid of these represents the concept at hand. What happens if you change the words included?

In [34]:
target1 = ['sister' , 'female' , 'woman' , 'girl' , 'daughter' , 'she' , 'hers' , 'her']
target2 = ['brother' , 'male' , 'man' , 'boy' , 'son' , 'he' , 'his' , 'him'] 

In [35]:
model = Word2Vec.load('../../data/aita_comments.emb')

In [36]:
[b1, b2] = calculate_biased_words(model, target1, target2, 4)

Let's print some biases. Here you see the most-biased words towards our target concepts (1 being *women*, 2 being *men*).

In [37]:
print('Biased words towards target set 1')
print([word for word in b1.keys()])

Biased words towards target set 1
['body', 'mention', 'healthy', 'involve', 'picture', 'shame', 'describe', 'context', 'creepy', 'approach', 'detail', 'photo', 'zero', 'gym', 'sensitive', 'gain', 'painful', 'hurtful', 'burn', 'social_medium', 'delivery', 'tone', 'commenter', 'overweight', 'intent', 'bridesmaid', 'frame', 'vulnerable', 'pic', 'chick', 'employer', 'shallow', 'skinny', 'thin', 'model', 'self_esteem', 'involved', 'cross_line', 'dodge_bullet', 'nerve', 'childbirth', 'self_conscious', 'unattractive', 'temporary', 'invasive', 'snapchat', 'risky', 'virginity', 'workout', 'hookup', 'shaming', 'superficial', 'douchey']


In [38]:
print('Biased words towards target set 2')
print([word for word in b2.keys()] )

Biased words towards target set 2
['responsibility', 'enjoy', 'cook', 'stick', 'worry', 'forget', 'luck', 'argue', 'win', 'plenty', 'split', 'separate', 'bar', 'trash', 'bag', 'household', 'vote', 'kitchen', 'buddy', 'cake', 'btw', 'realise', 'mate', 'uncle', 'grandfather', 'spectrum', 'lord', 'grandpa', 'grand', 'scout', 'collect', 'dear', 'fuss', 'gene', 'hunt', 'poverty', 'console', 'xbox']


## Visualizing Biases using $t$-SNE

We now return to our dimensionality reduction technique, $t$-SNE, to try and visualize these biased words in a 2D space using Bokeh.

In [39]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.manifold import TSNE
%matplotlib inline

In [40]:
with open(tsne_path, 'rb') as f:
    X_tsne = pickle.load(f)
    
tsne_df = pd.read_pickle(tsne_df_path)

In [41]:
# Convert biased term keys to arrays
target1_idx = np.array([model.wv.key_to_index[key] for key in b1.keys()])
target2_idx = np.array([model.wv.key_to_index[key] for key in b2.keys()])

In [42]:
# Find t-sne values for the biased sets
X_target1 = X_tsne[target1_idx]
X_target2 = X_tsne[target2_idx]

In [43]:
from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet

# Set up the Bokeh plot
output_notebook()

p = figure()

# Create ColumnDataSource for X_target1 (blue)
source1 = ColumnDataSource(data=dict(x=X_target1[:, 0], y=X_target1[:, 1], label=[model.wv.index_to_key[idx] for idx in target1_idx]))

# Create ColumnDataSource for X_target2 (red)
source2 = ColumnDataSource(data=dict(x=X_target2[:, 0], y=X_target2[:, 1], label=[model.wv.index_to_key[idx] for idx in target2_idx]))

# Add scatter plot for X_target1 (blue)
p.scatter(x='x', y='y', color='blue', size=8, source=source1)

# Add scatter plot for X_target2 (red)
p.scatter(x='x', y='y', color='red', size=8, source=source2)

# Add labels for X_target1
labels1 = LabelSet(x='x', y='y', text='label', x_offset=6, y_offset=3, source=source1, render_mode='canvas')
p.add_layout(labels1)

# Add labels for X_target2
labels2 = LabelSet(x='x', y='y', text='label', x_offset=6, y_offset=3, source=source2, render_mode='canvas')
p.add_layout(labels2)

# Show the plot
show(p)

## 💭 Reflection

Note that these binary target concepts are often a product of ideology and normativity in society: the gender binary is a good example. When checking for biases towards certain concepts, make sure you consider the fact that you are the one creating / reproducing these concepts, and that you may be reinforcing a constructed binary!

Also note that determining your own target concepts and biases is a **iterative** process. Try changing some of the words in the target concepts to see how the biased words and plot change, and discuss with your group what you think makes for a coherent and robust target set.

<a id='target'></a>
# Existing Target Sets

Here are some other target sets that have been previously used in the literature:

* *Gender target sets taken from Nosek, Banaji, and Greenwald 2002.*
    - Female: `sister, female, woman, girl, daughter, she, hers, her`.
    - Male: `brother, male, man, boy, son, he, his, him`.
* *Religion target sets taken from Garg et al. 2018.*
    - Islam: `allah, ramadan, turban, emir, salaam, sunni, koran, imam, sultan, prophet, veil, ayatollah, shiite, mosque, islam, sheik, muslim, muhammad`.
    - Christianity: `baptism, messiah, catholicism, resurrection, christianity, salva-tion, protestant, gospel, trinity, jesus, christ, christian, cross,catholic, church`.
* *Racial target sets taken from Garg et al. 2017*
    - White last names: `harris, nelson, robinson, thompson, moore, wright, anderson, clark, jackson, taylor, scott, davis, allen, adams, lewis, williams, jones, wilson, martin, johnson`.
    - Hispanic last names: `ruiz, alvarez, vargas, castillo, gomez, soto,gonzalez, sanchez, rivera, mendoza, martinez, torres, ro-driguez, perez, lopez, medina, diaz, garcia, castro, cruz`.
    - Asian last names: `cho, wong, tang, huang, chu, chung, ng,wu, liu, chen, lin, yang, kim, chang, shah, wang, li, khan,singh, hong`.
    - Russian last names: `gurin, minsky, sokolov, markov, maslow, novikoff, mishkin, smirnov, orloff, ivanov, sokoloff, davidoff, savin, romanoff, babinski, sorokin, levin, pavlov, rodin, agin`.
* *Career/family target sets taken from Garg et al. 2018.*
    - Career: `executive, management, professional, corporation, salary, office, business, career`.
    - Family: `home, parents, children, family, cousins, marriage, wedding, relatives.Math: math, algebra, geometry, calculus, equations, computation, numbers, addition`.
* *Arts/Science target sets taken from Garg et al. 2018.*
    - Arts: `poetry, art, sculpture, dance, literature, novel, symphony, drama`.
    - Science: `science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy`.

### Sources

Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics, 6(1), 101–115. https://doi.org/10.1037/1089-2699.6.1.101

Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2017). Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes, 1–33.