# Spanish Counterfactuals
## Background

This notebook is the second part of a research project into gender bias in language models. Spanish has grammatical gender and different words for male and female professions (el doctor nuevo / la doctora nueva), so language models encode these words as separate points in embedding space. This is common in many languages, and doesn't necessarily introduce bias.

**I chose Spanish only because I can understand and translate examples.**

My goal is to adapt existing metrics for measuring bias, and to propose more reusable solutions for gender bias in non-English NLP.

# Proposal

Now it's time to talk solutions: I would like to evaluate models with an original sentence and a gender-flipped version of the same sentence. We can then test whether any model's output changes between the two.
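
As a rough sketch of that evaluation, the snippet below counts how often a model's prediction changes between an original sentence and its flipped counterpart. The names `counterfactual_flip_rate` and `toy_predict` are hypothetical, and the "model" is a deliberately biased stand-in rather than a real classifier.

```python
from typing import Callable, List, Tuple

def counterfactual_flip_rate(
    predict: Callable[[str], str],
    pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of (original, flipped) pairs where the prediction changes."""
    if not pairs:
        return 0.0
    changed = sum(1 for original, flipped in pairs
                  if predict(original) != predict(flipped))
    return changed / len(pairs)

# Hypothetical stand-in for a real classifier: it keys on one gendered word.
def toy_predict(sentence: str) -> str:
    return "positive" if "doctor" in sentence.split() else "neutral"

pairs = [
    ("el doctor nuevo", "la doctora nueva"),  # prediction changes
    ("el libro viejo", "el libro viejo"),     # nothing to flip
]
rate = counterfactual_flip_rate(toy_predict, pairs)
print(rate)
```

A rate of zero would mean the model's output never depends on the flipped words. Any real study would need many more pairs than this.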

**Can word embeddings be applied to flip gender in Spanish?**

# Flipping individual words

Let's load the BETO pretrained embeddings using HuggingFace's Transformers module.

In [0]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/cd/38/c9527aa055241c66c4d785381eaf6f80a28c224cae97daa1f8b183b5fabb/transformers-2.9.0-py3-none-any.whl (635kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/3b/88/49e772d686088e1278766ad68a463513642a2a877487decbd691dec02955/sentencepiece-0.1.90-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)

In [0]:
from transformers import AutoTokenizer, BertModel
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = BertModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31002, 768, padding_idx=1)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

There are methods to convert whole sentences to PyTorch tensors, but let's focus on individual words. Each word is looked up in the model's input embedding layer as a 768-dimensional vector, and we can compare these vectors to each other.

In [0]:
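def embedding_for_word(word):
  # encode() wraps the word in [CLS] ... [SEP], so position 1 is the word's first wordpiece
  token_id = tokenizer.encode(word)[1]
  return model.embeddings.word_embeddings.weight[token_id].detach().numpy()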
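
One caveat with taking position 1 of the encoding: BERT-style tokenizers split out-of-vocabulary words into several wordpieces, so position 1 then holds only the first piece. The sketch below is a toy version of greedy longest-match WordPiece splitting. The vocabulary is invented for illustration, and BETO's real vocabulary may or may not split these particular words.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Invented toy vocabulary (an assumption, not BETO's vocabulary)
toy_vocab = {"maestro", "maestra", "abol", "##icion", "##ista"}
single = wordpiece_tokenize("maestro", toy_vocab)
split = wordpiece_tokenize("abolicionista", toy_vocab)
print(single)
print(split)
```

If a word does split like the second example, `embedding_for_word` would return the vector for the first piece only. Checking that the encoding has exactly three ids ([CLS], one piece, [SEP]) is one way to detect this.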

## Calculate vector differences between words

Let's calculate ```article_diff```, the difference between the embeddings for "el" and "la".

In [0]:
# "el" minus "la"
article_diff = embedding_for_word("el") - embedding_for_word("la")
print(article_diff)

[-1.01193652e-01 -9.56837833e-03  1.38982348e-02 -2.95504611e-02
 -4.62328494e-02 -3.54627706e-03 -3.15616056e-02  2.03622542e-02
 -3.60682607e-05  1.55523065e-02 -1.83134079e-02  1.56444181e-02
  4.51911502e-02 -2.40496062e-02 -3.26022282e-02 -4.34471518e-02
  3.63821909e-03 -3.73945720e-02 -1.38825476e-02  3.89223397e-02
  1.16123660e-02 -1.03008505e-02 -3.97766568e-03 -3.00680920e-02
  1.86911654e-02  7.11152926e-02  1.40250064e-02 -1.30725764e-02
  2.60123964e-02 -1.65068321e-02  7.43721500e-02 -1.97625123e-02
  5.68020567e-02 -1.81963909e-02  2.46390384e-02 -4.71784100e-02
 -1.56514253e-02 -3.83357853e-02  1.88288130e-02 -3.86340804e-02
  5.18993661e-03  3.23170684e-02  1.57557055e-02  3.41297276e-02
  2.10301094e-02  4.10606004e-02 -1.54591165e-04 -7.13625085e-03
 -7.02920230e-03 -5.48075140e-03 -4.54044901e-04 -1.09942835e-02
  1.07189910e-02  2.06957944e-02  1.50508285e-02 -4.16118652e-04
  5.66911185e-05  4.13249061e-02 -1.60544291e-02 -1.30373891e-02
  1.46066807e-02 -5.94341

Next, let's calculate a ```noun_diff``` vector between the masculine and feminine forms of a profession (such as maestro/maestra).

In [0]:
# "maestro" - "maestra"
noun_diff = embedding_for_word("maestro") - embedding_for_word("maestra")
print(noun_diff)

[-3.59191820e-02 -9.68925003e-03  1.42796785e-02 -9.90152545e-03
  6.92520216e-02 -3.00704166e-02 -6.19686171e-02  1.14616267e-02
 -1.72810331e-02  1.39672980e-02  3.81249674e-02  1.35235228e-02
 -3.24731506e-02  3.27836163e-03 -9.65690147e-03 -3.12734097e-02
  3.68267298e-02  3.58559191e-02  8.01959634e-03  3.57991792e-02
  8.31423178e-02 -5.37066087e-02  1.66588556e-02  9.48078930e-03
  2.42801681e-02  4.36856858e-02  3.07860076e-02  5.66809773e-02
 -2.23038439e-02  1.80876181e-02  1.41317472e-02  5.07193059e-03
  1.09874308e-02 -1.31353280e-02  3.02872658e-02  2.42619384e-02
 -1.82333421e-02  7.79369920e-02 -2.75658667e-02 -4.51937839e-02
  4.20431532e-02 -1.06230872e-02  1.15213096e-01 -3.30538377e-02
  5.36187887e-02  4.96502407e-03  2.27976888e-02 -4.73800227e-02
  3.71647961e-02  6.89722896e-02 -2.12735981e-02 -3.60166579e-02
 -3.63069773e-03  3.16753313e-02 -8.59868713e-03  4.13877517e-03
  1.91910006e-02  6.44873828e-03  4.26523313e-02  5.76551221e-02
  9.25509725e-03  1.97832

## Finding closest word (cosine similarity)

WEAT (the Word Embedding Association Test) measures the closeness of words by cosine similarity (whether two vectors point in the same direction, regardless of magnitude) rather than by distance in the embedding space. We'll continue using that measure to find the word closest to our calculated vector.

In [0]:
import numpy as np

def cosine_similarity(vec1, vec2):
    len1 = np.linalg.norm(vec1)
    len2 = np.linalg.norm(vec2)
    dot_product = np.dot(vec1, vec2)
    return dot_product / (len1 * len2)
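
To make the difference between direction and distance concrete, here is a small worked example with made-up three-dimensional vectors. The function is named `cosine` only to avoid shadowing the notebook's own `cosine_similarity`.

```python
import numpy as np

def cosine(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                     # same direction as a, ten times longer
c = np.array([3.0, 2.0, 1.0])  # same length as a, different direction

sim_ab = cosine(a, b)            # close to 1: direction is identical
sim_ac = cosine(a, c)            # lower: direction differs
dist_ab = np.linalg.norm(a - b)  # large Euclidean distance despite same direction
dist_ac = np.linalg.norm(a - c)  # small Euclidean distance

print(sim_ab, sim_ac)
print(dist_ab, dist_ac)
```

By cosine similarity `b` is the closer match to `a`, while by Euclidean distance `c` is closer. The two measures can rank neighbors differently.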

In [0]:
def closest_word(word, diff, printme=True):
  # we have to take the original word out of contention; diff may be too small otherwise
  original_id = tokenizer.encode(word)[1]

  # make the diff adjustment
  target = embedding_for_word(word) + diff

  most_sim = 0
  best_id = -1

  # index > 6 skips the first seven vocabulary ids (presumably reserved/special tokens)
  for index, candidate in enumerate(model.embeddings.word_embeddings.weight):
    sim = cosine_similarity(target, candidate.detach().numpy())
    if (sim > most_sim) and (index > 6) and (index != original_id):
      most_sim = sim
      best_id = index
  if printme:
    print(most_sim)

  return tokenizer.decode([best_id])
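
Looping over every row of the embedding matrix in Python is slow. The same search can be sketched with NumPy matrix operations, shown here on a toy matrix. `closest_index` is a hypothetical helper, not part of the notebook. To mirror the loop above, the excluded ids would be 0-6 plus the original word's id.

```python
import numpy as np

def closest_index(target, embedding_matrix, excluded_ids=()):
    """Return (index, similarity) of the row most cosine-similar to target."""
    row_norms = np.linalg.norm(embedding_matrix, axis=1)
    sims = embedding_matrix @ target / (row_norms * np.linalg.norm(target))
    excluded = list(excluded_ids)
    if excluded:
        sims[excluded] = -np.inf  # take excluded rows out of contention
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy 2-D "embedding matrix" with invented values
toy_matrix = np.array([
    [1.0, 0.0],  # row 0: identical to the target, so we exclude it
    [0.9, 0.1],  # row 1: nearly the same direction
    [0.0, 1.0],  # row 2: orthogonal
])
index, similarity = closest_index(np.array([1.0, 0.0]), toy_matrix, excluded_ids=[0])
print(index, similarity)
```

With the real model, `embedding_matrix` could be `model.embeddings.word_embeddings.weight.detach().numpy()`, computed once and reused across calls.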

Here we **take the article "una"** and add the ```article_diff``` vector to reach the masculine "un". Interestingly, the offset vector is very small, and (in this cased model) a diff of 0 would also give us "un", since it is the nearest neighbor once "una" itself is excluded.

In [0]:
print(closest_word("una", 1 * article_diff))
print(closest_word("una", 0 * article_diff))

0.74033797
un
0.7071701
un


In [0]:
print(closest_word("las", 1 * noun_diff))
print(closest_word("los", -1 * noun_diff))

0.5104903
los
0.506704
las


**Trying some nouns**

In [0]:
print(closest_word("compañera", 1 * noun_diff))

0.65458435
compañero


In [0]:
print(closest_word("doctor", -1 * noun_diff))
print(closest_word("doctora", 1 * noun_diff))

0.53279793
doctora
0.51159954
doctor


**On a word (library) where there is no gender-flipped counterpart**, we get a capitalized word. I tried this same experiment with an all-lowercase model, and the nearest neighbor of biblioteca was the plural, bibliotecas. Unfortunately the all-lowercase model caused other problems with my tests, so I returned to the cased model.

In [0]:
print(closest_word("biblioteca", 1 * noun_diff))

0.45826063
Biblioteca


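**On plurals** - the answers are in the vocabulary, but a reliable diff vector is hard to come by: each example below needed a different hand-picked scale, and "hombres" still lands on the singular "hombre".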

In [0]:
plural_diff = embedding_for_word("maestros") - embedding_for_word("maestras")

In [0]:
print(closest_word("chicos", -0.5 * plural_diff))
print(closest_word("hombres", -0.9 * plural_diff))

0.5572855
chicas
0.4703354
hombre


In [0]:
print(closest_word("madres", 0.6 * plural_diff))

0.47073933
padres
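
The hand-picked scales above (-0.5, -0.9, 0.6) suggest trying several scales and seeing where the nearest neighbor changes. Here is a minimal sketch of such a sweep. The vectors are invented two-dimensional stand-ins, so this shows the procedure and makes no claim about BETO's embeddings.

```python
import numpy as np

# Invented 2-D vectors, for illustration only (real embeddings have 768 dimensions)
toy_vectors = {
    "chicos": np.array([2.0, 1.0]),
    "chicas": np.array([2.0, -1.0]),
    "niños": np.array([2.0, 0.2]),
}
toy_diff = np.array([0.0, 2.0])  # plays the role of plural_diff (masculine minus feminine)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word, offset):
    """Nearest neighbor of word + offset, excluding the word itself."""
    target = toy_vectors[word] + offset
    others = [w for w in toy_vectors if w != word]
    return max(others, key=lambda w: cosine(target, toy_vectors[w]))

def sweep_scales(word, diff, scales):
    """Return [(scale, nearest neighbor)] for each candidate scale."""
    return [(s, nearest(word, s * diff)) for s in scales]

results = sweep_scales("chicos", toy_diff, [0.0, -0.25, -0.5, -1.0])
for scale, neighbor in results:
    print(scale, neighbor)
```

In this toy setup only the largest scale reaches the feminine form. With the real embeddings, `nearest` would be replaced by `closest_word(word, s * plural_diff, printme=False)`.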


Trying it on names

In [0]:
print(closest_word("Paula", 1 * noun_diff))
print(closest_word("Cecilia", 1 * noun_diff))

# Male names flip to Maestra??
print(closest_word("Pablo", -1 * noun_diff))
print(closest_word("Nicolás", -1 * noun_diff))
# same for random text
print(closest_word("d22n9@j", -1 * noun_diff))

0.3014029
Bruno
0.39872557
Claudio
0.48295468
maestra
0.46040207
maestra
0.4352325
maestra


In [0]:
# making a male name more male?
print(closest_word("Eduardo", 1 * noun_diff))

0.37805256
Ricardo


## Flipping names with gender-guesser package
Flipping male -> female names with embeddings is proving unreliable, so I will use this module to estimate the most likely gender for each name, and then flip to a name from the opposite gender.



In [0]:
! pip install gender-guesser
! wget https://github.com/lead-ratings/gender-guesser/blob/master/gender_guesser/data/nam_dict.txt?raw=true

Collecting gender-guesser
  Downloading https://files.pythonhosted.org/packages/13/fb/3f2aac40cd2421e164cab1668e0ca10685fcf896bd6b3671088f8aab356e/gender_guesser-0.4.0-py2.py3-none-any.whl (379kB)

In [0]:
import gender_guesser.detector as gender
d = gender.Detector()

In [0]:
print(d.get_gender("Pablo"))
print(d.get_gender("Paula"))
print(d.get_gender("Juan"))
print(d.get_gender("José"))
print(d.get_gender("Ashley"))

male
female
male
male
mostly_female


### Using the underlying data to generate names

Luckily gender-guesser includes information about popularity of names in Spain and a few other countries. We can then generate a male or female name from a list, and likely find it in the BETO model.

In [0]:
recc_names = {'M': [], 'F': []}

with open("nam_dict.txt?raw=true", "r") as names:
  for line in names:
    if line[0] == "#":
      # readme / header lines
      continue
    # character 36 of each row is the name's popularity in Spain (blank = not listed)
    spanish_pop = line[36]
    # popularity is a single character, so this is a string comparison
    if spanish_pop != " " and spanish_pop > "3":
      fields = line.split(' ')
      conventional_binary_gender = fields[0]
      if conventional_binary_gender in ['M', 'F']:
        recc_names[conventional_binary_gender].append(fields[2])

In [0]:
print(recc_names['M'][0:10])

['Adolfo', 'Adrián', 'Agustín', 'Alberto', 'Alejandro', 'Alfonso', 'Alfredo', 'Álvaro', 'Amador', 'Anastasio']
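
Putting the detector and the name lists together, a name swap might look like the sketch below. The lookup table and name lists are hypothetical stand-ins so the example runs on its own. In the notebook, `d.get_gender` and `recc_names` would take their place.

```python
import random

# Hypothetical stand-ins for d.get_gender and recc_names
toy_genders = {"Pablo": "male", "Paula": "female", "Ashley": "mostly_female"}
toy_names = {"M": ["Adolfo", "Alberto"], "F": ["Ana", "Carmen"]}

def flip_name(name, get_gender, names_by_gender, rng=random):
    label = get_gender(name)
    # "male" is a substring of "female", so the feminine check has to come first
    if "female" in label:
        return rng.choice(names_by_gender["M"])
    if "male" in label:
        return rng.choice(names_by_gender["F"])
    return name  # ambiguous or unknown names are left untouched

def lookup(name):
    return toy_genders.get(name, "unknown")

print(flip_name("Pablo", lookup, toy_names))
print(flip_name("Ashley", lookup, toy_names))
print(flip_name("Montevideo", lookup, toy_names))
```

The substring checks also catch the "mostly_male" and "mostly_female" labels. Whether those less certain names should be flipped at all is a judgment call.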


# Multilingual BERT
I've wondered whether it would make sense to use multilingual BERT (mBERT) instead of BETO for this task. Here I'm using the all-lowercase (uncased) model.

In [0]:
from transformers import BertTokenizer
tokenizer2 = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")

model2 = BertModel.from_pretrained("bert-base-multilingual-uncased")
model2.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(105879, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
         

In [0]:
def embedding_for_bert(word):
  # position 1 is the word's first wordpiece (position 0 is [CLS])
  token_id = tokenizer2.encode(word)[1]
  return model2.embeddings.word_embeddings.weight[token_id].detach().numpy()

In [0]:
def closest_in_bert(word, diff):
  # same search as closest_word, but against the mBERT vocabulary
  original_id = tokenizer2.encode(word)[1]
  target = embedding_for_bert(word) + diff

  most_sim = 0
  best_id = -1

  for index, candidate in enumerate(model2.embeddings.word_embeddings.weight):
    sim = cosine_similarity(target, candidate.detach().numpy())
    if (sim > most_sim) and (index > 6) and (index != original_id):
      most_sim = sim
      best_id = index
  print(most_sim)

  return tokenizer2.decode([best_id])

### English language analogies / gender flip

English analogies don't work so well in mBERT: instead of "queen" and "king", the nearest neighbors are just "woman" and "man".

In [0]:
bert_n_diff = embedding_for_bert("man") - embedding_for_bert("woman")

In [0]:
closest_in_bert("king", -1 * bert_n_diff)

0.5023459


'woman'

In [0]:
closest_in_bert("queen", 1 * bert_n_diff)

0.60389835


'man'

### Spanish language analogies / gender flip

Same problem in Spanish: "rey" lands on "mujer" rather than "reina".

In [0]:
bert_es_n_diff = embedding_for_bert("hombre") - embedding_for_bert("mujer")

In [0]:
closest_in_bert("rey", -1 * bert_es_n_diff)

0.44842014


'mujer'

In [0]:
gender_es = embedding_for_bert("hombre") - embedding_for_bert("mujer")

This one appears to work; unfortunately the ñ is being dropped, because the uncased tokenizer strips accents and other diacritics when it lowercases text.

In [0]:
closest_in_bert("compañera", 1 * gender_es)

1.0


'companero'
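
The missing ñ is consistent with the accent stripping that uncased BERT tokenizers apply during normalization. That step can be reproduced with the standard library alone: decompose the text with Unicode NFD, then drop the combining marks. The helper name `strip_accents` is only illustrative.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("compañera"))  # companera
print(strip_accents("Nicolás"))    # Nicolas
```

Under NFD, ñ becomes a plain n followed by a combining tilde, so removing combining marks turns it into n. This is one reason a cased model can suit Spanish better.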

### Translation

Translation works, though!

In [0]:
translate_en_es = embedding_for_bert("biblioteca") - embedding_for_bert("library")

In [0]:
closest_in_bert("escuela", -1 * translate_en_es)

0.529695


'school'

### Reflections

It's unfair to expect perfect single-word changes from sentence-level transformers, which are built to represent words in context.

I would like to try this as a seq2seq task, but that's beyond my level for now.

# Parse and flip sentences with spaCy

Here's the strategy: parse Spanish sentences, determine which words need to be flipped, and use BETO to flip their corresponding articles and adjectives.

In [0]:
! pip install --upgrade spacy
! python -m spacy download es_core_news_md

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.2.4)
Collecting es_core_news_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-2.2.5/es_core_news_md-2.2.5.tar.gz (78.4MB)
Building wheels for collected packages: es-core-news-md
  Building wheel for es-core-news-md (setup.py) ... done
  Created wheel for es-core-news-md: filename=es_core_news_md-2.2.5-cp36-none-any.whl size=79649483 sha256=970a3d7deec88b36e340626ef6af2b594426ad19005d0327f8ca4c43b1acadef
  Stored in directory: /tmp/pip-ephem-wheel-cache-r9p8u3bg/wheels/b7/bb/a3/29ab5cf80c2c0a8fa0f2af8402fdace3f159e8265f0fdcbcdb
Successfully built es-core-news-md
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_md')


**Restart runtime after you install that stuff ^^**

## Explore dependency parsing

The key spaCy feature here is dependency parsing, documented at https://spacy.io/usage/linguistic-features

In [0]:
import spacy
nlp = spacy.load("es_core_news_md")

In [0]:
doc1 = nlp("Estamos en nuestra casa.")
for chunk in doc1.noun_chunks:
    print(chunk.text + "\n" + chunk.root.text + "\n" + chunk.root.dep_ + "\n" + chunk.root.head.text)

nuestra casa
casa
ROOT
casa


In [0]:
doc = nlp("La mujer y doctor van a la biblioteca para leer un libro viejo.")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

def get_gender_deps(sentence):
    # determiners ("det") and adjectival modifiers ("amod") must agree in gender with their head noun
    pairings = []
    for token in sentence:
        if not isinstance(token, str):
            if token.dep_ in ("det", "amod"):
                pairings.append([token.text, token.head.text])
    return pairings

def describe_deps(pairings):
    return '\n'.join(dep + " depends on gender of " + head for dep, head in pairings)

print('---')
print(describe_deps(get_gender_deps(doc)))
print('---')
print(describe_deps(get_gender_deps(doc1)))


La det mujer NOUN []
mujer nsubj van VERB [La, doctor]
y cc doctor NOUN []
doctor conj mujer NOUN [y]
van ROOT van VERB [mujer, biblioteca, leer, .]
a case biblioteca NOUN []
la det biblioteca NOUN []
biblioteca obj van VERB [a, la]
para mark leer VERB []
leer advcl van VERB [para, libro]
un det libro NOUN []
libro obj leer VERB [un, viejo]
viejo amod libro NOUN []
. punct van VERB []
---
La depends on gender of mujer
la depends on gender of biblioteca
un depends on gender of libro
viejo depends on gender of libro
---
nuestra depends on gender of casa


Here's a sentence directly from Spanish Wikipedia. It doesn't have any words which we'd change, but we can verify that it is parsed correctly.

In [0]:
wikiSentence = "Las personas interesadas consideran que la tendencia de las lenguas a cambiar en su desarrollo natural a través de la historia, permite potencialmente lograr una mayor inclusión social, cuando cierta conciencia social influye sobre los cambios de las lenguas."
get_gender_deps(nlp(wikiSentence))

[['Las', 'personas'],
 ['interesadas', 'personas'],
 ['la', 'tendencia'],
 ['las', 'lenguas'],
 ['su', 'desarrollo'],
 ['natural', 'desarrollo'],
 ['la', 'historia'],
 ['una', 'inclusión'],
 ['mayor', 'inclusión'],
 ['social', 'inclusión'],
 ['cierta', 'conciencia'],
 ['social', 'conciencia'],
 ['los', 'cambios'],
 ['las', 'lenguas']]

These sentences are interesting: the parser connects "el" to "lingüista", but here the article's gender really follows the person being named. "Lingüista" has the same form for both genders, and the article would be "la" for a female linguist.

In the second sentence below, "holandés" would change to "holandesa".

In [0]:
wikiSentence2 = "En el siglo XX el lingüista estadounidense Noam Chomsky creó la corriente conocida como generativismo."
get_gender_deps(nlp(wikiSentence2))

[['el', 'siglo'],
 ['el', 'lingüista'],
 ['estadounidense', 'lingüista'],
 ['la', 'corriente'],
 ['conocida', 'corriente']]

In [0]:
wikiSentence3 = "La figura más relevante dentro de esta corriente tal vez sea el lingüista holandés Simon C. Dik, autor del libro Functional Grammar."
get_gender_deps(nlp(wikiSentence3))

[['La', 'figura'],
 ['relevante', 'figura'],
 ['esta', 'corriente'],
 ['el', 'lingüista'],
 ['holandés', 'lingüista']]

The noun "miembro" here does not need to change for a female subject.  This is common for a lot of words.

In [0]:
wikiSentence4 = "Tuvo una influencia decisiva en la creación de la Cruz Roja Británica en 1870, y fue miembro de su comité de damas interesándose por las actividades del movimiento hasta su fallecimiento."
get_gender_deps(nlp(wikiSentence4))

[['una', 'influencia'],
 ['decisiva', 'influencia'],
 ['la', 'creación'],
 ['la', 'Cruz'],
 ['su', 'comité'],
 ['las', 'actividades'],
 ['su', 'fallecimiento']]

In [0]:
wikiSentence5 = "El padre de Fanny (abuelo materno de Florence) fue el abolicionista y unitarista William Smith."
get_gender_deps(nlp(wikiSentence5))

[['El', 'padre'], ['materno', 'abuelo'], ['el', 'abolicionista']]

## Come up with a strategy for flipping words

In [0]:
print(closest_word("El", 0 * article_diff))
print(closest_word("padre", 0 * noun_diff))

0.6407496
La
0.6534688
madre


In [0]:
print(closest_word("materno", 0 * noun_diff))
print(closest_word("abuelo", 0 * noun_diff))

0.60393536
materna
0.6682723
abuela


In [0]:
print(closest_word("el", 0 * article_diff))
print(closest_word("abolicionista", 0 * noun_diff))

0.63179874
la
0.5781752
abolición
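
The last result shows "abolicionista" drifting to "abolición", a different word, when the noun itself should stay as it is (el/la abolicionista). One possible guard is a small rule-based check for nouns that keep the same form for both genders. The suffix and word lists below are assumptions chosen by hand for illustration, not an exhaustive linguistic resource.

```python
# Assumption: nouns ending in "-ista" take el/la without changing form
INVARIANT_SUFFIXES = ("ista",)
# Assumption: a small hand-maintained list of other nouns to leave alone
INVARIANT_NOUNS = {"miembro"}

def is_invariant_noun(noun):
    lowered = noun.lower()
    return lowered in INVARIANT_NOUNS or lowered.endswith(INVARIANT_SUFFIXES)

def flip_noun_guarded(noun, flip):
    """Apply flip (any word -> word function) unless the noun is invariant."""
    return noun if is_invariant_noun(noun) else flip(noun)

# Toy flip function standing in for the embedding-based lookup
toy_flips = {"padre": "madre", "abolicionista": "abolición"}

def toy_flip(word):
    return toy_flips.get(word, word)

result = [flip_noun_guarded(w, toy_flip) for w in ["padre", "abolicionista", "miembro"]]
print(result)
```

The suffix rule also matches inanimate nouns such as "lista" or "revista". Leaving those unchanged is harmless here, since they should not be flipped anyway.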


### Flip everything

In [0]:
def flip_sentence(sentence):
    doc = nlp(sentence)
    pairings = get_gender_deps(doc)
    words = []
    for token in doc:
        alt_word = None
        if token.pos_ == "NOUN":
            alt_word = closest_word(token.text, 0.5 * noun_diff, printme=False)
        elif len(pairings) > 0 and token.text == pairings[0][0]:
            diff = noun_diff
            if token.text.lower() in ['el', 'la', 'los', 'las', 'un', 'una', 'unos', 'unas', 'este', 'esta', 'estos', 'estas', 'aquel', 'aquello', 'aquella', 'aquellos', 'aquellas']:
              diff = article_diff
            alt_word = closest_word(token.text, 1 * diff, printme=False)
            pairings = pairings[1:]
        if alt_word is None or alt_word.lower() == token.text.lower():
            words.append(token.text)
        else:
            words.append(alt_word)
    return ' '.join(words)

In [0]:
flip_sentence("Montevideo es la ciudad más importante en lo que a deportes se refiere de todo el Uruguay")

'Montevideo es el ciudades más importantes en lo que a deporte se refiere de todo el Uruguay'

### Don't flip if closest neighbor to noun is a plural

In [0]:
def flip_2(sentence):
    doc = nlp(sentence)
    pairings = get_gender_deps(doc)
    words = []
    for token in doc:
        alt_word = None
        if token.pos_ == "NOUN":
            if 'Gender=Masc' in token.tag_:
              alt_word = closest_word(token.text, -0.5 * noun_diff, printme=False)
            else:
              alt_word = closest_word(token.text, 0.5 * noun_diff, printme=False)
            alt_nlp = nlp(alt_word)[0]

            if ('NOUN__Gender=Fem' in token.tag_ and 'NOUN__Gender=Fem' in alt_nlp.tag_):
              alt_word = None
            elif ('NOUN__Gender=Masc' in token.tag_ and 'NOUN__Gender=Masc' in alt_nlp.tag_):
              alt_word = None
            
        elif len(pairings) > 0 and token.text == pairings[0][0]:
            diff = noun_diff
            if token.text.lower() in ['el', 'la', 'los', 'las', 'un', 'una', 'unos', 'unas', 'este', 'esta', 'estos', 'estas', 'aquel', 'aquello', 'aquella', 'aquellos', 'aquellas']:
              diff = article_diff
            

            dep_noun_nlp = nlp(pairings[0][1])
            if 'Gender=Masc' in dep_noun_nlp[0].tag_:
              alt_noun = closest_word(pairings[0][1], -0.5 * noun_diff, printme=False)
            else:
              alt_noun = closest_word(pairings[0][1], 0.5 * noun_diff, printme=False)
            alt_noun_nlp = nlp(alt_noun)

            if ('NOUN__Gender=Fem' in dep_noun_nlp[0].tag_ and 'NOUN__Gender=Fem' in alt_noun_nlp[0].tag_):
              alt_word = None
            elif ('NOUN__Gender=Masc' in dep_noun_nlp[0].tag_ and 'NOUN__Gender=Masc' in alt_noun_nlp[0].tag_):
              alt_word = None
            elif ('PROPN_' in dep_noun_nlp[0].tag_):
              alt_word = None
            else:
              if 'Gender=Masc' in token.tag_:
                alt_word = closest_word(token.text, -1 * diff, printme=False)
              else:
                alt_word = closest_word(token.text, 1 * diff, printme=False)
            pairings = pairings[1:]
        if alt_word is None or alt_word.lower() == token.text.lower():
            words.append(token.text)
        else:
            words.append(alt_word)
    return ' '.join(words)

In [0]:
flip_2("Montevideo es la ciudad más importante en lo que a deportes se refiere de todo el Uruguay")

'Montevideo es la ciudad más importante en lo que a deportes se refiere de todo el Uruguay'

In [0]:
flip_2("El padre de Fanny (abuelo materno de Florence) fue el abolicionista y unitarista William Smith.")

'La madre de Fanny ( abuela materna de Florence ) fue la abolición y unitarista William Smith .'
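
The output above has extra spaces around the parentheses and before the final period because the tokens are re-joined with single spaces. A possible fix is to carry each token's original trailing whitespace along with its text. spaCy exposes this as `token.whitespace_`. The sketch below uses a hand-written list of pairs so it runs without a parser.

```python
def rebuild_sentence(tokens):
    """tokens: iterable of (text, trailing_whitespace) pairs."""
    return "".join(text + ws for text, ws in tokens)

# Hypothetical token list; with spaCy each pair could be built as
# (flipped_word or token.text, token.whitespace_)
tokens = [
    ("La", " "), ("madre", " "), ("de", " "), ("Fanny", " "),
    ("(", ""), ("abuela", " "), ("materna", ""), (")", ""), (".", ""),
]
rebuilt = rebuild_sentence(tokens)
print(rebuilt)  # La madre de Fanny (abuela materna).
```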

Some unusual lateral jumps here: the years change (1918-2004 becomes 1916-2005), "House" becomes "Chase", and "rivalidad" is replaced by a near-synonym, "enfrentamientos", because it has the opposite grammatical gender.

In [0]:
flip_2("Uno de los aspectos más destacados de la rivalidad fue la llamada Maldición del Bambino, un periodo de 86 años (1918-2004) en el que los Red Sox no ganaron ni una vez la Serie Mundial.")

'Uno de los aspectos más destacados de el enfrentamientos fue el llamado Maldición del Bambino , un periodo de 86 años ( 1916 - 2005 ) en la que los Red Sox no ganaron ni una vez la Serie Mundial .'
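
The year changes (1918 to 1916, 2004 to 2005) happen because numbers have close neighbors in embedding space too. A simple guard is to refuse to replace any token that is purely numeric. The regex and helper below are one possible sketch. spaCy's `token.like_num` attribute might serve a similar purpose.

```python
import re

# Digits, optionally with . or , separators (e.g. 1918, 3,5, 1.000)
NUMERIC = re.compile(r"^\d+([.,]\d+)*$")

def keep_if_numeric(original, candidate):
    """Return the original token when it is a number, else the candidate."""
    return original if NUMERIC.match(original) else candidate

print(keep_if_numeric("1918", "1916"))                  # 1918
print(keep_if_numeric("rivalidad", "enfrentamientos"))  # enfrentamientos
```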

In [0]:
flip_2("El Dr. House a menudo confronta con su jefa, la administradora del hospital Dra. Lisa Cuddy (Lisa Edelstein), y a su equipo de diagnóstico, debido a la gran cantidad de hipótesis que surgen con respecto a la enfermedad del paciente basadas en finas o controvertidas perspicacias.")

'La Dr . Chase a veces confronta con sus jefe , el administradores del hospital Dra . Lisa Cuddy ( Lisa Edelstein ) , y a su equipo de diagnóstico , debido a la gran cantidad de hipótesis que surgen con respecto a la enfermedad del pacientes basados en fino o controvertidas perspicacias .'

## Correct errors with numbers, verbs, etc.

In [0]:
def better_gender_deps(sentence):
    pairings = []
    for token in sentence:
        if type(token) != type(''):
            if token.dep_ == "det" or token.dep_ == "amod":
                pairings.append([token, token.head])
    return pairings

def flip_noun(og_token):
    if 'Gender=Masc' in og_token.tag_:
      alt_word = closest_word(og_token.text, -0.5 * noun_diff, printme=False)
    else:
      alt_word = closest_word(og_token.text, 0.5 * noun_diff, printme=False)

    print(og_token.text + " -> " + alt_word)

    alt_nlp = nlp(alt_word)[0]
    if ((alt_nlp.pos_ != 'NOUN') or # don't allow change to a verb
        ('NOUN__Gender=Fem' in og_token.tag_ and 'NOUN__Gender=Fem' in alt_nlp.tag_) or
        ('NOUN__Gender=Masc' in og_token.tag_ and 'NOUN__Gender=Masc' in alt_nlp.tag_)): # or
        #(og_token.lemma_ not in alt_nlp.lemma_)):
        alt_word = None
    return alt_word

def flip_3(sentence):
    sentence = sentence.replace('Dr.', 'Doctor').replace('Dra.', 'Doctora').replace('Sr.', 'Señor').replace('Sra.', 'Señora').replace('Srta.', 'Señorita')
    doc = nlp(sentence)
    pairings = better_gender_deps(doc)
    words = []
    for token in doc:
        alt_word = None

        if token.pos_ == "NOUN":
            if 'AdvType=Tim' not in token.tag_: # don't change years
              alt_word = flip_noun(token)
              if alt_word is not None:
                print(token.text + " ? " + str(alt_word))
            
        elif len(pairings) > 0 and token.text == pairings[0][0].text:
            diff = noun_diff
            if token.text.lower() in ['el', 'la', 'los', 'las', 'un', 'una', 'unos', 'unas', 'este', 'esta', 'estos', 'estas', 'aquel', 'aquello', 'aquella', 'aquellos', 'aquellas']:
              diff = article_diff
            
            dep_noun_token = pairings[0][1]
            if (('PROPN_' not in dep_noun_token.tag_) and ('Gender' in token.tag_)):
              alt_noun = flip_noun(dep_noun_token)
              if alt_noun is not None: # don't change ADJ if the noun would not change
                print(token.text + " ? " + str(alt_noun))
                alt_noun_nlp = nlp(alt_noun)
                if 'Gender=Masc' in token.tag_:
                  alt_word = closest_word(token.text, -0.6 * diff, printme=False)
                else:
                  alt_word = closest_word(token.text, 0.6 * diff, printme=False)
            pairings = pairings[1:]
        if alt_word is None or alt_word.lower() == token.text.lower():
            words.append(token.text)
        else:
            words.append(alt_word)
    return ' '.join(words)
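
The chain of `.replace()` calls at the top of `flip_3` could also be written as a single table-driven pass, which is easier to extend. This is a sketch of that alternative. The `(?<!\w)` lookbehind is an added assumption that abbreviations should not match in the middle of a longer word.

```python
import re

ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Dra.": "Doctora",
    "Sr.": "Señor",
    "Sra.": "Señora",
    "Srta.": "Señorita",
}

# Longest keys first, so "Srta." and "Dra." are tried before "Sr." and "Dr."
_keys = sorted(ABBREVIATIONS, key=len, reverse=True)
_pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")")

def expand_abbreviations(sentence):
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], sentence)

expanded = expand_abbreviations("El Dr. House y la Dra. Cuddy")
print(expanded)  # El Doctor House y la Doctora Cuddy
```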

In [0]:
flip_3("El Dr. House a menudo confronta con su jefa, la administradora del hospital Dra. Lisa Cuddy (Lisa Edelstein), y a su equipo de diagnóstico, debido a la gran cantidad de hipótesis que surgen con respecto a la enfermedad del paciente basadas en finas o controvertidas perspicacias.")

menudo -> veces
menudo ? veces
jefa -> jefe
jefa ? jefe
administradora -> administradores
la ? administradores
administradora -> administradores
administradora ? administradores
hospital -> Hospital
equipo -> equipos
diagnóstico -> diagnos
cantidad -> cantidades
cantidad -> cantidades
hipótesis -> teorías
respecto -> Respecto
respecto ? Respecto
enfermedad -> enfermedades
enfermedad -> enfermedades
paciente -> pacientes
paciente ? pacientes
paciente -> pacientes
basadas ? pacientes
finas -> fino
perspicacias -> persecución


'El Doctor House a veces confronta con su jefe , el administradores del hospital Doctora Lisa Cuddy ( Lisa Edelstein ) , y a su equipo de diagnóstico , debido a la gran cantidad de hipótesis que surgen con respecto a la enfermedad del pacientes basados en finas o controvertidas perspicacias .'

In [0]:
flip_3("En 2004 su creador David Shore y los productores ejecutivos Katie Jacobs y Paul Attanasio, le presentaron a la cadena televisiva Fox Broadcasting Company")

'En 2004 su creador David Shore y los productores ejecutivos Katie Jacobs y Paul Attanasio , le presentaron a la cadena televisiva Fox Broadcasting Company'

In [0]:
flip_3("Durante las audiciones, el actor británico Hugh Laurie se encontraba en las filmaciones de la película El vuelo del Fénix en Namibia.")

el ? actriz
actor ? actriz
británico ? actriz


'Durante las audiciones , la actriz británica Hugh Laurie se encontraba en las filmaciones de la película El vuelo del Fénix en Namibia .'

## Final proper noun / name flip

In [0]:
pronoun_diff = embedding_for_word("nosotros") - embedding_for_word("nosotras")

In [0]:
import random 

import gender_guesser.detector as gender
d = gender.Detector()

In [0]:
def pub_gender_deps(sentence):
    pairings = []
    for token in sentence:
        if type(token) != type(''):
            if token.dep_ == "det" or token.dep_ == "amod":
                pairings.append([token, token.head])
    return pairings

def pub_flip_noun(og_token):
    if og_token.pos_ == "PRON": # pronouns hit different (nosotros/nosotras)
      diff = 1 * noun_diff #pronoun_diff
    else:
      diff = 0.5 * noun_diff

    if (og_token.text.lower() == "nosotros"): # hardcoded b/c of weird spaCy parsing on upper/lower case
      return 'nosotras'
    elif (og_token.text.lower() == "nosotras"):
      return 'nosotros'
    elif 'Gender=Masc' in og_token.tag_:
      # flip to feminine
      alt_word = closest_word(og_token.text, -1 * diff, printme=False)
    elif (og_token.pos_ == "NOUN") and ("Gender" not in og_token.tag_):
      # el/la lingüista - word stays the same but adj. should change
      # don't try this with pronouns; I dunno how they are tagged
      alt_word = og_token.text
    else:
      # flip to masculine
      alt_word = closest_word(og_token.text, 1 * diff, printme=False)

    alt_nlp = nlp(alt_word)[0]
    if ((alt_nlp.pos_ not in ['NOUN', 'PRON']) or # don't allow change to a verb
        ('NOUN__Gender=Fem' in og_token.tag_ and 'NOUN__Gender=Fem' in alt_nlp.tag_) or
        ('NOUN__Gender=Masc' in og_token.tag_ and 'NOUN__Gender=Masc' in alt_nlp.tag_)): # or
        #(og_token.lemma_ not in alt_nlp.lemma_)):
        alt_word = None
    return alt_word

def pub_flip(sentence):
    sentence = sentence.replace('Dr.', 'Doctor').replace('Dra.', 'Doctora').replace('Sr.', 'Señor').replace('Sra.', 'Señora').replace('Srta.', 'Señorita')
    doc = nlp(sentence)
    pairings = pub_gender_deps(doc)
    words = []
    just_saw_proper_noun = False

    for token in doc:
        alt_word = None

        if token.pos_ == "PROPN" and not just_saw_proper_noun: # swap first names
            conventional_binary_gen = d.get_gender(token.text)
            just_saw_proper_noun = True # don't change Hugh Laurie's last name just b.c. it could be a female first name
            if 'female' in conventional_binary_gen:
              alt_word = random.choice(recc_names['M'])
            elif 'male' in conventional_binary_gen:
              alt_word = random.choice(recc_names['F'])
            # leave ambiguous or unknown names alone

        else:
            just_saw_proper_noun = False

            if token.pos_ == "NOUN" or token.pos_ == "PRON":
              if 'AdvType=Tim' not in token.tag_: # don't change years
                alt_word = pub_flip_noun(token)
            
            elif len(pairings) > 0 and token.text == pairings[0][0].text:
                diff = noun_diff
                if token.text.lower() in ['el', 'la', 'los', 'las', 'un', 'una', 'unos', 'unas', 'este', 'esta', 'estos', 'estas', 'aquel', 'aquello', 'aquella', 'aquellos', 'aquellas']:
                  diff = article_diff
                
                dep_noun_token = pairings[0][1]
                if (('PROPN_' not in dep_noun_token.tag_) and ('Gender' in token.tag_)):
                  alt_noun = pub_flip_noun(dep_noun_token)
                  if alt_noun is not None: # don't change ADJ if the noun would not change
                    # print(token.text + " ? " + str(alt_noun))
                    if 'Gender=Masc' in token.tag_:
                      alt_word = closest_word(token.text, -0.6 * diff, printme=False)
                    else:
                      alt_word = closest_word(token.text, 0.6 * diff, printme=False)
                pairings = pairings[1:]
        if alt_word is None or alt_word.lower() == token.text.lower():
            words.append(token.text)
        else:
            words.append(alt_word)
    return ' '.join(words)

In [0]:
pub_flip("Durante las audiciones, el actor británico Hugh Laurie se encontraba en las filmaciones de la película El vuelo del Fénix en Namibia.")

'Durante las audiciones , la actriz británica Paula Laurie se encontraba en las filmaciones de la película El vuelo del Fénix en Namibia .'

This still mishandles some place names. I considered not resetting just_saw_proper_noun after a token with token.pos_ == "PUNCT", but then I realized that would disable flipping in lists of first names, which are much more common.

In [0]:
pub_flip("Ellos viven en Savannah, Georgia")

'Ellas viven en Eugenio , Ernesto'

In [0]:
for token in nlp("Savannah, Georgia"):
  print(token.text + ": " + token.tag_)

Savannah: PROPN___
,: PUNCT__PunctType=Comm
Georgia: PROPN___


In [0]:
pub_flip("¿Dónde están Justo, Juan, y Alicia?")

'¿ Dónde están Sara , Purificación , y Mario ?'
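
To make the trade-off concrete, here is a minimal sketch of only the just_saw_proper_noun bookkeeping, run on plain (text, pos) pairs instead of spaCy tokens. The function name name_positions_to_swap and the reset_on_punct switch are hypothetical, and the sketch ignores the gender_guesser lookup that decides whether a flagged name is actually replaced.

```python
def name_positions_to_swap(tagged_tokens, reset_on_punct=True):
    """Return indexes of proper nouns the loop would treat as first names.

    tagged_tokens is a list of (text, pos) pairs standing in for spaCy tokens.
    """
    positions = []
    just_saw_proper_noun = False
    for i, (_text, pos) in enumerate(tagged_tokens):
        if pos == "PROPN" and not just_saw_proper_noun:
            positions.append(i)
            just_saw_proper_noun = True
        elif pos == "PUNCT" and not reset_on_punct:
            continue  # alternative behavior: punctuation keeps the flag as-is
        else:
            just_saw_proper_noun = False
    return positions

place = [("Savannah", "PROPN"), (",", "PUNCT"), ("Georgia", "PROPN")]
names = [("Justo", "PROPN"), (",", "PUNCT"), ("Juan", "PROPN"),
         (",", "PUNCT"), ("y", "CCONJ"), ("Alicia", "PROPN")]

for label, tokens in [("place", place), ("names", names)]:
    print(label, name_positions_to_swap(tokens),
          name_positions_to_swap(tokens, reset_on_punct=False))
```

On these toy inputs, keeping the flag across punctuation leaves Georgia alone, but it also skips Juan in the list of names, which is the behavior I wanted to avoid.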

Debugging the el/la lingüista issue: spaCy gives lingüista no Gender feature, while biblioteca is tagged Gender=Fem.

In [0]:
for token in nlp("el lingüista"):
  print(token.text + ": "  + token.tag_)

for token in nlp("la biblioteca"):
  print(token.text + ": "  + token.tag_)

el: DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art
lingüista: NOUN__Number=Sing
la: DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art
biblioteca: NOUN__Gender=Fem|Number=Sing
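
The substring checks on token.tag_ can also be written as a small parser. This is a sketch that assumes the tag format printed above (part of speech, two underscores, then Key=Value pairs separated by "|"); parse_tag and noun_without_gender are hypothetical helper names, not part of spaCy.

```python
def parse_tag(tag):
    """Split a tag like 'NOUN__Gender=Fem|Number=Sing' into (pos, features)."""
    pos, _, feature_text = tag.partition("__")
    features = {}
    for item in feature_text.split("|"):
        if "=" in item:  # skips the bare "_" in tags such as 'PROPN___'
            key, value = item.split("=", 1)
            features[key] = value
    return pos, features

def noun_without_gender(tag):
    """True for nouns tagged with no Gender feature (the el/la lingüista case)."""
    pos, features = parse_tag(tag)
    return pos == "NOUN" and "Gender" not in features

for tag in ["NOUN__Number=Sing",
            "NOUN__Gender=Fem|Number=Sing",
            "DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art",
            "PROPN___"]:
    print(tag, parse_tag(tag)[1], noun_without_gender(tag))
```

A missing Gender feature does not prove that a noun takes both el and la; it is only the heuristic the flip function relies on.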


In [0]:
pub_flip("el lingüista nuevo")

'la lingüista nueva'

In [0]:
pub_flip("él fue a la biblioteca.")

'ella fue a la biblioteca .'

In [0]:
pub_flip("En el siglo XX el lingüista estadounidense Noam Chomsky creó la corriente conocida como generativismo.")

'En la siglo XX la lingüista estadounidense Carolina Chomsky creó la corriente conocida como generativismo .'

There are also problems when mirroring capitalized Nosotros/Nosotras: spaCy labels Nosotros as having gender, but gives Nosotras no Gender feature (and tags it Number=Sing).

In [0]:
for token in nlp("Nosotras corrimos en el parque"):
  print(token.text + ": " + token.tag_)

Nosotras: PRON__Number=Sing|Person=1
corrimos: VERB__Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin
en: ADP__AdpType=Prep
el: DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art
parque: NOUN__Gender=Masc|Number=Sing


In [0]:
closest_word("Nosotras", 1 * noun_diff)

0.5687854


'nos'

In [0]:
print(pub_flip("Nosotros corrimos en el parque"))
pub_flip("Nosotras corrimos en el parque")

nosotras corrimos en el parque


'nosotros corrimos en el parque'
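
The hardcoded swap returns lowercase forms, so a sentence-initial Nosotros comes back as nosotras. One possible fix, sketched here under the assumption that only this pronoun pair is special-cased, is to copy the capitalization of the original token. PRONOUN_SWAPS and swap_pronoun_keep_case are hypothetical names and are not wired into pub_flip_noun.

```python
PRONOUN_SWAPS = {"nosotros": "nosotras", "nosotras": "nosotros"}

def swap_pronoun_keep_case(text, swaps=PRONOUN_SWAPS):
    """Return the mirrored pronoun with the casing of text, or None if unknown."""
    replacement = swaps.get(text.lower())
    if replacement is None:
        return None
    if text.isupper():
        return replacement.upper()
    if text[0].isupper():
        return replacement.capitalize()
    return replacement

for word in ["Nosotros", "nosotras", "NOSOTROS", "parque"]:
    print(word, swap_pronoun_keep_case(word))
```

Returning None for any other word keeps the same "no change" convention that pub_flip_noun uses.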

# Could counterfactuals flip to -@s endings?

The closest word to "amig@s" known to the language model is "Ami", but only because the tokenizer separates the beginning of the word from the "@".

In [0]:
closest_word("amig@s", 0 * noun_diff)

0.49993205


'Ami'

In [0]:
encodeAt = tokenizer.encode("amig@s")
print(encodeAt)
print(tokenizer.decode(encodeAt))

[4, 1822, 30948, 3, 1020, 5]
[CLS] amig [UNK] s [SEP]


A word ending in "-e" or "-es" is also not recognized as a single token by the language model: each example below is encoded as two word pieces.

In [0]:
encodeAt = tokenizer.encode("maestre")
print(encodeAt)
print(tokenizer.decode(encodeAt))

[4, 8062, 1297, 5]
[CLS] maestre [SEP]


In [0]:
encodeAt = tokenizer.encode("soldades")
print(encodeAt)
print(tokenizer.decode(encodeAt))

[4, 1505, 1356, 5]
[CLS] soldades [SEP]
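
To illustrate why these encodings come out the way they do, here is a toy WordPiece-style tokenizer: punctuation such as "@" is split off first, then each chunk is matched greedily against the vocabulary, longest piece first. TOY_VOCAB and the piece boundaries it produces (maest + ##re, sold + ##ades) are hypothetical; the real BETO vocabulary may split these words differently.

```python
import re

# Hypothetical mini-vocabulary; NOT the real BETO vocabulary.
TOY_VOCAB = {"amig", "amigos", "s", "maest", "##re", "sold", "##ades"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word; [UNK] if any part fails."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split punctuation into its own chunk, then apply wordpiece per chunk."""
    chunks = re.findall(r"\w+|[^\w\s]", text)
    tokens = []
    for chunk in chunks:
        tokens.extend(wordpiece(chunk, vocab))
    return tokens

for text in ["amig@s", "maestre", "soldades", "amigos"]:
    print(text, toy_tokenize(text))
```

With this toy vocabulary, "amig@s" falls apart into amig, [UNK], s, as in the decoded output above, while "maestre" and "soldades" survive only as two sub-word pieces rather than as words the model knows.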
