## Evaluation with Standard Benchmarks: Coherence
### Using evaluation tool for word embeddings

Here, we apply standard benchmarks on coherence on w2v and debiased w2v.

Sources:

#### RG: H. Rubenstein and J. B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.

####  WS: L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept  revisited. In WWW. ACM, 2001.

####  Wordsim benchmarks - Code adapted from source - embedding-evaluation: https://github.com/k-kawakami/embedding-evaluation


In [1]:
# Subset of GoogleNews-vectors:
# https://drive.google.com/file/d/1NH6jcrg8SXbnhpIXRIXF_-KUE7wGxGaG/view?usp=sharing

# For full embeddings:
# Download embeddings at https://github.com/tolga-b/debiaswe and put them on the following directory
# embeddings/GoogleNews-vectors-negative300-hard-debiased.bin
# embeddings/GoogleNews-vectors-negative300.bin

In [2]:
from __future__ import print_function, division
%matplotlib inline
from matplotlib import pyplot as plt
import json
import random
import numpy as np
import os

import debiaswe as dwe
import debiaswe.we as we
from debiaswe.we import WordEmbedding
from debiaswe.data import load_professions

from debiaswe.benchmarks import Benchmark

# Small w2vNEWS set

## 1: original word embeddings on RG & WS

In [3]:
# Load google news word2vec
E = WordEmbedding('./embeddings/w2v_gnews_small.txt')
# Evaluate
benchmark = Benchmark()
result_original = benchmark.evaluate(E, "'Before', small dataset")

*** Reading data from ./embeddings/w2v_gnews_small.txt
(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
P

## 2: Debiased word embeddings on RG & WS


### Step 2a: Hard debiased

In [4]:
from debiaswe.debias import hard_debias

# Path for hard_debiased embedding file 
hard_embedding_file = './embeddings/w2v_gnews_small_hard_debiased.txt' 

In [5]:
if os.path.exists(hard_embedding_file):
    E_hard = WordEmbedding(hard_embedding_file)

else:
    with open('./data/definitional_pairs.json', "r") as f:
        defs = json.load(f)

    with open('./data/equalize_pairs.json', "r") as f:
        equalize_pairs = json.load(f)

    with open('./data/gender_specific_seed.json', "r") as f:
        gender_specific_words = json.load(f)
        
    E_hard = WordEmbedding('./embeddings/w2v_gnews_small.txt')        
    hard_debias(E_hard, gender_specific_words, defs, equalize_pairs)

*** Reading data from ./embeddings/w2v_gnews_small.txt
(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
{('GENTLEMEN', 'LADIES'), ('HE', 'SHE'), ('MAN', 'WOMAN'), ('uncle', 'aunt'), ('Prince', 'Princess'), ('dudes', 'gals'), ('nephew', 'niece'), ('Grandfather', 'Grandmother'), ('Men', 'Women'), ('Father', 'Mother'), ('COLT', 'FILLY'), ('COUNCILMAN', 'COUNCILWOMAN'), ('Males', 'Females'), ('Brother', 'Sister'), ('brother', 'sister'), ('NEPHEW', 'NIECE'), ('father', 'mother'), ('Sons', 'Daughters'), ('wives', 'husbands'), ('CONGRESSMAN', 'CONGRESSWOMAN'), ('Son', 'Daughter'), ('Congressman', 'Congresswoman'), ('himself', 'herself'), ('MEN', 'WOMEN'), ('CATHOLIC_PRIEST', 'NUN'), ('grandfather', 'grandmother'), ('HIMSELF', 'HERSELF'), ('TESTOSTERONE', 'ESTROGEN'), ('Boys', 'Girls'), ('Fella', 'Granny'), ('Schoolboy', 'Schoolgirl'), ('GENTLEMAN', 'LADY'), ('Chairma

In [6]:
# Evaluate for hard-debiased
result_hard_debiased = benchmark.evaluate(E_hard, "'Hard-debiased', small dataset")

Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
Processing batch 34 of 40
Processing batch 35 of 40
Processing batch 36 of 40
Processing batch 37 of 40
Processing batch 38 of 40
Processing batch 39 o

### Step 2b: Soft debiased


In [7]:
from debiaswe.debias import soft_debias

# Path for soft_debiased embedding file 
soft_embedding_file = './embeddings/w2v_gnews_small_soft_debiased.txt' 

In [8]:
if os.path.exists(hard_embedding_file):
    E_soft = WordEmbedding(soft_embedding_file)
else:
    E_soft = WordEmbedding('./embeddings/w2v_gnews_small.txt')
    soft_debias(E_soft, gender_specific_words, defs, log=False)

*** Reading data from ./embeddings/w2v_gnews_small.txt
(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine


In [9]:
# Evaluate for soft-debiased
result_soft_debiased = benchmark.evaluate(E_soft, "'Soft-debiased', small dataset")

Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
Processing batch 34 of 40
Processing batch 35 of 40
Processing batch 36 of 40
Processing batch 37 of 40
Processing batch 38 of 40
Processing batch 39 o

In [10]:
benchmark.pprint_compare([result_original, result_hard_debiased, result_soft_debiased], ["Before", "Hard-debiased", "Soft-debiased"], "small")

+----------------------------------------------------------------------------+
|                         Results for small dataset                          |
+---------------+-------------------+-------------------+--------------------+
|     Score     |      EN-RG-65     |   EN-WS-353-ALL   |    MSR-analogy     |
+---------------+-------------------+-------------------+--------------------+
|     Before    | 77.66555804950227 | 68.82719646959825 | 46.79681576952237  |
| Hard-debiased | 77.49622028082247 | 68.52623098234018 | 46.967399545109934 |
| Soft-debiased | 77.66555804950227 | 68.82719646959825 | 46.79681576952237  |
+---------------+-------------------+-------------------+--------------------+


# Full W2vNEWS set

## 1: original word embeddings on RG & WS

### Wordsim benchmarks
Code adapted from source 

#### embedding-evaluation: https://github.com/k-kawakami/embedding-evaluation

In [11]:
# Load google news word2vec
E = WordEmbedding('./embeddings/GoogleNews-vectors-negative300.bin')
# Evaluate
benchmark = Benchmark()

*** Reading data from ./embeddings/GoogleNews-vectors-negative300.bin
(3000000, 300)
3000000 words of dimension 300 : </s>, in, for, that, ..., Bim_Skala_Bim, Mezze_Cafe, pulverizes_boulders, snowcapped_Caucasus
3000000 words of dimension 300 : </s>, in, for, that, ..., Bim_Skala_Bim, Mezze_Cafe, pulverizes_boulders, snowcapped_Caucasus


In [12]:
result_original = benchmark.evaluate(E, "'Before', full dataset")

Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
Processing batch 34 of 40
Processing batch 35 of 40
Processing batch 36 of 40
Processing batch 37 of 40
Processing batch 38 of 40
Processing batch 39 o

## 2: Debiased word embeddings on RG & WS


### Step 2a: Hard debiased

In [13]:
from debiaswe.debias import hard_debias

# Path for hard_debiased embedding file 
# hard_embedding_file = './embeddings/GoogleNews-vectors-negative300_hard_debiased.txt' 

In [14]:
if os.path.exists(hard_embedding_file):
    E_hard = WordEmbedding(hard_embedding_file)

else:
    with open('./data/definitional_pairs.json', "r") as f:
        defs = json.load(f)

    with open('./data/equalize_pairs.json', "r") as f:
        equalize_pairs = json.load(f)

    with open('./data/gender_specific_seed.json', "r") as f:
        gender_specific_words = json.load(f)
        
    hard_debias(E_hard, gender_specific_words, defs, equalize_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
{('GENTLEMEN', 'LADIES'), ('HE', 'SHE'), ('MAN', 'WOMAN'), ('uncle', 'aunt'), ('Prince', 'Princess'), ('dudes', 'gals'), ('nephew', 'niece'), ('Grandfather', 'Grandmother'), ('Men', 'Women'), ('Father', 'Mother'), ('COLT', 'FILLY'), ('COUNCILMAN', 'COUNCILWOMAN'), ('Males', 'Females'), ('Brother', 'Sister'), ('brother', 'sister'), ('NEPHEW', 'NIECE'), ('father', 'mother'), ('Sons', 'Daughters'), ('wives', 'husbands'), ('CONGRESSMAN', 'CONGRESSWOMAN'), ('Son', 'Daughter'), ('Congressman', 'Congresswoman'), ('himself', 'herself'), ('MEN', 'WOMEN'), ('CATHOLIC_PRIEST', 'NUN'), ('grandfather', 'grandmother'), ('HIMSELF', 'HERSELF'), ('TESTOSTERONE', 'ESTROGEN'), ('Boys', 'Girls'), ('Fella', 'Granny'), ('Schoolboy', 'Schoolgirl'), ('GENTLEMAN', 'LADY'), ('Chairman', 'Chairwoman'), ('Fraternity', 'Sorority'), ('WIVES', 'HUSBANDS'), ('Boy', 'Girl'), ('businessman', 'businesswoman'), ('Ex_Girlfriend', 'Ex_Boyfrie

In [15]:
# Evaluate for hard-debiased
result_hard_debiased = benchmark.evaluate(E_hard, "'Hard-debiased', full dataset")

Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
Processing batch 34 of 40
Processing batch 35 of 40
Processing batch 36 of 40
Processing batch 37 of 40
Processing batch 38 of 40
Processing batch 39 o

### Step 2b: Soft debiased

In [16]:
from debiaswe.debias import soft_debias

# Path for soft_debiased embedding file 
# soft_embedding_file = './embeddings/GoogleNews-vectors-negative300_soft_debiased.txt' 

In [17]:
if os.path.exists(hard_embedding_file):
    E_soft = WordEmbedding(soft_embedding_file)
else:
    soft_debias(E_soft, gender_specific_words, defs, log=False)

In [18]:
# Evaluate for soft-debiased
result_soft_debiased = benchmark.evaluate(E_soft, "'Soft-debiased', full dataset")

Processing batch 1 of 40
Processing batch 2 of 40
Processing batch 3 of 40
Processing batch 4 of 40
Processing batch 5 of 40
Processing batch 6 of 40
Processing batch 7 of 40
Processing batch 8 of 40
Processing batch 9 of 40
Processing batch 10 of 40
Processing batch 11 of 40
Processing batch 12 of 40
Processing batch 13 of 40
Processing batch 14 of 40
Processing batch 15 of 40
Processing batch 16 of 40
Processing batch 17 of 40
Processing batch 18 of 40
Processing batch 19 of 40
Processing batch 20 of 40
Processing batch 21 of 40
Processing batch 22 of 40
Processing batch 23 of 40
Processing batch 24 of 40
Processing batch 25 of 40
Processing batch 26 of 40
Processing batch 27 of 40
Processing batch 28 of 40
Processing batch 29 of 40
Processing batch 30 of 40
Processing batch 31 of 40
Processing batch 32 of 40
Processing batch 33 of 40
Processing batch 34 of 40
Processing batch 35 of 40
Processing batch 36 of 40
Processing batch 37 of 40
Processing batch 38 of 40
Processing batch 39 o

In [19]:
benchmark.pprint_compare([result_original, result_hard_debiased, result_soft_debiased], ["Before", "Hard-debiased", "Soft-debiased"], "full")

+----------------------------------------------------------------------------+
|                          Results for full dataset                          |
+---------------+-------------------+-------------------+--------------------+
|     Score     |      EN-RG-65     |   EN-WS-353-ALL   |    MSR-analogy     |
+---------------+-------------------+-------------------+--------------------+
|     Before    | 76.07828603850845 | 70.00166486272194 | 47.16604955853033  |
| Hard-debiased | 77.49622028082247 | 68.51231156178875 | 46.94844579226687  |
| Soft-debiased | 77.02046178786492 | 68.78939386096931 | 46.000758150113725 |
+---------------+-------------------+-------------------+--------------------+
