<a href="https://colab.research.google.com/github/DrAlexSanz/glove/blob/master/Glove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [39]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [40]:
%cd "/content/drive/My Drive/Glove"

/content/drive/My Drive/Glove


In [0]:
import numpy as np
from w2v_utils import *

Read the data: the 50-dimensional GloVe vectors.

In [42]:
# words is the vocabulary (a set of words). word_to_vec_map is a dict from each word to its GloVe vector, analogous to last week's char_to_ix.
words, word_to_vec_map = read_glove_vecs("glove.6B.50d.txt")
print("Everything is read")

Everything is read


One of the key concepts is the cosine similarity: the scalar product of two vectors divided by the product of their norms. Two vectors point in the same direction if their cosine is 1 and in opposite directions if it is -1. At 0 they are orthogonal. It follows directly from the definition of the scalar product.

$\cos \theta = \frac{\overrightarrow{u} \cdot \overrightarrow{v}}{\|\overrightarrow{u}\| \, \|\overrightarrow{v}\|}$
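As a quick sanity check of the formula, the three special cases can be reproduced with hand-picked 2-D vectors. This is a minimal sketch: the helper name `cos_angle` and the vectors are made up for illustration and are not part of `w2v_utils`.

```python
import numpy as np

def cos_angle(u, v):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])  # invented example vector

same_direction = cos_angle(u, 3.0 * u)            # parallel, but three times longer
opposite = cos_angle(u, -u)                       # antiparallel
orthogonal = cos_angle(u, np.array([-2.0, 1.0]))  # dot product is exactly zero

print(same_direction, opposite, orthogonal)
```

Note that `3.0 * u` is not the same vector as `u`, yet the cosine is still 1 up to rounding: the measure only looks at direction, not length.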

In [0]:
def cosine_similarity(u, v):
    """
    This function takes the two vectors u and v and calculates the cosine of the angle between them.
    
    u, v are vectors of shape (n,)
    
    returns cosine which is a scalar   
    
    """
    
    dot = np.dot(u, v)
    
    u_norm = np.linalg.norm(u, ord = 2)
    v_norm = np.linalg.norm(v, ord = 2)
    
    cosine = dot / (u_norm * v_norm)
    
    return cosine

In [44]:
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("Similarity father/mother = ", cosine_similarity(father, mother))
print("Similarity ball/crocodile = ", cosine_similarity(ball, crocodile))
print("Similarity france - paris / italy - rome = ", cosine_similarity(france - paris, italy - rome))




Similarity father/mother =  0.8909038442893615
Similarity ball/crocodile =  0.2743924626137942
Similarity france - paris / italy - rome =  0.6751479308174202


Now I will do the word analogy: given "a is to b as c is to d", check whether $e_b - e_a \approx e_d - e_c$, using the cosine similarity to measure the match. The inputs are the first three words, and I loop through the whole vocabulary to find the word $d$ whose difference $e_d - e_c$ is most similar to $e_b - e_a$.

In [0]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    This function takes 3 words and the mapping to find the 4th word in an analogy
    a is to b as c is to...
    
    output is a word    
    
    """
    
    word_a = word_a.lower() #In case the user didn't do it
    word_b = word_b.lower()
    word_c = word_c.lower()
    
    # Get the vectors
    
    e_a = word_to_vec_map[word_a]
    e_b = word_to_vec_map[word_b]
    e_c = word_to_vec_map[word_c]
    
    words = word_to_vec_map.keys() # This is my list of words
    
    max_cos_similarity = -1e3
    best_word = None
    
    # Go over the whole set, detect if I am choosing the same word
    for w in words:
        if w in [word_a, word_b, word_c]:
            continue
        else:
            cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
            
            if cosine_sim > max_cos_similarity:
                max_cos_similarity = cosine_sim
                best_word = w
                
    return best_word
        

In [46]:
triads = [("italy", "italian", "spain"), ("paris", "france", "Madrid"), ("man", "woman", "boy"), ("small", "smaller", "huge")]

for triad in triads:
    print("{} is to {} as {} is to {}".format(*triad, complete_analogy(*triad, word_to_vec_map)))

italy is to italian as spain is to spanish
paris is to france as Madrid is to spain
man is to woman as boy is to girl
small is to smaller as huge is to revenues


The last one is funny, since "revenues" is clearly not the comparative of "huge". It's also funny that if I change the order of the second one, it finds "aires" instead of "madrid". I guess it comes from Buenos Aires.
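The loop in `complete_analogy` calls `cosine_similarity` once per vocabulary word. If speed matters, the same search can be written with array operations. The following is only a sketch under assumptions: the function name `complete_analogy_vectorized` is hypothetical, and the toy 2-D vocabulary is invented so the example runs without the GloVe file.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Hypothetical vectorized variant: a is to b as c is to ?"""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Skip the three input words, as the loop version does
    candidates = [w for w in word_to_vec_map if w not in (word_a, word_b, word_c)]

    # e_b - e_a, the difference every candidate is compared against
    target = word_to_vec_map[word_b] - word_to_vec_map[word_a]

    # One row per candidate: e_w - e_c
    diffs = np.array([word_to_vec_map[w] for w in candidates]) - word_to_vec_map[word_c]

    # Cosine similarity of every row with the target.
    # A candidate whose vector equals e_c would give 0/0 here; not handled in this sketch.
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target))

    return candidates[int(np.argmax(sims))]

# Invented toy embeddings, not real GloVe values
toy_map = {
    "france": np.array([1.0, 0.0]),
    "paris":  np.array([1.0, 1.0]),
    "italy":  np.array([2.0, 0.0]),
    "rome":   np.array([2.0, 1.0]),
    "ball":   np.array([-1.0, -3.0]),
}

result = complete_analogy_vectorized("france", "paris", "italy", toy_map)
print(result)
```

On the full GloVe vocabulary this builds one large matrix of differences, so it trades memory for speed.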

Now let's do the debiasing. In English this is quite straightforward. In a basis in which one of the axes is gender, father and mother should be mirror images of each other, with equal and opposite components along that axis. Seen another way, a neutral word like doctor should have no component along the gender axis, so it leans toward neither man nor woman. This may not always be the case because of biases present in the corpus. Let's see this.

In [47]:
g = word_to_vec_map["man"] - word_to_vec_map["woman"] # Gender axis: man minus woman, so positive values lean toward "man"

print(g)

[ 0.087144   -0.2182      0.40986     0.03922     0.1032     -0.94165
  0.06042    -0.32988    -0.46144     0.35962    -0.31102     0.86824
 -0.96006    -0.01073    -0.24337    -0.08193     1.02722     0.21122
 -0.695044    0.00222    -0.29106    -0.5053      0.099454   -0.40445
 -0.30181    -0.1355      0.0606      0.07131     0.19245     0.06115
  0.3204     -0.07165     0.13337     0.25068714  0.14293     0.224957
  0.149      -0.048882   -0.12191     0.27362     0.165476    0.20426
 -0.54376     0.271425    0.10245     0.32108    -0.2516      0.33455
  0.04371    -0.01258   ]


Now compute the similarity between male names and g and between female names and g.

In [48]:
names = ["alex", "john", "charlie", "mary", "kelly", "katy"]

for w in names:
    print(w, cosine_similarity(word_to_vec_map[w], g))

alex 0.1289169612113885
john 0.23163356145973724
charlie 0.15980474924204568
mary -0.2442842158649063
kelly 0.040351556806825936
katy -0.2831068659572615


With g = man - woman, male names tend to have a positive similarity and female names a negative one (kelly is close to 0). This makes sense and there is no obvious problem. Let's try other words.

In [49]:
words = ["gun", "lipstick", "truck", "baby", "motorbike", "science", "arts"]

for w in words:
    print(w, cosine_similarity(word_to_vec_map[w], g))

gun 0.1172836185493845
lipstick -0.2769191625638267
truck 0.032396832512384365
baby -0.1925003258097509
motorbike -0.12532685677161912
science 0.06082906540929701
arts -0.008189312385880337


Some words that should be neutral (cosine close to 0, that is, perpendicular to the gender axis) are not. Neutralize them by removing their component along the axis.
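The neutralization step is an orthogonal projection. Written out, with $e$ the word vector and $g$ the gender axis (the superscripts are notation chosen here, not taken from the notebook):

$$e^{bias} = \frac{e \cdot g}{\|g\|^2}\, g, \qquad e^{debiased} = e - e^{bias}$$

Taking the scalar product with $g$ shows that the result is perpendicular to the axis:

$$e^{debiased} \cdot g = e \cdot g - \frac{e \cdot g}{\|g\|^2}\,(g \cdot g) = 0$$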

In [0]:
def neutralize(word, g, word_to_vec_map):
    """
    This function takes a word and an axis (gender in this case), and produces the new coordinates
    
    input: a word, the axis and the mapping.
    
    output: the new vector debiased
    
    """
    
    e = word_to_vec_map[word]
    
    # Projection of a vector on a given axis!
    e_axis_component = np.dot(e, g)/np.dot(g, g) * g
    
    e_debiased = e - e_axis_component
    
    return e_debiased

In [51]:
e = "receptionist"

print("Cosine similarity between " + e + " and g before neutralizing", cosine_similarity(word_to_vec_map[e], g))
e_deb = neutralize(e, g, word_to_vec_map) #This is already a vector!!
print("Cosine similarity between " + e + " and g after neutralizing", cosine_similarity(e_deb, g))

Cosine similarity between receptionist and g before neutralizing -0.3307794175059374
Cosine similarity between receptionist and g after neutralizing 2.6832242276243644e-17
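The value after neutralizing is of order 1e-17 rather than exactly 0 because of floating-point rounding. A small self-contained check of the same projection, using invented 3-D vectors and a hypothetical helper name `remove_component`, might look like this:

```python
import numpy as np

def remove_component(e, axis):
    # Subtract the projection of e onto axis
    return e - np.dot(e, axis) / np.dot(axis, axis) * axis

axis = np.array([1.0, -2.0, 0.5])  # invented stand-in for g
e = np.array([0.3, 0.7, -1.1])     # invented stand-in for a word vector

e_deb = remove_component(e, axis)
residual = np.dot(e_deb, axis)

print(residual)                    # zero up to rounding error
print(np.isclose(residual, 0.0))
```

Comparing with `np.isclose` instead of `== 0` is the usual way to test this kind of result.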


Finally, gender-specific words like actor and actress should be equalized. In theory actor is as masculine as actress is feminine, but in the embedding this may not be the case. The derivation is [here](https://arxiv.org/abs/1607.06520).

In [0]:
def equalize(pair, axis, word_to_vec_map):
    """
    This function takes a pair of words and an axis and equalizes the projection.
    
    input: pair of words, axis g (50 d) and mapping
    
    output: e1 and e2, the two new vectors
    
    """
    
    w1 = pair[0]
    w2 = pair[1]
    
    e_1 = word_to_vec_map[w1]
    e_2 = word_to_vec_map[w2]
    
    mu = 0.5 * (e_1 + e_2)
    
    # Project mu, e_1 and e_2 onto the axis; mu_orth is the part of mu orthogonal to it.
    # Use the axis argument here, not the global g.
    
    mu_B = np.dot(mu, axis) * axis / np.linalg.norm(axis, ord = 2)**2
    mu_orth = mu - mu_B
    
    e_1B = np.dot(e_1, axis) * axis / np.linalg.norm(axis, ord = 2)**2
    e_2B = np.dot(e_2, axis) * axis / np.linalg.norm(axis, ord = 2)**2
    
    e_1_new = e_1B - mu_B + mu_orth
    e_2_new = e_2B - mu_B + mu_orth
    
    return e_1_new, e_2_new
    

In [79]:
print("Cos similarities before equalizing: ")
print("Cos similarity between man and g = ", cosine_similarity(word_to_vec_map["man"], g))
print("Cos similarity between woman and g = ", cosine_similarity(word_to_vec_map["woman"], g))
print("Cos similarities after equalizing: ")

man_2, woman_2 = equalize(("man", "woman"), g, word_to_vec_map)

print("Cos similarity between man and g = ", cosine_similarity(man_2, g))
print("Cos similarity between woman and g = ", cosine_similarity(woman_2, g))

Cos similarities before equalizing: 
Cos similarity between man and g =  0.11711095765336832
Cos similarity between woman and g =  -0.35666618846270376
Cos similarities after equalizing: 
Cos similarity between man and g =  0.24239737572313433
Cos similarity between woman and g =  -0.2423973757231343
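To see what `equalize` does geometrically, the same arithmetic can be run on invented 2-D vectors where the axis is simply the first coordinate. This is only a sketch of the simplified form used in this notebook. As far as I can tell, the version in the linked paper additionally rescales the equalized vectors to unit length, and that step is not reproduced here. The name `equalize_vectors` and all numbers are made up.

```python
import numpy as np

def equalize_vectors(e_1, e_2, axis):
    # Hypothetical helper that works on vectors directly instead of words
    def project(v):
        # Component of v along axis
        return np.dot(v, axis) / np.dot(axis, axis) * axis

    mu = 0.5 * (e_1 + e_2)
    mu_orth = mu - project(mu)  # shared part, orthogonal to the axis

    e_1_new = project(e_1) - project(mu) + mu_orth
    e_2_new = project(e_2) - project(mu) + mu_orth
    return e_1_new, e_2_new

axis = np.array([1.0, 0.0])   # invented bias axis
e_1 = np.array([0.2, 1.0])    # weakly positive along the axis
e_2 = np.array([-0.8, 0.6])   # strongly negative along the axis

new_1, new_2 = equalize_vectors(e_1, e_2, axis)
print(new_1, new_2)
```

After the call both vectors share the same component orthogonal to the axis and have equal and opposite components along it, which is why the two cosines above come out as plus and minus the same number.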
