<a href="https://colab.research.google.com/github/Ikhadija-5/AMMI-NLP-LABS/blob/main/Copy_of_intro_to_wordvectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 style="font-family:verdana;font-size:300%;text-align:center;background-color:#f2f2f2;color:#0d0d0d">AMMI NLP - Review sessions</h1>

<h1 style="font-family:verdana;font-size:180%;text-align:Center;color:#993333"> Lab 2: Introduction to word vectors </h1>

**Big thanks to Amr Khalifa, who improved this lab and turned it into a Jupyter Notebook!**

In [71]:
import io, sys
import numpy as np

In [72]:
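def load_vectors(filename):
    # Read a text file of word vectors: a header line "n d", then one word and its floats per line
    data = {}
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        n, d = map(int, fin.readline().split())  # vocabulary size and dimension (not used below)
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = np.asarray([float(x) for x in tokens[1:]])
    return data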

In [73]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [74]:
# Loading word vectors

print('')
print(' ** Word vectors ** ')
print('')

'''
word_vectors is a dictionary that maps words to their numerical word vector
[word (string)] = [np-array] 
'''
word_vectors = load_vectors('wiki.en.vec')

tree_vector = word_vectors['tree']
print(type(tree_vector), len(tree_vector))


 ** Word vectors ** 

<class 'numpy.ndarray'> 300


In [75]:
## This function computes the cosine similarity between vectors u and v

def cosine(u, v):
    '''
    Parameters:
    u : 1-D numpy array
    v : 1-D numpy array 
    
    Returns:
    cos (float) : value of the cosine similarity between vectors u, v 
    '''
    
    ## FILL CODE
    
    dot_prod = np.dot(u,v) #We first compute the dot product between u and v 
    norm_u = np.sqrt(np.sum(u * u))  # Compute the L2 norm of u 
    norm_v = np.sqrt(np.sum(v * v))  # Compute the L2 norm of v 
    cos = dot_prod / (norm_u * norm_v)
    ### END CODE HERE ###
    
    return cos 


In [76]:
# compute similarity between words

print('similarity(apple, apples) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['apples']))
print('similarity(apple, banana) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['banana']))
print('similarity(apple, tiger) = %.3f' %
      cosine(word_vectors['apple'], word_vectors['tiger']))

similarity(apple, apples) = 0.637
similarity(apple, banana) = 0.431
similarity(apple, tiger) = 0.212


In [77]:
## Functions for nearest neighbor
## This function returns the word whose vector is the
## nearest neighbor of the vector x
## The list exclude_words can be used to exclude some
## words from the nearest neighbor search

def nearest_neighbor(x, word_vectors, exclude_words=None):
    '''
    Parameters:
    x (np-array): query vector whose nearest neighbour is sought
    word_vectors (Python dict): {word (string): np-array of word vector}
    exclude_words (list of strings): words to be excluded from the search
    
    Returns:
    best_word (string) : the word whose word vector is the nearest neighbour 
    of x
    '''
    # Avoid a mutable default argument; a set makes the membership test cheap
    exclude = set(exclude_words or [])
    best_score = -1.0
    best_word = None

    ## FILL CODE
    for word, w_vec in word_vectors.items():
        if word in exclude:
            continue
        score = cosine(x, w_vec)
        if score >= best_score:
            best_word = word
            best_score = score

    return best_word

In [78]:
print('')

print('The nearest neighbor of shoe is: ' +
      nearest_neighbor(word_vectors['shoe'], word_vectors, exclude_words=['shoe', 'shoes']))


The nearest neighbor of shoe is: boots


#### Hint (using Python priority queues with the heapq module): 
If you don't want to store all the words and scores, you can use a priority queue and keep only the best K elements seen so far. 
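
A minimal sketch of this hint, under a few assumptions: the helper name `knn_heap` and the toy 2-D vectors are made up for illustration, and `cosine` is redefined here only so that the snippet runs on its own. A min-heap of size k keeps the weakest of the current top k at position 0, so each new word needs a single comparison.

```python
import heapq
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D arrays.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def knn_heap(x, word_vectors, k):
    # Hypothetical variant of knn: stores at most k (score, word) pairs.
    x_vec = word_vectors[x]
    heap = []
    for word, vec in word_vectors.items():
        score = cosine(x_vec, vec)
        if len(heap) < k:
            heapq.heappush(heap, (score, word))
        elif score > heap[0][0]:
            # The new score beats the weakest of the current top k: swap it in.
            heapq.heapreplace(heap, (score, word))
    # Return the kept pairs from most to least similar.
    return sorted(heap, reverse=True)

# Toy 2-D vectors, invented for this example only.
toy_vectors = {
    'shoe': np.array([1.0, 0.0]),
    'shoes': np.array([0.9, 0.1]),
    'boots': np.array([0.7, 0.3]),
    'tiger': np.array([0.0, 1.0]),
}
top2 = knn_heap('shoe', toy_vectors, 2)
print([word for _, word in top2])
```

Compared with sorting every score, this keeps memory at O(k) and costs O(n log k) time, which matters when the vocabulary holds millions of words.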

In [79]:

## This function returns the words corresponding to the
## K nearest neighbors of the vector of word x.
## You can use the functions heappush and heappop.

def knn(x, word_vectors, k):
    '''
    Parameters:
    x (string): word whose nearest neighbours are sought
    word_vectors (Python dict): {word (string): np-array of word vector}
    k (int): number of nearest neighbours to be found
    
    Returns: 
    k_nearest_neighbors (list of tuples): [(score, word), (score, word), ....]
    '''

    neighbors = []
    x_vec = word_vectors[x]
    ## FILL CODE
    # Score every word in the vocabulary against x (x itself is included)
    for word, vec in word_vectors.items():
        neighbors.append((cosine(x_vec, vec), word))

    # Sort by decreasing similarity and keep the k best
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    k_nearest_neighbors = neighbors[:k]

    return k_nearest_neighbors

In [80]:
knn_shoe = knn('shoe', word_vectors, 5)
print('')
print('shoe')
print('--------------')
# Reuse the result computed above instead of running the search a second time
for score, word in knn_shoe:
    print(word + '\t%.3f' % score)



shoe
--------------
shoe	1.000
shoes	0.769
boots	0.619
clothing	0.577
leather	0.573


#### Hint: 
To solve an analogy of the form "a is to b as c is to d", we look for the nearest neighbour of the word vector $d$ defined by
$$ d = \frac{c}{\Vert {c} \Vert} + \frac{b}{\Vert {b} \Vert} - \frac{a}{\Vert {a} \Vert}$$
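
As a small worked example of this formula, the sketch below uses made-up 3-D vectors (not real embeddings) chosen so that the offset from `paris` to `france` matches the offset from `rome` to `italy`. The helper `unit` and the dictionary `toy` are illustrative names only.

```python
import numpy as np

def unit(v):
    # Scale a vector to unit L2 norm.
    return v / np.linalg.norm(v)

# Invented vectors: the second coordinate plays the role of "is a country".
toy = {
    'paris': np.array([1.0, 0.0, 0.0]),
    'france': np.array([1.0, 1.0, 0.0]),
    'rome': np.array([0.0, 0.0, 1.0]),
    'italy': np.array([0.0, 1.0, 1.0]),
}

a, b, c = 'paris', 'france', 'rome'
# d = c/||c|| + b/||b|| - a/||a||
d = unit(toy[c]) + unit(toy[b]) - unit(toy[a])

# Cosine similarity between d and every word in the toy vocabulary.
scores = {w: float(np.dot(unit(d), unit(v))) for w, v in toy.items()}
best = max(scores, key=scores.get)
print(best)
```

Normalizing each vector first means every word contributes a direction only, so a word with a large norm cannot dominate the sum.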


In [81]:
## This function returns the word d such that a:b and c:d
## satisfy the same relation

def analogy(a, b, c, word_vectors):
    '''
    Parameters:
    a (string): word a
    b (string): word b
    c (string): word c
    word_vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    the word d (string) associated with c such that c:d is similar to a:b 
    
    '''

    ## FILL CODE
    # Normalize each vector, then combine: d = c/||c|| + b/||b|| - a/||a||
    vec_a, vec_b, vec_c = word_vectors[a], word_vectors[b], word_vectors[c]
    d = (vec_c / np.linalg.norm(vec_c)
         + vec_b / np.linalg.norm(vec_b)
         - vec_a / np.linalg.norm(vec_a))

    return nearest_neighbor(d, word_vectors, exclude_words=[])


In [44]:
# Word analogies

print('')
print('france - paris + rome = ' + analogy('paris', 'france', 'rome', word_vectors))


france - paris + rome = italy


## A word about biases in word vectors

In [82]:
## A word about biases in word vectors:

print('')
print('similarity(genius, man) = %.3f' %
      cosine(word_vectors['man'], word_vectors['genius']))
print('similarity(genius, woman) = %.3f' %
      cosine(word_vectors['woman'], word_vectors['genius']))


similarity(genius, man) = 0.445
similarity(genius, woman) = 0.325


In [83]:
## Compute the association strength between:
##   - a word w
##   - two sets of attributes A and B

def association_strength(w, A, B, vectors):
    '''
    Parameters:
    w (string): word w
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    strength (float): the value of the association strength 
    '''
    
    strength = 0.0
    part_a = 0.0
    part_b = 0.0 

    ## FILL CODE

    for a in A:
      part_a += (1/len(A)) * cosine(vectors[w],vectors[a])
    for b in B:
      part_b += (1/len(B)) * cosine(vectors[w],vectors[b])
    strength = part_a - part_b
    
    return strength

In [84]:
## Perform the word embedding association test between:
##   - two sets of words X and Y
##   - two sets of attributes A and B

def weat(X, Y, A, B, vectors):
    '''
    Parameters:
    X (list of strings): The words belonging to set X
    Y (list of strings): The words belonging to set Y
    A (list of strings): The words belonging to set A
    B (list of strings): The words belonging to set B
    vectors (Python dict): {word (string): np-array of word vector}
    
    Returns: 
    score (float): the value of the group association strength  
    '''
    
    score = 0.0 
    
    ## FILL CODE
    part_one = []
    part_two = []
    # Use the `vectors` argument rather than the global `word_vectors`
    for x in X:
       part_one.append(association_strength(x,A,B,vectors))
    for y in Y:
       part_two.append(association_strength(y,A,B,vectors))
    score = np.sum(part_one) - np.sum(part_two)
    return score
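
For reference, the two functions above compute the following quantities, where $\cos$ is the cosine similarity used throughout this lab, $w$ is a word, $A$ and $B$ are the attribute sets, and $X$ and $Y$ are the target sets (the notation $s(\cdot)$ is introduced here for convenience):

$$ s(w, A, B) = \frac{1}{|A|} \sum_{a \in A} \cos(w, a) - \frac{1}{|B|} \sum_{b \in B} \cos(w, b) $$

$$ s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) $$

A positive score means that the words in $X$ are more strongly associated with $A$ (relative to $B$) than the words in $Y$ are.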

In [85]:
## Replicate one of the experiments from:
##
## Semantics derived automatically from language corpora contain human-like biases
## Caliskan, Bryson, Narayanan (2017)

career = ['executive', 'management', 'professional', 'corporation', 
          'salary', 'office', 'business', 'career']
family = ['home', 'parents', 'children', 'family',
          'cousins', 'marriage', 'wedding', 'relatives']
male = ['john', 'paul', 'mike', 'kevin', 'steve', 'greg', 'jeff', 'bill']
female = ['amy', 'joan', 'lisa', 'sarah', 'diana', 'kate', 'ann', 'donna']

print('')
print('Word embedding association test: %.3f' %
      weat(career, family, male, female, word_vectors))


Word embedding association test: 0.847


## Word translation using word vectors

In the following, we will use word vectors in English and French to translate words from English to French. The idea is to learn a linear function that maps English word vectors to their corresponding French word vectors. To learn this linear mapping, we will use a small bilingual lexicon, which contains pairs of English and French words that are translations of each other.

The following function will load the small English-French bilingual lexicon:

In [86]:
def load_lexicon(filename):
    '''
    Parameters:
    filename(string): the path of the lexicon
    
    Returns:
    data(list of pairs of string): the bilingual lexicon
    '''
    data = []
    # Each line holds one English word and one French word separated by a space
    with io.open(filename, 'r', encoding='utf-8', newline='\n') as fin:
        for line in fin:
            a, b = line.rstrip().split(' ')
            data.append((a, b))
    return data

In [93]:
word_vectors_en = load_vectors('/content/drive/MyDrive/session2/wiki.en.vec')
word_vectors_fr = load_vectors('/content/drive/MyDrive/session2/wiki.fr.vec')
lexicon = load_lexicon("/content/lexicon-en-fr.txt")
print(lexicon[:5])

[('the', 'le'), ('the', 'les'), ('the', 'la'), ('and', 'et'), ('was', 'fut')]


In [94]:
# We split the lexicon into a train and validation set
train = lexicon[:5000]
valid = lexicon[5000:5100]

The following function will learn the mapping from English to French. The idea is to build two matrices $X_{\text{en}}$ and $X_{\text{fr}}$, whose rows are the vectors of the paired English and French words in the lexicon, and to find a mapping $M$ that minimizes $||X_{\text{en}} M - X_{\text{fr}} ||_2$. In numpy, this mapping can be obtained by using the `numpy.linalg.lstsq` function.
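
To see how `numpy.linalg.lstsq` behaves before applying it to real embeddings, here is a small sketch on synthetic data. The matrices `X_en`, `X_fr` and the ground-truth mapping `M_true` are random stand-ins invented for this example, with the same row layout as above: one word pair per row.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 50, 4

# Synthetic "English" vectors and a known linear map to the "French" space.
X_en = rng.normal(size=(n_pairs, dim))
M_true = rng.normal(size=(dim, dim))
X_fr = X_en @ M_true

# lstsq returns (solution, residuals, rank, singular values); we keep the solution.
M, residuals, rank, singular_values = np.linalg.lstsq(X_en, X_fr, rcond=None)

print(M.shape)                         # one row and one column per embedding dimension
print(np.allclose(X_en @ M, X_fr))     # the fitted map reproduces the targets
```

With real word vectors the fit is not exact, so `X_en @ M` only approximates `X_fr`, and the quality of the mapping has to be checked on held-out pairs.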

In [132]:
def align(word_vectors_en, word_vectors_fr, lexicon):
    '''
    Parameters:
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    lexicon(list of pairs of string): bilingual training lexicon
    
    Returns
    mapping(np.array): the mapping from English to French vectors
    '''
    ## FILL CODE
    # Stack the vectors of each (English, French) pair as matching rows
    x_en = np.array([word_vectors_en[en] for en, fr in lexicon])
    x_fr = np.array([word_vectors_fr[fr] for en, fr in lexicon])

    # Least-squares solution M of x_en @ M ~ x_fr
    M = np.linalg.lstsq(x_en, x_fr, rcond=None)[0]

    return M

In [133]:
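# The mapping is fitted on the full lexicon, which also contains the validation pairs;
# pass `train` instead to keep the validation set held out.
mapping = align(word_vectors_en, word_vectors_fr, lexicon)
type(mapping)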

numpy.ndarray

Given a mapping, a set of English word vectors and a set of French word vectors, the next function will translate an English word to French. To do so, we apply the mapping to the vector of the English word and retrieve the nearest neighbor of the resulting vector among the French word vectors. The translation is then the corresponding French word.
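
The two steps (map, then search) can be sketched on toy data as follows. Everything here is invented for illustration: the 2-D vectors, the three-word French vocabulary, and a `mapping` that simply swaps the two coordinates. The search is written as one matrix product over normalized rows, which is an alternative to looping over a dictionary and not the approach the lab asks for.

```python
import numpy as np

# Toy French vocabulary, one vector per row.
fr_words = ['monde', 'machine', 'chat']
fr_matrix = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.7, 0.7]])
# Normalize rows so that a dot product equals a cosine similarity.
fr_unit = fr_matrix / np.linalg.norm(fr_matrix, axis=1, keepdims=True)

mapping = np.array([[0.0, 1.0],
                    [1.0, 0.0]])   # toy mapping: swaps the two coordinates
en_vec = np.array([0.1, 2.0])      # toy English word vector

# Step 1: apply the mapping to the English vector (row vector times matrix).
mapped = en_vec @ mapping
# Step 2: cosine similarity against every French vector at once.
scores = fr_unit @ (mapped / np.linalg.norm(mapped))
translation = fr_words[int(np.argmax(scores))]
print(translation)
```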

In [136]:
def translate(word, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    word(string): an English word
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(np.array): the mapping from English to French vectors
    
    Returns
    A string containing the translation of the English word
    '''
    ## FILL CODE
    # Map the English vector into the French space
    result = word_vectors_en[word] @ mapping

    # The translation is the French word closest to the mapped vector
    return nearest_neighbor(result,word_vectors_fr,exclude_words=[])


In [137]:
print(translate("world", word_vectors_en, word_vectors_fr, mapping))
print(translate("machine", word_vectors_en, word_vectors_fr, mapping))
print(translate("learning", word_vectors_en, word_vectors_fr, mapping))

monde
machine
apprentissage


Finally, let's implement a function to evaluate this method on the validation lexicon:

In [142]:
def evaluate(valid, word_vectors_en, word_vectors_fr, mapping):
    '''
    Parameters:
    valid(a list of pairs of string): the validation lexicon
    word_vectors_en(dict: string -> np.array): English word vectors
    word_vectors_fr(dict: string -> np.array): French word vectors
    mapping(np.array): the mapping from English to French vectors
    
    Returns
    Accuracy(float): the accuracy on the validation lexicon
    '''
    acc, n = 0.0, 0

    ## FILL CODE
    # Count a hit when the predicted translation matches the lexicon entry exactly
    for en_word, fr_word in valid:
        if translate(en_word, word_vectors_en, word_vectors_fr, mapping) == fr_word:
            acc += 1
        n += 1

    return acc / n

In [143]:
evaluate(valid, word_vectors_en, word_vectors_fr, mapping)

0.64