<a href="https://colab.research.google.com/github/Bhandari007/sequence_model_course/blob/main/Operation_on_word_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Downloading helper functions and data

In [1]:
!wget https://raw.githubusercontent.com/abdur75648/Deep-Learning-Specialization-Coursera/main/Sequence%20Models/week2/w2a1/generateTestCases.py
!wget https://raw.githubusercontent.com/abdur75648/Deep-Learning-Specialization-Coursera/main/Sequence%20Models/week2/w2a1/w2v_utils.py
!wget https://raw.githubusercontent.com/abdur75648/Deep-Learning-Specialization-Coursera/main/Sequence%20Models/week2/w2a1/data/input.txt -P /data/


In [9]:
#!wget https://github.com/uclnlp/inferbeddings/blob/master/data/glove/glove.6B.50d.txt.gz?raw=true
!gzip -d glove.6B.50d.txt.gz

Word embeddings are computationally expensive to train, so most ML practitioners load a pre-trained set of embeddings. In this notebook, we'll try our hand at loading pre-trained embeddings, measuring the similarity between them, and modifying them.

**Objectives:**

* Explain how word embeddings capture relationships between words
* Load pre-trained word vectors
* Measure similarity between word vectors using cosine similarity
* Use embeddings to solve word analogy problems such as Man is to Woman as King is to __.

# Packages

In [2]:
import numpy as np
from w2v_utils import *

# 1 - Load the Word Vectors

For this notebook, we'll use 50-dimensional GloVe vectors to represent words.

In [10]:
words, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')
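The GloVe text format stores one word per line, followed by its vector components separated by spaces. The sketch below shows one way such a file could be parsed. `load_glove_text` is a hypothetical name and the sample values are made up; the actual `read_glove_vecs` in `w2v_utils.py` may differ in detail. The notebook later subtracts vectors directly, which suggests the helper returns NumPy arrays, whereas this sketch uses plain lists so it runs with the standard library alone.

```python
import io


def load_glove_text(lines):
    """Parse GloVe-style text lines into (vocabulary set, word -> vector dict).

    Hypothetical helper for illustration; not the code in w2v_utils.py.
    """
    words = set()
    word_to_vec = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        words.add(word)
        word_to_vec[word] = [float(x) for x in values]
    return words, word_to_vec


# Two made-up 3-dimensional entries standing in for the real 50-dimensional file.
sample = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.0 0.5 -0.5\n")
vocab, vectors = load_glove_text(sample)
print(sorted(vocab))
print(vectors["cat"])
```

Any iterable of lines works here, so an open file handle can be passed in place of the `io.StringIO` sample.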

# 2 - Embedding Vectors Versus One-Hot Vectors

One-hot vectors don't capture the level of similarity between words, because every one-hot vector is at the same Euclidean distance ($\sqrt{2}$) from every other one-hot vector.
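The equal-distance property can be checked on a toy vocabulary. The four words and the `one_hot` helper below are assumptions made for illustration, not part of the notebook's utilities.

```python
import math
from itertools import combinations


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = ["king", "queen", "apple", "orange"]  # toy vocabulary (assumption)
vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Euclidean distance between every pair of distinct one-hot vectors.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in combinations(vocab, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 4))
```

Every pair comes out at the same distance, so "king" is no closer to "queen" than it is to "apple".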

Embedding vectors, such as GloVe vectors, provide much more useful information about the meaning of individual words.<br>
Now, see how we can use GloVe vectors to measure the similarity between two words!

# 3 - Cosine Similarity

To measure the similarity between two words, we need a way to measure the degree of similarity between the embedding vectors of the two words. Given two vectors *u* and *v*, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta)$$

* The cosine similarity depends on the angle between *u* and *v*.
  * If u and v are very similar, their cosine similarity will be close to 1.
  * If they are dissimilar, this cosine similarity will take a smaller value.
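As a quick numeric check of the formula, the sketch below evaluates it on hand-picked 2-D vectors at known angles. The helper name `cosine` is an assumption, chosen so it does not clash with the exercise function, and it assumes non-zero inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 0.0]
same_direction = cosine(u, [2.0, 0.0])   # angle of 0 degrees
diagonal = cosine(u, [1.0, 1.0])         # angle of 45 degrees
orthogonal = cosine(u, [0.0, 3.0])       # angle of 90 degrees
opposite = cosine(u, [-1.0, 0.0])        # angle of 180 degrees
print(same_direction, round(diagonal, 4), orthogonal, opposite)
```

The values fall from 1 through about 0.7071 and 0 down to -1 as the angle grows, and the vector lengths (2.0 and 3.0 above) have no effect on the result.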


### Exercise 1 - cosine_similarity
Implement the function `cosine_similarity()`.

In [12]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
        
    Arguments:
        u -- a word vector of shape (n,)          
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    # Special case: identical vectors, including u = v = [0, 0], have similarity 1
    if np.all(u == v):
        return 1

    dot_product = np.dot(u, v)
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)

    # Avoid division by zero: check the norms before dividing
    if np.isclose(norm_u * norm_v, 0, atol=1e-32):
        return 0

    return dot_product / (norm_u * norm_v)


In [13]:
# START SKIP FOR GRADING
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))
# END SKIP FOR GRADING

# PUBLIC TESTS
def cosine_similarity_test(target):
    a = np.random.uniform(-10, 10, 10)
    b = np.random.uniform(-10, 10, 10)
    c = np.random.uniform(-1, 1, 23)
        
    # Call the function under test through `target` rather than the global name
    assert np.isclose(target(a, a), 1), "cosine_similarity(a, a) must be 1"
    assert np.isclose(target((c >= 0) * 1, (c < 0) * 1), 0), "cosine_similarity(a, not(a)) must be 0"
    assert np.isclose(target(a, -a), -1), "cosine_similarity(a, -a) must be -1"
    assert np.isclose(target(a, b), target(a * 2, b * 4)), "cosine_similarity must be scale-independent. You must divide by the product of the norms of each input"

    print("\033[92mAll test passed!")
    
cosine_similarity_test(cosine_similarity)

cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(ball, crocodile) =  0.2743924626137942
cosine_similarity(france - paris, rome - italy) =  -0.6751479308174202
All test passed!


# 4 - Word Analogy Task

* In the word analogy task, complete this sentence:
  "*a* is to *b* as *c* is to **____**".

* An example is:
  "*man* is to *woman* as *king* is to *queen*".

* We're trying to find a word *d* such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner:
  $e_b - e_a \approx e_d - e_c$

* Measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.
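
Put as a formula, and assuming the search runs over every vocabulary word $w$ other than the query word $c$, the answer is the word whose offset from $e_c$ aligns best with the offset $e_b - e_a$:

$$d = \arg\max_{w \neq c} \; \text{CosineSimilarity}(e_b - e_a, \; e_w - e_c)$$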

### Exercise 2 - complete_analogy


In [14]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____. 
    
    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors. 
    
    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # Convert to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings e_a, e_b, e_c
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    words = word_to_vec_map.keys()
    max_cosine_sim = -100  # lower than any possible cosine similarity
    best_word = None

    for w in words:
        # The answer must not be the query word itself
        if w == word_c:
            continue

        # Compare the offset b - a with the offset w - c
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    
    return best_word

In [15]:
# PUBLIC TEST
def complete_analogy_test(target):
    a = [3, 3] # Center at a
    a_nw = [2, 4] # North-West oriented vector from a
    a_s = [3, 2] # South oriented vector from a
    
    c = [-2, 1] # Center at c
    # Create a controlled word to vec map
    word_to_vec_map = {'a': a,
                       'synonym_of_a': a,
                       'a_nw': a_nw, 
                       'a_s': a_s, 
                       'c': c, 
                       'c_n': [-2, 2], # N
                       'c_ne': [-1, 2], # NE
                       'c_e': [-1, 1], # E
                       'c_se': [-1, 0], # SE
                       'c_s': [-2, 0], # S
                       'c_sw': [-3, 0], # SW
                       'c_w': [-3, 1], # W
                       'c_nw': [-3, 2] # NW
                      }
    
    # Convert lists to np.arrays
    for key in word_to_vec_map.keys():
        word_to_vec_map[key] = np.array(word_to_vec_map[key])
            
    assert(target('a', 'a_nw', 'c', word_to_vec_map) == 'c_nw')
    assert(target('a', 'a_s', 'c', word_to_vec_map) == 'c_s')
    assert(target('a', 'synonym_of_a', 'c', word_to_vec_map) != 'c'), "Best word cannot be input query"
    assert(target('a', 'c', 'a', word_to_vec_map) == 'c')

    print("\033[92mAll tests passed")
    
complete_analogy_test(complete_analogy)

All tests passed




In [16]:
# START SKIP FOR GRADING
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print ('{} -> {} :: {} -> {}'.format( *triad, complete_analogy(*triad, word_to_vec_map)))

# END SKIP FOR GRADING

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> smaller


In [17]:
print(complete_analogy('man','woman','father',word_to_vec_map))

mother
