<a href="https://colab.research.google.com/github/Mohammed-khair/Exploring-word-embeddings/blob/main/word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Exploring word embeddings

In [None]:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans

### Acknowledgements

In this notebook, the GloVe vectors will be used for the embeddings.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. URL: https://nlp.stanford.edu/pubs/glove.pdf


We will define a function that reads the GloVe text file and returns the set of words together with a dictionary that maps each word to its embedding vector.

In [None]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map
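As a minimal sketch of the file layout the function expects, the snippet below writes two made-up 3-dimensional entries to a temporary file and parses them the same way. The helper name `parse_glove_text` and the sample values are hypothetical and exist only for illustration; the real file used later has one 50-dimensional vector per line.

```python
import os
import tempfile

import numpy as np

# Two made-up entries in the GloVe text layout:
# a token followed by its space-separated vector components.
sample = "cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"


def parse_glove_text(path):
    """Hypothetical stand-in that parses the same layout as read_glove_vecs."""
    word_to_vec = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            # First field is the word, the rest are the vector components
            word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float64)
    return word_to_vec


# Write the sample to a temporary file, parse it, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

toy_map = parse_glove_text(tmp_path)
os.remove(tmp_path)

print(sorted(toy_map))
print(toy_map["cat"].shape)
```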

We will use this function to read the text file with the embeddings.

In [None]:
words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

## Cosine similarity

To measure how similar two words are, we compare their embedding vectors. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$
* $\theta$ is the angle between $u$ and $v$.
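* The cosine similarity depends only on the angle between $u$ and $v$, not on their lengths, and ranges from $-1$ to $1$.
    * If $u$ and $v$ are very similar (they point in nearly the same direction), their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value: 0 for orthogonal vectors and $-1$ for vectors pointing in opposite directions.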
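
As a small worked example of equation (1), take the illustrative vectors $u = (1, 2)$ and $v = (2, 1)$ (chosen here only for the arithmetic, not taken from the embeddings):

$$u \cdot v = 1 \cdot 2 + 2 \cdot 1 = 4, \qquad ||u||_2 = ||v||_2 = \sqrt{1^2 + 2^2} = \sqrt{5}$$

$$\text{CosineSimilarity}(u, v) = \frac{4}{\sqrt{5} \cdot \sqrt{5}} = \frac{4}{5} = 0.8$$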


In [None]:
def cosine_similarity(u, v, epsilon=1e-15):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)
        epsilon -- a small value to avoid division by zero

    Returns:
        cosine_similarity -- the cosine similarity between u and v
    """

    # Compute the L2 norm (length) of each vector
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)

    # A zero vector has no direction, so define its similarity as 0
    if norm_u == 0 or norm_v == 0:
        return 0.0

    # Dot product divided by the product of the norms;
    # epsilon keeps the denominator away from zero for very small norms
    similarity = np.dot(u, v) / (norm_u * norm_v + epsilon)

    return similarity
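
A quick sanity check of the three boundary cases can be sketched as follows. The helper `cosine_sim` is a compact stand-in written for this example (it omits the epsilon term), and the vectors are made up.

```python
import numpy as np


def cosine_sim(u, v):
    # Compact stand-in for the notebook's function, without the epsilon term
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


u = np.array([1.0, 2.0])

# Scaling a vector changes its length but not its direction
same_direction = cosine_sim(u, 3.0 * u)

# Perpendicular vectors have a zero dot product
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Negating a vector reverses its direction
opposite = cosine_sim(u, -u)

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1.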

## Word Analogy Task

* In the word analogy task, complete this sentence:  
    <font color='brown'>"*a* is to *b* as *c* is to **____**"</font>.

* An example is:  
    <font color='brown'> '*man* is to *woman* as *king* is to *queen*' </font>.

* You're trying to find a word *d*, such that the associated word vectors $e_a, e_b, e_c, e_d$ are related in the following manner:   
    $e_b - e_a \approx e_d - e_c$
* We will measure the similarity between $e_b - e_a$ and $e_d - e_c$ using cosine similarity.

In [None]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Get the word embeddings e_a, e_b and e_c
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Start below -1, the smallest possible cosine similarity
    best_word = None                   # Tracks the best candidate seen so far

    # Loop over the whole word vector set
    for w in words:
        # Skip word_c so the query word itself is never returned
        if w == word_c:
            continue

        # Compute cosine similarity between (e_b - e_a) and (e_w - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # Keep the candidate with the highest cosine similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word

Let us test the analogies.

In [None]:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> smaller
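
The last result returns `smaller`, which is `word_b` of that query. This is possible because `complete_analogy` skips only `word_c`, so `word_a` and `word_b` remain candidates. One possible variant, sketched below under that assumption, excludes all three query words. The function name `complete_analogy_excluding_inputs`, the helper `cosine_sim`, and the toy 2-dimensional vectors are hypothetical; how this variant behaves on the GloVe vectors is not shown here.

```python
import numpy as np


def cosine_sim(u, v):
    # Returns 0.0 for zero vectors, which have no direction
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


def complete_analogy_excluding_inputs(word_a, word_b, word_c, word_to_vec_map):
    """Hypothetical variant: never returns any of the three query words."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    excluded = {word_a, word_b, word_c}

    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec_map.items():
        if w in excluded:
            continue
        # Compare the offset (e_b - e_a) with the offset (e_w - e_c)
        sim = cosine_sim(e_b - e_a, e_w - e_c)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word


# Made-up vectors: "d" sits exactly one step above "c", as "b" does above "a"
toy_map = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([1.0, 1.0]),
    "c": np.array([3.0, 0.0]),
    "d": np.array([3.0, 1.0]),
    "e": np.array([0.0, 5.0]),
}
result = complete_analogy_excluding_inputs("a", "b", "c", toy_map)
print(result)
```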


## Find the nearest neighbors

In [None]:
def find_nearest_neighbors(word, word_to_vec_map, top_n=5):
    # Get the word vector for the target word
    vec = word_to_vec_map[word.lower()]

    # Materialize the vocabulary and the embedding matrix once, in matching order
    vocab = list(word_to_vec_map.keys())
    embeddings = np.array(list(word_to_vec_map.values()))

    # Calculate Euclidean distances between the target vector and every vector,
    # including the target itself
    distances = euclidean_distances(vec.reshape(1, -1), embeddings)

    # Get the indices of the top_n smallest distances
    nearest_indices = np.argsort(distances)[0][:top_n]

    # Map the indices back to words
    nearest_words = [vocab[i] for i in nearest_indices]

    return nearest_words

Let us find the nearest neighbors of a given word.

In [None]:
words_to_try = ["hello", "food", "cat", "car", "japan"]
for word in words_to_try:
    print('{} -> {}'.format(word, find_nearest_neighbors(word, word_to_vec_map)))

hello -> ['hello', 'goodbye', 'kiss', 'hey', 'wow']
food -> ['food', 'coffee', 'products', 'supplies', 'supply']
cat -> ['cat', 'dog', 'rabbit', 'monkey', 'cats']
car -> ['car', 'truck', 'vehicle', 'cars', 'driving']
japan -> ['japan', 'japanese', 'china', 'tokyo', 'korea']
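
Each list above starts with the query word, because its distance to itself is zero. If only the other words are wanted, one option is to drop the query before ranking. The sketch below assumes that goal; the function name `nearest_neighbors_excluding_query` and the toy 2-dimensional vectors are hypothetical, and it uses plain NumPy for the Euclidean distance.

```python
import numpy as np


def nearest_neighbors_excluding_query(word, word_to_vec_map, top_n=5):
    """Hypothetical variant of find_nearest_neighbors that drops the query word."""
    word = word.lower()
    # Candidate vocabulary without the query word, and the matching matrix
    vocab = [w for w in word_to_vec_map if w != word]
    matrix = np.array([word_to_vec_map[w] for w in vocab])

    # Euclidean distance from the query vector to every candidate
    distances = np.linalg.norm(matrix - word_to_vec_map[word], axis=1)

    # Indices of the top_n smallest distances
    order = np.argsort(distances)[:top_n]
    return [vocab[i] for i in order]


# Made-up vectors: two animals near each other, two vehicles near each other
toy_map = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
    "truck": np.array([0.1, 0.9]),
}
neighbors = nearest_neighbors_excluding_query("cat", toy_map, top_n=2)
print(neighbors)
```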


## Clustering


We will try to partition the entire vocabulary into five clusters using k-means.

In [None]:
def cluster(num_clusters, word_to_vec_map):

    # Get the vocabulary and the matching array of word embeddings
    vocab = list(word_to_vec_map.keys())
    word_embeddings = np.array(list(word_to_vec_map.values()))

    # Perform k-means clustering
    kmeans = KMeans(n_clusters=num_clusters)
    cluster_assignments = kmeans.fit_predict(word_embeddings)

    # Group the words by their assigned cluster
    # (indexing a prebuilt list avoids rebuilding the key list for every word)
    clustered_words = {}
    for word, cluster_id in zip(vocab, cluster_assignments):
        clustered_words.setdefault(cluster_id, []).append(word)

    # Print the words that fall into each cluster
    for cluster_id, words in clustered_words.items():
        print(f"Cluster {cluster_id}: {words}")
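
To see the grouping step on something small enough to read, here is a sketch on six made-up 2-dimensional vectors. The words, the vectors, and the choice to fix `random_state` and `n_init` are assumptions made for this example so that it is reproducible; the notebook's `cluster` function leaves them at their defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D "embeddings": two well-separated groups of three words each
toy_map = {
    "cat": np.array([0.0, 0.1]),
    "dog": np.array([0.1, 0.0]),
    "fox": np.array([0.1, 0.1]),
    "car": np.array([5.0, 5.1]),
    "bus": np.array([5.1, 5.0]),
    "van": np.array([5.1, 5.1]),
}

vocab = list(toy_map)
X = np.array([toy_map[w] for w in vocab])

# random_state and n_init are set here only to make the sketch reproducible
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Group the words by their assigned cluster label
groups = {}
for word, label in zip(vocab, labels):
    groups.setdefault(int(label), []).append(word)

for label, members in sorted(groups.items()):
    print(label, members)
```

The numeric cluster labels are arbitrary; only the grouping of words is meaningful.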

In [None]:
cluster(5, word_to_vec_map)



Cluster 3: ['cents', 'gmt', '.....', '------', '___', '7-6', 'hk', 'pesos', 'edt', 'ounce', '0.5', '0.2', '0.1', 'gallon', '0.3', 'baht', 'cdy', '3/4', 'pence', 'totaled', '5.5', '1.9', '0-0', 'rupiah', 'ringgit', '2.1', '0.4', '225', '201', '2.8', 'rupees', '3.2', '0.6', ',000', '4-6', 'euro1', '3.3', '3.6', '0.7', '3.8', '3-6', 'pct', '3.1', '0.8', '3.4', 'cdp', '2.9', '1-2', 'kph', '3.7', '4.2', 'outnumbered', '6-7', '-0', '0.9', '1500', 'us$', '4.3', '8.5', '.5', '4.6', '4.8', '4.4', '8:30', 'ftse', '5.6', '4.7', '3.9', 'dax', '5.2', '1800', '5.4', '4.1', 'celsius', '3.0', '5.3', '1.0', '5.7', 'clr', 'topix', '2-3', '5.1', '5.8', '4.9', '114', 'strikeouts', 'nz', 'kronor', '12.5', 'euro2', '165', '1-3', '1.25', 'fahrenheit', '€', 'seasonally', '6.3', '122', 'por', '6.2', 'bushel', '9.5', '320', 'industrials', '6.4', '5.9', '6.7', '----', '127', '2-6', '6.6', '146', '1600', '6.8', '185', '(800)', '6.1', '121', '7.2', 'milligrams', '5-7', 'bake', 'rn', '330', 'grafs', '132', 'decline