# Week 4 - Exploring Word Embeddings with GloVe

### 1. Introduction & Objectives

In this notebook, we’ll explore GloVe word embeddings and use them to perform some simple yet powerful word vector operations. Our main objectives are:

- Learn how to use pre-trained GloVe embeddings for natural language processing tasks.
- Get a better understanding of how word vectors capture relationships between words.

We’ll start by loading the GloVe embeddings from a local file and extracting the word vectors for three words: **"man"**, **"woman"**, and **"king"**. Then, we’ll calculate the result of the operation:

`vec("woman") - vec("man") + vec("king")`

Finally, we’ll find the word closest to this result using cosine similarity. This exercise demonstrates how pre-trained embeddings can represent word relationships in a meaningful way, making it easier to analyze and interpret language.

### 2. Data Understanding

In this notebook, we’ll work with the **GloVe 6B 100d embeddings**, a set of pre-trained word vectors provided by the Stanford NLP Group. These embeddings were trained on a corpus of about 6 billion tokens drawn from Wikipedia 2014 and Gigaword 5. Each word is represented as a dense vector with 100 dimensions that captures aspects of its meaning and its relationships with other words.

The GloVe 6B embeddings cover a vocabulary of 400,000 words, which makes them suitable for a wide range of natural language processing tasks. The vocabulary is uncased, so lookups should use lowercase words. For this assignment, we’ll specifically use the `glove.6B.100d.txt` file, which contains the 100-dimensional embeddings.


#### 2.1 Importing Libraries and Loading GloVe Embeddings

We begin by importing the necessary libraries and loading the GloVe embeddings from the `glove.6B.100d.txt` file. The embeddings will be stored in a dictionary, with words as keys and their corresponding vectors as values.

In [1]:
# Importing Libraries
import numpy as np

Next, we will load the GloVe embeddings from the `glove.6B.100d.txt` file. The function `load_glove_embeddings` reads the file line by line, extracts the word and its vector, and stores them in a dictionary.

In [2]:
# Load the GloVe embeddings
def load_glove_embeddings(path):
    embeddings = {}

    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector

    return embeddings


# File path for the GloVe embeddings
file_path = '../../Inputs/glove.6B.100d.txt'

# Load the GloVe embeddings
glove_embeddings = load_glove_embeddings(file_path)

The GloVe embeddings have been successfully loaded into the `glove_embeddings` dictionary. We can now proceed to extract the word vectors for the specified words.
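
To make the per-line parsing concrete, here is a minimal sketch that applies the same logic to two made-up lines held in memory. The sample text, the 3-dimensional vectors, and the helper name `parse_glove_lines` are assumptions for illustration and are not part of the GloVe files.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its
# space-separated vector components (3 dimensions here instead of 100).
sample_text = "hello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n"


def parse_glove_lines(lines):
    """Parse an iterable of GloVe-style lines into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        values = line.split()
        # First field is the word, the remaining fields are the vector
        embeddings[values[0]] = np.array(values[1:], dtype='float32')
    return embeddings


toy_embeddings = parse_glove_lines(io.StringIO(sample_text))
print(len(toy_embeddings), toy_embeddings['hello'].shape)
```

The same kind of check (number of entries and vector shape) could be run on `glove_embeddings` to confirm that the file loaded as expected.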

### 3. Extracting Word Vectors

Let's now extract the word vectors for the words **woman**, **man**, and **king**. We'll use these vectors to perform the necessary operations and analyze the relationships between the words.

In [3]:
# Extract word vectors
woman_vect = glove_embeddings['woman']
man_vect = glove_embeddings['man']
king_vect = glove_embeddings['king']

The word vectors have been extracted. Next, we'll calculate the result of the operation `vec("woman") - vec("man") + vec("king")`.

In [4]:
# Calculate the result of the operation
result_vect = woman_vect - man_vect + king_vect

We have successfully calculated the resulting vector. Next, we'll find the word closest to this result using cosine similarity.

### 4. Finding the Closest Word

To find the word closest to the result vector, we'll calculate the cosine similarity between the result vector and each word vector in the GloVe embeddings. The word with the highest cosine similarity will be considered the closest to the result.
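
For reference, cosine similarity has the standard definition below. The symbols $u$ and $v$ are introduced here for illustration and stand for any two non-zero vectors.

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

The value ranges from -1 to 1 and depends only on the angle between the two vectors, not on their lengths.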

In [5]:
# Calculate cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)

    # Avoid division by zero
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0

    return dot_product / (norm_vec1 * norm_vec2)


# Find the closest word
def find_closest_word(result_vec, embeddings):
    closest_word = None
    max_similarity = -np.inf  # lower than any possible cosine similarity

    for word, vector in embeddings.items():
        similarity = cosine_similarity(result_vec, vector)

        if similarity > max_similarity:
            closest_word = word
            max_similarity = similarity

    return closest_word


# Assuming `result_vect` and `glove_embeddings` are defined
closest_word = find_closest_word(result_vect, glove_embeddings)

The word closest to the result vector has been identified. We can now move on to checking the results.
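
The search above calls `cosine_similarity` once per vocabulary word, which works but can be slow in pure Python. As an optional alternative, here is a sketch that stacks the vectors into a matrix and computes all similarities at once. The function name `top_n_by_cosine` and the toy vectors are hypothetical and used only for illustration.

```python
import numpy as np


def top_n_by_cosine(query_vec, embeddings, n=5):
    """Return the n (word, similarity) pairs with the highest cosine similarity."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words])  # shape: (vocab, dim)

    # Product of norms for every row; zero norms are replaced by infinity
    # so that the corresponding similarity becomes 0 instead of NaN.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    denom[denom == 0] = np.inf

    similarities = (matrix @ query_vec) / denom
    top = np.argsort(-similarities)[:n]
    return [(words[i], float(similarities[i])) for i in top]


# Toy 2-dimensional vectors, made up for illustration only
toy = {
    'a': np.array([1.0, 0.0], dtype='float32'),
    'b': np.array([0.0, 1.0], dtype='float32'),
    'c': np.array([1.0, 1.0], dtype='float32'),
}
ranking = top_n_by_cosine(np.array([1.0, 0.0], dtype='float32'), toy, n=2)
print(ranking)
```

With `n=1` this returns the same kind of answer as `find_closest_word`. It trades extra memory for the stacked matrix against fewer Python-level loop iterations.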

#### 4.1 Finding the Closest 5 Words

Let's find the 5 words closest to the result vector using cosine similarity. This will give us a better understanding of the relationships captured by the word vectors.

In [6]:
# Find the 5 closest words
def find_closest_words(result_vec, embeddings, n=5):
    closest_words = []

    for word, vector in embeddings.items():
        similarity = cosine_similarity(result_vec, vector)
        closest_words.append((word, similarity))

    # Sort the words by similarity
    closest_words.sort(key=lambda x: x[1], reverse=True)

    return closest_words[:n]


# Find the 5 closest words to the result vector
closest_words = find_closest_words(result_vect, glove_embeddings, n=5)

### 5. Findings

Let's start by printing the result vector itself.

In [7]:
# Print the result vector
print(f"Result of the operation: {result_vect}")

Result of the operation: [-0.10231996 -0.81294     0.10211001  0.985924    0.34218282  1.09095
 -0.48913    -0.05616698 -0.21029997 -1.02996    -0.86851     0.36786997
  0.01960999  0.59259    -0.231901   -1.016919   -0.012184   -1.17194
 -0.52329     0.60645    -0.98537004 -1.001028    0.48913902  0.630072
  0.58224     0.15908998  0.43684998 -1.25351     0.97054005 -0.06552899
  0.733763    0.44219002  1.2091839   0.19698    -0.15948     0.34364
 -0.46222997  0.33772     0.14792703 -0.24959499 -0.77093005  0.522717
 -0.12830001 -0.91881    -0.01755    -0.44041002 -0.52656496  0.33734798
  0.60639    -0.45067    -0.04158002  0.08408298  1.31456     0.67737997
 -0.24316001 -2.071      -0.60648996  0.19710997  0.63567     0.07819999
  0.49161002  0.08172001  0.708557    0.201938    0.5155501  -0.23025298
 -0.40473     0.39212003 -0.5093     -0.139153    0.21609999 -0.628671
  0.08894001  0.49167    -0.06637001  0.76095    -0.19442001  0.41131
 -1.04476    -0.14801991 -0.098355   -0.2511

Here we can see the resulting vector obtained by performing the operation.

Next, let's print the word closest to this result vector.

In [8]:
print(f"Closest word to the result vector: {closest_word}")

Closest word to the result vector: king


As observed, the word closest to the result vector is **king**, which is one of the three input words rather than the expected analogy answer, **queen**.

The operation `woman - man + king` aims to approximate the vector for **queen**, yet **king** comes out on top. The likely reasons are:
- The result vector is `king` shifted by the offset `woman - man`. If that offset is small relative to `king`, the result still points in nearly the same direction as `king` itself.
- Our search considers every word in the vocabulary, including the three input words. Analogy evaluations commonly exclude the input words from the candidates, which removes this effect.
- Cosine similarity compares direction rather than magnitude, so a modest shift away from `king` barely lowers its score.
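
A common convention in analogy tasks is to leave the input words out of the candidate set. Here is a minimal sketch of that idea on made-up 3-dimensional vectors. The helper `closest_word_excluding`, its `exclude` parameter, and the toy values are assumptions for illustration; they are not GloVe data.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def closest_word_excluding(target, embeddings, exclude=()):
    """Return the word most similar to target, skipping any word in exclude."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Made-up vectors: the offset woman - man is small compared with king
toy = {
    'man':   np.array([1.0, 0.0, 0.0]),
    'woman': np.array([1.0, 0.3, 0.0]),
    'king':  np.array([0.2, 0.0, 2.0]),
    'queen': np.array([0.2, 1.0, 2.0]),
    'apple': np.array([0.5, 0.1, -0.2]),
}
target = toy['woman'] - toy['man'] + toy['king']

print(closest_word_excluding(target, toy))  # input words allowed
print(closest_word_excluding(target, toy, exclude={'man', 'woman', 'king'}))
```

In this toy setup the unrestricted search returns `king` and the restricted search returns `queen`. The same `exclude` idea could be applied to `glove_embeddings` with the three input words.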

Finally, let's print the 5 words closest to the result vector using cosine similarity to further analyze the embedding space.

In [9]:
# Print the 5 closest words
print("5 closest words to the result vector:")
for word, similarity in closest_words:
    print(f"{word}: {similarity:.4f}")

5 closest words to the result vector:
king: 0.8552
queen: 0.7834
monarch: 0.6934
throne: 0.6833
daughter: 0.6809


The 5 words closest to the result vector have been successfully identified:

- **king**: 0.8552
- **queen**: 0.7834
- **monarch**: 0.6934
- **throne**: 0.6833
- **daughter**: 0.6809

The list is dominated by royalty-related terms, and **queen** is the highest-ranked word that is not one of the three inputs.

#### Why These Words Are Closest
1. **King**:
   - The closest match, **king**, aligns with the resulting vector because the operation starts with "king" and applies the gender transformation captured by `woman - man`. The GloVe embeddings cluster related concepts (royalty and gendered roles), causing "king" to remain very close.

2. **Queen**:
   - As the ideal result, "queen" is the second-closest word. It reflects the success of the analogy transformation, demonstrating how GloVe embeddings encode the relationship between male and female roles within the same domain (royalty).

3. **Monarch**:
   - This word is semantically related to both "king" and "queen," as it is a gender-neutral term for a royal ruler. Its proximity suggests that the result vector stays within the royalty region of the embedding space.

4. **Throne**:
   - The word "throne" is closely associated with royalty and leadership. Its presence among the closest words illustrates how embeddings capture contextual relationships (e.g., objects and concepts related to royalty).

5. **Daughter**:
   - Although slightly less related to royalty directly, "daughter" appears close due to the gender transformation in the operation. It reflects how GloVe embeddings cluster words based on gender-related themes.

Overall, these results showcase the power of word embeddings in capturing both direct semantic relationships (e.g., king and queen) and broader contextual associations (e.g., monarch and throne).

### 6. Conclusions

In this analysis, we explored the power of word embeddings, specifically GloVe, to capture semantic relationships and analogies between words. By performing simple vector arithmetic on word embeddings, we demonstrated how meaningful patterns and relationships can emerge.

#### Key Findings:
1. **Closest Word Analysis**:
   - The closest word to the resulting vector from the operation `woman - man + king` was **king**, with **queen** as the second closest word. Since **king** is itself an input to the operation, **queen** is the best match among the words not used in it, which is the answer the analogy predicts.
   - Other closely related words, such as **monarch** and **throne**, show how embeddings cluster related concepts, such as royalty and leadership, within the same semantic space.

2. **Semantic Representation**:
   - Word embeddings like GloVe effectively encode hierarchical, gendered, and contextual relationships between words. For example, the relationships between "man," "woman," "king," and "queen" reflect both gender and role-specific transformations.

3. **Utility of Word Embeddings**:
   - This exercise demonstrates how word embeddings can be used to uncover insights into language semantics, solve analogies, and build downstream natural language processing models.

#### Reflections:
Because "king" is one of the input words, its top ranking says little about the analogy itself. The more telling result is that "queen" ranks directly behind it, so excluding the input words from the search would make "queen" the top answer here. Higher-dimensional embeddings or alternative models like FastText or contextual embeddings (e.g., BERT) may behave differently and would be worth comparing.

#### Future Directions:
This exploration lays the foundation for leveraging word embeddings in more advanced tasks, such as semantic similarity, text classification, or contextual understanding in natural language processing pipelines. Continued development of embedding models will further enhance their ability to capture language nuances.

In summary, this analysis demonstrates the potential of word embeddings to transform linguistic data into structured, interpretable representations that are invaluable in understanding and processing natural language.