# Week 4: Word embeddings
Identify the word vectors most similar to the vector computed using the analogy:
vec("woman") - vec("man") + vec("king")

The pretrained GloVe embeddings (100 dimensions) are loaded from a text file in which each line holds a word followed by its vector components. They are stored in a dictionary where the key is the word and the value is the corresponding vector.

In [14]:
import numpy as np

file_path = 'glove.6B.100d.txt'

embeddings_dict = {}
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split()
        word = parts[0]
        vector = np.array(parts[1:], dtype=float)
        embeddings_dict[word] = vector

print(f"Loaded {len(embeddings_dict)} word vectors.")

Loaded 400000 word vectors.


Look up the vectors for "man", "woman" and "king", then combine them into the target vector.

In [15]:
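# Index directly so that a missing word raises a KeyError here
# instead of a TypeError on None in the arithmetic below.
man_vector = embeddings_dict["man"]
woman_vector = embeddings_dict["woman"]
king_vector = embeddings_dict["king"]

# Analogy vector: start at "king", remove "man", add "woman".
target_vector = woman_vector - man_vector + king_vector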

The function below returns the `top_n` words most similar to the target vector, ranked by cosine similarity.
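
For reference, the cosine similarity between two vectors $a$ and $b$ is the cosine of the angle between them. It ignores vector length and ranges from -1 (opposite direction) to 1 (same direction). The symbols $a$ and $b$ are placeholders for the target vector and a word vector:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```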

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

def find_nearest_words(target_vec, embeddings_dict, top_n=10):
    similarities = {}
    for word, vec in embeddings_dict.items():
        sim = cosine_similarity([target_vec], [vec])[0][0]
        similarities[word] = sim

    # Sort the words by similarity score in descending order
    sorted_words = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
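    # Drop the top hit, which is assumed to be an input word (typically "king" itself),
    # then return the next top_n entries.
    return sorted_words[1:top_n+1]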

nearest_words = find_nearest_words(target_vector, embeddings_dict)
print("Most similar words to vec('woman') - vec('man') + vec('king'):")
for word, score in nearest_words:
    print(f"{word}: {score:.4f}")

Most similar words to vec('woman') - vec('man') + vec('king'):
queen: 0.7834
monarch: 0.6934
throne: 0.6833
daughter: 0.6809
prince: 0.6713
princess: 0.6644
mother: 0.6579
elizabeth: 0.6563
father: 0.6392
wife: 0.6352
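
The loop above calls `cosine_similarity` once per word, which is slow for 400,000 vectors. The following is a minimal sketch of a vectorized alternative, assuming all embeddings fit in memory as a single NumPy matrix. The function name `nearest_words_vectorized` and its `exclude` parameter are hypothetical and not part of the notebook. It filters out query words by name rather than dropping the first result, and it is demonstrated on random made-up vectors, not on GloVe.

```python
import numpy as np

def nearest_words_vectorized(target_vec, embeddings, top_n=10, exclude=()):
    """Rank words by cosine similarity to target_vec using one matrix product."""
    words = [w for w in embeddings if w not in exclude]
    matrix = np.stack([embeddings[w] for w in words])  # shape: (n_words, dim)

    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target_vec)
    sims = (matrix @ target_vec) / np.maximum(norms, 1e-12)  # guard against zero vectors

    top = np.argsort(-sims)[:top_n]  # indices of the highest similarities first
    return [(words[i], float(sims[i])) for i in top]

# Random stand-in vectors (hypothetical words), only to show the call.
rng = np.random.default_rng(0)
demo = {w: rng.normal(size=5) for w in ["alpha", "beta", "gamma", "delta"]}

with_query = nearest_words_vectorized(demo["alpha"], demo, top_n=2)
without_query = nearest_words_vectorized(demo["alpha"], demo, top_n=2, exclude={"alpha"})

print(with_query[0][0])                  # the query word matches itself best
print([w for w, _ in without_query])     # the query word is filtered out
```

On the real data the call would look like `nearest_words_vectorized(target_vector, embeddings_dict, exclude={"man", "woman", "king"})`. Excluding by name makes the filtering explicit and does not rely on an input word happening to rank first.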


# Conclusion
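The five nearest words are queen, monarch, throne, daughter, and prince, with queen clearly ahead (cosine similarity 0.7834 versus 0.6934 for monarch). This works because the embeddings are learned from how words co-occur in text, so words used in similar contexts end up with similar vectors, and consistent relationships between words show up as roughly consistent vector offsets. Subtracting vec("man") from vec("king") and adding vec("woman") shifts the king vector along the male-to-female direction while keeping its royal component, which points the result toward words for a female royal figure. Words like monarch and throne rank highly because they also appear in royalty contexts, while daughter, princess, mother, and wife reflect the added female component.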
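
The intuition behind the analogy can be illustrated with hand-made two-dimensional vectors. This is a toy sketch under strong assumptions: the "royalty" and "gender" axes and all values are invented for illustration, and real GloVe dimensions are not individually interpretable in this way.

```python
import numpy as np

# Hypothetical 2-D embeddings: [royalty, gender], with gender +1 for male and -1 for female.
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# woman - man isolates the gender direction and cancels everything else.
gender_offset = toy_vectors["woman"] - toy_vectors["man"]

# Adding that offset to king flips the gender component and keeps royalty.
toy_target = toy_vectors["king"] + gender_offset

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for word, vec in toy_vectors.items():
    print(f"{word}: {cosine(toy_target, vec):.4f}")
```

In this toy setup the target lands exactly on the queen vector, so queen gets the highest similarity. With real embeddings the offsets are only approximately consistent, which is why queen scores 0.7834 rather than 1.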