# Analyzing Word Similarity with Pre-trained GloVe Embeddings

This notebook demonstrates how to use pre-trained GloVe embeddings to analyze word similarity and visualize word relationships.


## 1. Introduction to GloVe

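GloVe (Global Vectors for Word Representation) is an unsupervised algorithm that learns word vectors from the global word-word co-occurrence statistics of a corpus. The pre-trained vectors it produces place semantically related words close together, which is the property this notebook relies on.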


In [2]:
# Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

## 2. Load Pre-trained GloVe Embeddings

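We will load the GloVe embeddings from a plain-text file in which each line holds a word followed by the components of its vector. Ensure you have downloaded the GloVe file (e.g., `glove.6B.50d.txt`, which contains 50-dimensional vectors) before running the loading cell.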
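To make the file layout concrete, here is a minimal sketch that parses two lines written in the same layout. The helper name `parse_glove_line` is hypothetical and the numbers are made up for illustration; they are not real GloVe values.

```python
# Two made-up lines in the GloVe text layout (illustrative numbers, not real vectors)
sample_lines = [
    "king 0.5 -0.25 1.0",
    "queen 0.75 0.125 -0.5",
]


def parse_glove_line(line):
    """Split one line into (token, list of float components)."""
    token, *components = line.rstrip().split(" ")
    return token, [float(c) for c in components]


# Build a small word -> vector mapping from the sample lines
parsed = dict(parse_glove_line(line) for line in sample_lines)
print(parsed["king"])
```

The real file has the same shape, only with many more lines and, for `glove.6B.50d.txt`, 50 components per word.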


### Downloading GloVe Embeddings

To use GloVe embeddings, you need to download the pre-trained embeddings from the official website:

1. Visit [GloVe Website](https://nlp.stanford.edu/projects/glove/).
2. Download the desired embedding file, such as `glove.6B.zip`.
3. Extract the contents of the zip file to a directory on your system.
4. Note the path to the extracted file, e.g., `glove.6B.50d.txt`.
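If you prefer to do step 3 from Python, the extraction can be sketched with the standard-library `zipfile` module. The helper name `extract_glove_file` and its default arguments are assumptions made for this example, and it presumes the archive has already been downloaded.

```python
import zipfile
from pathlib import Path


def extract_glove_file(zip_path, member="glove.6B.50d.txt", target_dir="."):
    """Extract one embedding file from the archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        # Fail early with a clear message if the archive lacks the requested file
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} is not in {zip_path}")
        extracted = archive.extract(member, path=target_dir)
    return Path(extracted)


# Example call, assuming glove.6B.zip sits in the working directory:
# glove_path = extract_glove_file("glove.6B.zip")
```

Extracting only the file you need avoids unpacking the larger vector files that ship in the same archive.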


In [4]:
# Load GloVe embeddings with error handling
import os


def load_glove_embeddings(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError(
            f"GloVe file not found at {file_path}. Please download it from https://nlp.stanford.edu/projects/glove/"
        )
    embeddings = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype="float32")
            embeddings[word] = vector
    return embeddings


# Specify the path to the GloVe file
glove_file = "glove.6B.50d.txt"
try:
    embeddings = load_glove_embeddings(glove_file)
    print(f"Loaded {len(embeddings)} word vectors.")
except FileNotFoundError as e:
    # Fall back to an empty dictionary so later cells do not fail with a NameError
    embeddings = {}
    print(e)

GloVe file not found at glove.6B.50d.txt. Please download it from https://nlp.stanford.edu/projects/glove/


## 3. Find Similar Words

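We will use cosine similarity, the cosine of the angle between two vectors, to find the words closest to a given word. Scores range from -1 (opposite directions) to 1 (same direction) and do not depend on vector length.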
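As a small illustration of the measure itself, the sketch below computes cosine similarity by hand on toy two-dimensional vectors. The function name `cosine` and the vectors are made up for this example and are unrelated to GloVe.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine(a, b), cosine(a, c), cosine(a, d))
```

The three pairs cover the extremes of the scale: identical direction, no shared direction, and opposite direction.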


In [None]:
# Find the words most similar to a query word
def find_similar_words(word, embeddings, top_n=5):
    """Return up to top_n (word, cosine similarity) pairs closest to `word`."""
    if word not in embeddings:
        print(f"{word} not found in embeddings.")
        return []
    other_words = [w for w in embeddings if w != word]
    if not other_words:
        return []
    # Stack all other vectors into one matrix so similarity is computed in a single call
    matrix = np.stack([embeddings[w] for w in other_words])
    word_vector = embeddings[word].reshape(1, -1)
    similarities = cosine_similarity(word_vector, matrix)[0]
    # Indices of the highest similarities, best first
    top_indices = np.argsort(similarities)[::-1][:top_n]
    return [(other_words[i], float(similarities[i])) for i in top_indices]


# Example: Find words similar to 'king'
similar_words = find_similar_words("king", embeddings)
print(similar_words)

## 4. Visualize Word Relationships

We will use t-SNE to reduce the dimensionality of word vectors and visualize their relationships.
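One practical detail when plotting only a handful of words: scikit-learn's `TSNE` defaults to a perplexity of 30, and recent releases reject a perplexity that is not smaller than the number of samples. The sketch below shows one way to pick a valid value; the helper name `choose_perplexity` is hypothetical and the cap of `n_samples - 1` is a simple choice rather than a tuned one.

```python
def choose_perplexity(n_samples, default=30.0):
    """Pick a t-SNE perplexity that stays strictly below n_samples."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(default, float(n_samples - 1))


print(choose_perplexity(6))    # six words, so the value is capped below 6
print(choose_perplexity(500))  # enough samples to keep the default
```

The result can be passed as `TSNE(n_components=2, perplexity=..., random_state=42)`.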


In [None]:
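# Visualize word relationships
def visualize_words(words, embeddings):
    # Keep only words that have a vector so labels and points stay aligned
    known_words = [word for word in words if word in embeddings]
    if len(known_words) < 2:
        print("Need at least two words with embeddings to plot.")
        return
    vectors = np.array([embeddings[word] for word in known_words])
    # Recent scikit-learn versions require perplexity to be smaller than the number of samples
    perplexity = min(30, len(known_words) - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors)
    plt.figure(figsize=(10, 8))
    for (x, y), word in zip(reduced_vectors, known_words):
        plt.scatter(x, y, label=word)
        plt.text(x + 0.1, y + 0.1, word, fontsize=12)
    plt.title("Word Relationships")
    plt.legend()
    plt.show()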


# Example: Visualize relationships among selected words
words_to_visualize = ["king", "queen", "man", "woman", "prince", "princess"]
visualize_words(words_to_visualize, embeddings)