## Semantic Similarity

This notebook shows the basic steps for computing semantic similarity using Word2vec.
Word2vec represents each word as a high-dimensional vector of numbers (300 in the model used here)
that captures relationships between words.
Words that appear in similar contexts are mapped to vectors that are close together, as measured by cosine similarity.
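
To make the measure concrete, here is a minimal sketch of cosine similarity on hand-picked toy vectors. The 3-dimensional vectors and the helper name `cosine_similarity` are illustrative assumptions, not real Word2vec embeddings or part of any library:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Dot product divided by the product of the two vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen by hand for illustration only.
a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0, 0.0]   # orthogonal to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because the measure depends only on direction, `a` and `b` score 1.0 even though their lengths differ.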


In [7]:
import gensim.models as models
from scipy import spatial
import os, csv, numpy, pandas
import pandas as pd
import numpy as np
from sklearn import preprocessing



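First, download the pre-trained "GoogleNews-vectors-negative300.bin.gz" file from:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
and save it to your working folder. The code below loads the compressed `.bin.gz` file directly, so it does not need to be unpacked first.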

In [8]:
# Change the path location to match the file location

word2vec = models.KeyedVectors.load_word2vec_format(
    '/home1/noaherz/word2vec/GoogleNews-vectors-negative300.bin.gz', binary=True)
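
A wrong path is a common stumbling block at this step. The sketch below wraps the load in a hypothetical helper, `load_vectors` (our name, not a gensim function), that checks the path first. It also passes gensim's optional `limit` argument, which reads only the first N vectors and can reduce memory use while experimenting. Whether a truncated vocabulary is acceptable depends on your word list.

```python
import os

def load_vectors(path, limit=None):
    """Load word2vec-format binary vectors, failing early if the file is missing.

    `limit` caps how many vectors are read (in file order); None reads all.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Word2vec file not found: {path}")
    # Imported here so the path check above runs before gensim is needed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)

# Hypothetical usage; replace the path with your own download location.
# word2vec = load_vectors("GoogleNews-vectors-negative300.bin.gz", limit=200_000)
```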

This example demonstrates how to retrieve 300-dimensional vector representations (embeddings) 
for a list of words using a pre-trained Word2Vec model.

We'll use four example items: "cat", "dog", "bread", and "nonsensical12". 
In practice, your code should iterate over a full list of words of interest.

Note that some words (e.g., "nonsensical12") may not be present in the Word2Vec vocabulary. 
These items should be excluded from downstream analyses.
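
One way to separate usable items from missing ones before any analysis is a membership test. In the sketch below a plain dict with made-up 2-D vectors stands in for the loaded model, and `split_by_vocabulary` is a hypothetical helper name. To our understanding the same `word in vectors` test also works on gensim's `KeyedVectors`, but verify this against your installed version.

```python
def split_by_vocabulary(words, vectors):
    """Return (found, missing) lists, preserving the input order."""
    found, missing = [], []
    for word in words:
        # Append to `found` if the word has a vector, otherwise to `missing`
        (found if word in vectors else missing).append(word)
    return found, missing

# Stand-in for the loaded model: a plain dict with made-up 2-D vectors.
toy_vectors = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "bread": [0.1, 0.9]}

found, missing = split_by_vocabulary(
    ["cat", "dog", "bread", "nonsensical12"], toy_vectors)
print(found)    # ['cat', 'dog', 'bread']
print(missing)  # ['nonsensical12']
```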

In [9]:
# How do we get the 300-dimensional word vectors?

X = numpy.zeros(300)  # placeholder first row, removed after the loop
word_ids_in_analysis_title = []  # positions (in word_list) of words found in the model
words_in_analysis_title = []     # the words themselves

# Replace this dummy list with all of your items (I usually read them from a CSV)
word_list = ["cat", "dog", "bread", "nonsensical12"]
word_list = np.array(word_list)

for no, item in enumerate(word_list):
    try:
        # Look up the vector and scale it to unit length before stacking it onto X
        X = numpy.vstack((X, preprocessing.normalize(word2vec[item].reshape(1, -1))))
        word_ids_in_analysis_title.append(no)
        words_in_analysis_title.append(item)
    except KeyError:
        # The word is not in the Word2vec vocabulary; print it and skip it
        print(item)
X = X[1:]  # drop the placeholder row
print(X.shape)

nonsensical12
(3, 300)


In [10]:
# How do we get the cosine similarity between two words?

print(word2vec.similarity("cat", "dog"))
print(word2vec.similarity("cat", "cat"))
print(word2vec.similarity("cat", "bread"))

0.76094574
1.0
0.14007339


In [11]:
# How do we create an N x N similarity matrix?

simmatrix = []

word_list = word_list[:3]  # exclude the last word, which is not in the vocabulary

for i in word_list:
    this_word = []
    for j in word_list:
        this_word.append(word2vec.similarity(i, j))
    simmatrix.append(this_word)

# If all of your words are in Word2vec, the matrix shape is len(word_list) x len(word_list).
# Otherwise you'll get an error such as: KeyError: "word 'doughnut' not in vocabulary",
# and you'll need to drop or replace that item.
# You can save simmatrix for future use with np.save().

print(len(word_list))
print(np.shape(np.array(simmatrix)))
print(simmatrix)

3
(3, 3)
[[1.0, 0.76094574, 0.14007339], [0.76094574, 0.99999994, 0.17611265], [0.14007339, 0.17611265, 1.0]]
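
For longer word lists the nested loop can be replaced by a single matrix product. The sketch below assumes the vectors are already stacked row by row in a 2-D array. Toy 2-D vectors are used here in place of real embeddings, and `similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarities for the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms   # each row now has length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Toy 2-D vectors for illustration only (not real embeddings).
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sim = similarity_matrix(toy)
print(sim.shape)  # (3, 3)
```

The result is symmetric with ones on the diagonal (up to rounding), which makes for a quick sanity check before saving it with `np.save`.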


In [12]:
# Let's do a sanity check with the word cat and its similarity to other items

print(word_list[0].upper(), "and its similarity to other items:")
for no, i in enumerate(simmatrix[0]):
    print(word_list[no], i)

CAT and its similarity to other items:
cat 1.0
dog 0.76094574
bread 0.14007339


In [13]:
# You can also obtain the similarities directly from the 300-dimensional word vectors

simmatrix_cos = []

for i in X:
    this_word_cos = []
    for j in X:
        this_word_cos.append(numpy.dot(i, j) / (numpy.linalg.norm(i) * numpy.linalg.norm(j)))
    simmatrix_cos.append(this_word_cos)
    
print(simmatrix_cos)

[[1.0, 0.760945707266373, 0.14007337807192446], [0.760945707266373, 1.0, 0.17611264178511324], [0.14007337807192446, 0.17611264178511324, 1.0]]


In [14]:
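# Check whether the two matrices are identical.
# The exact comparison below returns False because the values differ by tiny
# floating-point rounding amounts (compare 0.76094574 with 0.760945707266373);
# a tolerance-based check such as np.allclose(simmatrix, simmatrix_cos) is more appropriate.

simmatrix == simmatrix_cos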

False
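
The `False` result reflects rounding rather than a real disagreement: the two matrices agree to about seven decimal places. This is likely because one set of values was computed in single precision and the other in double precision. A tolerance-based comparison is the usual remedy. The sketch below copies the printed values from the two matrices and compares them with `np.allclose` at its default tolerances:

```python
import numpy as np

# Values copied from the two printed matrices.
simmatrix = [[1.0, 0.76094574, 0.14007339],
             [0.76094574, 0.99999994, 0.17611265],
             [0.14007339, 0.17611265, 1.0]]
simmatrix_cos = [[1.0, 0.760945707266373, 0.14007337807192446],
                 [0.760945707266373, 1.0, 0.17611264178511324],
                 [0.14007337807192446, 0.17611264178511324, 1.0]]

print(simmatrix == simmatrix_cos)             # False: exact comparison
print(np.allclose(simmatrix, simmatrix_cos))  # True: equal within tolerance
```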

### Note

The Word2Vec model distinguishes between uppercase and lowercase characters. For example, "CAT" may not be found in the model even if "cat" is present.

To ensure consistent matching, it is advisable to standardize all words to lowercase (or uppercase, as long as it is consistent) before retrieving their vector representations.
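
A minimal sketch of such standardization is shown below. The helper name `normalize_words` is hypothetical, and stripping whitespace and dropping duplicates are extra assumptions beyond lowercasing. Note that lowercasing can merge entries the model keeps separate, such as proper nouns.

```python
def normalize_words(words):
    """Strip whitespace, lowercase, and drop duplicates while keeping order."""
    seen = set()
    cleaned = []
    for word in words:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned

print(normalize_words(["CAT", "cat ", "Dog", "bread"]))  # ['cat', 'dog', 'bread']
```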

In [15]:
def case_insensitive_similarity(word1, word2, word2vec):
    """
    Compute Word2Vec similarity between two words in a case-insensitive way.
    Tries every pairing of the all-lowercase and all-uppercase forms of the two
    words and returns the maximum similarity score. Other casings (e.g. "Cat")
    are not tried as-is.

    Args:
        word1 (str): First word.
        word2 (str): Second word.
        word2vec: The loaded word vectors (as returned by load_word2vec_format).

    Returns:
        float or None: Maximum similarity score across case combinations,
                       or None if no combination has both words in the model.
    """
    similarities = []
    cases = [(word1.lower(), word2.lower()),
             (word1.lower(), word2.upper()),
             (word1.upper(), word2.lower()),
             (word1.upper(), word2.upper())]

    for w1, w2 in cases:
        try:
            similarities.append(word2vec.similarity(w1, w2))
        except KeyError:
            continue

    return max(similarities) if similarities else None

In [16]:
# Example usage of case_insensitive_similarity function
word_a = "Cat"
word_b = "dog"

# Compute similarity
similarity_score = case_insensitive_similarity(word_a, word_b, word2vec)
print(f"Similarity between '{word_a}' and '{word_b}': {similarity_score:.4f}")

Similarity between 'Cat' and 'dog': 0.7609
