## Pre-trained word embeddings

In [12]:
import gensim.downloader as api
from scipy.spatial.distance import cosine
import numpy as np

# Pre-trained Word2Vec model from google
# This model is huge and takes a while to download 10+ mins for me (sorry)
model = api.load("word2vec-google-news-300")


In [29]:
# Calculate cosine similarity between two words using pre-trained Word2Vec embeddings
def cosine_similarity(word1, word2):
    
    # Check if both words are present in the vocabulary
    if word1 in model and word2 in model:
    
        return 1 - cosine(model[word1], model[word2])

    else:
        print("One or both of those words isn't in the data set")
        return None

In [71]:
# Calculate dissimilarity score between two words based on my own idea explained below
def dissimilarity_score(word1, word2):
    
    # Check if both words are present in the vocabulary
    if word1 in model and word2 in model:

        # Calculate absolute differences between corresponding elements
        abs_diff = np.abs(model[word1] - model[word2])
        # Calculate the average of the absolute differences
        avg_dist = np.mean(abs_diff)

        # making copies to edit
        word1_big = np.empty(())# model[word1].copy()
        word2_big = np.empty(())# model[word2].copy()

        # All words should have same length array
        # Calculate average dissimilarity of dissimilaritys above original average
        for i in range(len(model[word1])):

            if(abs(model[word1][i] - model[word2][i]) > avg_dist):

                word1_big = np.append(word1_big, model[word1][i])
                word2_big = np.append(word2_big, model[word2][i])

        # Calculate the cosign disimilarity between the biased arrays
        return cosine(word1_big, word2_big)

    else:
        print("One or both of those words isn't in the data set")
        return None

In [73]:
# Get user input
word1 = input("Enter the first word: ")
word2 = input("Enter the second word: ")

cosine_sim = cosine_similarity(word1, word2)
if cosine_sim is not None:
    print(f"Cosine Similarity between '{word1}' and '{word2}': {cosine_sim}")

dissimilarity = dissimilarity_score(word1, word2)
if dissimilarity is not None:
    print(f"Least generous interpretaion between '{word1}' and '{word2}': {dissimilarity}")


Cosine Similarity between 'plant' and 'factory': 0.6708794832229614
Least generous interpretaion between 'plant' and 'factory': 0.9422026926135413


#### Explanation of disimilarity

My method, which I am titling "least generous interpretation" is theoretically used to calculate the disimilarity between two words that *could* be similar. For example, plant and factory are a classic example of two words that can be interpreted to have very similar meanings or very different meanings depending on which definition of "plant" you use. Is it a place where manufacturing is done, or a living thing?

Least generous interpretaion takes the array of coordinates from the model and finds the average distance between them. It's odd to think of an array of length 300 as having "dimensions" like xyz but thats how it works so stick with me. So, we have our distance between out x coord, y coord, etc. Now we take the average of this distance. Finally we only consider coordinates further apart than this average. This should fix the issue that methods like euclidean distance and cosign similarity fall into where the very similar and very different interpretations pull the model in opposite directions. For example, a cosine similarity between 'plant' and 'factory' of 0.6708794832229614, an unhelpful score that doesn't really capture their similarity or lack there of.

My implementation could be optimized further but for this single case its just a proof of concept.

It seems to work given the least generous interpretaion between 'plant' and 'factory' is 0.9422026926135413.