## Similarity comparison using GloVe

word2vec is a predictive model built on the principle that words appearing in similar contexts tend to have similar meanings. However, it only looks at a local context window around each word (for example, +/- 2 words). An alternative idea is that the overall co-occurrence of words across an entire corpus is also a useful signal for similarity. That led to GloVe (Global Vectors), which is usually described as a count-based model. GloVe constructs a global co-occurrence matrix: think of it as a massive table where each cell captures how often two words appear near each other across the entire text. This sounds similar to traditional NLP methods like LSA. However, LSA factorizes a term-document matrix, while GloVe shifts the focus to word-word co-occurrence and fits the vectors with a weighted least-squares objective, so that differences between vectors can be interpreted in terms of ratios of co-occurrence probabilities. Like word2vec, GloVe turns each word into an n-dimensional vector.

While word2vec is known for its ability to capture analogies and word relationships, GloVe is often preferred for tasks where corpus-wide statistics matter. Co-occurrences are still counted within a context window, so it combines local context with global statistics, which makes it versatile across NLP tasks. Building the co-occurrence matrix is memory-intensive, but once the counts are collected, learning from them is efficient. Pre-trained vectors are available for corpora such as Wikipedia or Common Crawl, so you can use them on smaller datasets without needing huge computational resources.
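To make the co-occurrence table concrete, here is a minimal sketch that counts word pairs within a symmetric window over a toy corpus. The corpus, the window size of 2, and the function name build_cooccurrence are illustrative assumptions. Real GloVe additionally weights a pair that is d words apart by 1/d and then fits vectors to the log counts, which this sketch leaves out.

```python
from collections import Counter

def build_cooccurrence(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look at neighbours to the left and right of position i
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
cooc = build_cooccurrence(corpus, window=2)
print(cooc[("sat", "on")])   # 2: adjacent once in each sentence
print(cooc[("cat", "dog")])  # 0: never in the same window
```

Each key of the Counter corresponds to one cell of the table; on a real corpus most cells are zero, which is why sparse storage is normally used.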

For tasks like word analogies or semantic similarity, both word2vec and GloVe perform well, though word2vec is sometimes said to have a slight edge on complex analogies (king - man + woman ≈ queen). For sentiment analysis and other tasks that benefit from global language understanding, GloVe often does well because it captures both local and global word relationships.

word2vec is a good fit for recommendation systems, where it can model relationships between items by treating them like words in a sentence; for chatbots, where word proximity within a dialogue matters; and for machine translation, where it can capture the local syntax and semantics of words. In short, if the task relies heavily on relationships between words in short sequences, take word2vec.

GloVe is a good fit for text classification such as sentiment analysis, where the overall meaning of words across a large corpus matters more than just the nearby words; for similarity scoring, which requires broader semantic relationships between words; and for topic modeling, where its global view of word co-occurrence can help group documents or words into meaningful clusters that go beyond local context. In all, if the task requires a more global understanding of language, GloVe is probably the better choice.

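Gensim does not train GloVe itself, but it can download and load pre-trained GloVe vectors (for example through gensim.downloader) and exposes them as KeyedVectors.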

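'glove-twitter-25' is trained on Twitter text (2B tweets, 27B tokens, a vocabulary of 1.2M words). Each word is represented as a 25-dimensional vector. Note that the cells below load a different file, glove.840B.300d.txt, which is trained on Common Crawl (840B tokens, a cased vocabulary of 2.2M words) and uses 300-dimensional vectors.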

These pre-trained files are not a trainable model: there is no neural network in them (and GloVe is not a skip-gram word2vec model in the first place), only the learned vectors. What I load here is only the embeddings, not the training weights.
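For reference, the 'glove-twitter-25' vectors mentioned above can be fetched through Gensim's downloader, which returns a KeyedVectors object. This is a sketch rather than what the cells below do (they parse glove.840B.300d.txt by hand). The helper name load_twitter_glove is hypothetical, and the first call downloads the vectors, so it needs network access.

```python
def load_twitter_glove(name="glove-twitter-25"):
    """Download (on first use) and return pre-trained vectors as KeyedVectors."""
    # Imported lazily so that defining the helper does not require Gensim
    import gensim.downloader as api
    return api.load(name)
```

On the returned object, vectors.similarity("lol", "hahaha") gives a cosine similarity and vectors.most_similar("banana", topn=10) gives nearest neighbours, so the hand-written helpers further down are not needed in that setup.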

In [1]:
from gensim.models import KeyedVectors  # imported, but the file is parsed manually below
glove = "glove.840B.300d.txt"  # path to the pre-trained GloVe text file

In [2]:
import numpy as np

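def loadGloveModel(gloveFile, dim=300):
    """Parse a GloVe text file into a dict mapping word -> float32 vector."""
    model = {}
    # 'with' closes the file even if parsing fails partway through
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # Split from the right: a few tokens in glove.840B.300d contain spaces.
            # dim is the vector size and is assumed to be 300 for this file.
            splitLine = line.rstrip('\n').rsplit(' ', dim)
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    return model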

In [3]:
glove_new = loadGloveModel(glove)

In [8]:
"Zurich" in glove_new

True

In [9]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

In [11]:
cosine_similarity(glove_new['car'], glove_new['vehicle'])

np.float32(0.7667538)

In [12]:
cosine_similarity(glove_new['lol'], glove_new['hahaha'])

np.float32(0.8785519)

In [13]:
import heapq

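def most_similar(word, top_k=10):
    """Return the top_k words closest to `word` by cosine similarity."""
    target_vec = glove_new[word]

    # Brute-force scan over the whole vocabulary (slow for millions of words)
    similarities = []
    for other_word, other_vec in glove_new.items():
        if other_word == word:
            continue
        sim = cosine_similarity(target_vec, other_vec)
        similarities.append((sim, other_word))

    # Keep only the top_k highest-scoring (similarity, word) pairs
    top = heapq.nlargest(top_k, similarities)
    return [(w, float(s)) for s, w in top]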

In [14]:
most_similar("banana", 10)

[('bananas', 0.8091707229614258),
 ('pineapple', 0.7421034574508667),
 ('coconut', 0.7215185761451721),
 ('strawberry', 0.7120676040649414),
 ('mango', 0.699190080165863),
 ('carrot', 0.6791642308235168),
 ('fruit', 0.6714836359024048),
 ('pumpkin', 0.6642457246780396),
 ('peanut', 0.6637847423553467),
 ('blueberry', 0.6517722010612488)]
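
The most_similar function above computes one cosine similarity per word in a Python loop, which is slow for a vocabulary of millions of words. A common alternative, sketched here under the assumption that all vectors fit in memory as one matrix, is to normalize every vector once and get all similarities with a single matrix-vector product. The names build_index and most_similar_fast and the toy vectors are illustrative, not part of the original notebook.

```python
import numpy as np

def build_index(embeddings):
    """Stack a word -> vector dict into a row-normalized matrix plus a word list."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words]).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.maximum(norms, 1e-12)  # guard against zero vectors
    return words, matrix

def most_similar_fast(word, words, matrix, top_k=10):
    """Return the top_k (word, cosine similarity) pairs, excluding the query word."""
    idx = words.index(word)
    sims = matrix @ matrix[idx]   # rows are unit length, so dot product = cosine
    sims[idx] = -np.inf           # never return the query word itself
    top = np.argsort(-sims)[:top_k]
    return [(words[i], float(sims[i])) for i in top]

# Toy 2-d vectors (made up for illustration)
toy = {
    "banana": np.array([1.0, 0.1]),
    "bananas": np.array([0.9, 0.2]),
    "mango": np.array([0.7, 0.6]),
    "car": np.array([-0.2, 1.0]),
}
words, matrix = build_index(toy)
print(most_similar_fast("banana", words, matrix, top_k=2))
```

With the real vectors, build_index(glove_new) would be called once and the resulting matrix reused for every query.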

In [15]:
cosine_similarity(glove_new['king'] - glove_new['man'] + glove_new['woman'], glove_new['queen'])

np.float32(0.78808445)
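
A cosine of about 0.79 shows that the combined vector points roughly toward 'queen', but the usual analogy test asks a stronger question: is 'queen' the nearest word to king - man + woman once the three input words are excluded? A minimal sketch of that check follows, run on made-up 2-d vectors so the result is easy to verify by hand. The function name analogy and the toy values are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def analogy(embeddings, a, b, c, top_k=1):
    """Solve 'a is to b as c is to ?' by ranking words near vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [
        (float(cosine_similarity(target, vec)), word)
        for word, vec in embeddings.items()
        if word not in (a, b, c)  # the input words are excluded by convention
    ]
    candidates.sort(reverse=True)
    return [(word, sim) for sim, word in candidates[:top_k]]

# Toy vectors: first axis ~ "royalty", second axis ~ "gender" (made up)
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}
print(analogy(toy, "man", "king", "woman"))
```

On the loaded vectors the equivalent call would be analogy(glove_new, "man", "king", "woman"), which scans the whole vocabulary and is therefore slow.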

In [16]:
cosine_similarity(glove_new['vienna'], glove_new['alps'])

np.float32(0.42697847)

In [21]:
most_similar("alps", 10)

[('Alps', 0.7212010622024536),
 ('alpine', 0.5882586240768433),
 ('alp', 0.586816132068634),
 ('chamonix', 0.573789656162262),
 ('alpes', 0.5729764699935913),
 ('switzerland', 0.5561335682868958),
 ('austria', 0.5324950218200684),
 ('bavarian', 0.5103736519813538),
 ('dolomites', 0.5045719742774963),
 ('zermatt', 0.502839207649231)]

In [22]:
cosine_similarity(glove_new['st.'], glove_new['alps'])

np.float32(0.157181)

In [23]:
cosine_similarity(glove_new['wolfgang'], glove_new['alps'])

np.float32(0.26138788)

In [26]:
# np.save pickles the dict into glove_new.npy (stored as a 0-d object array)
np.save('glove_new', glove_new)
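
Because glove_new is a Python dict, np.save stores it as a pickled object array, so loading it back needs allow_pickle=True followed by .item() to unwrap the dict. The sketch below shows the round trip on a tiny stand-in dict in a temporary directory; the variable names and values are illustrative. Only load pickled files you created yourself, since unpickling can execute arbitrary code.

```python
import os
import tempfile
import numpy as np

# Tiny stand-in for glove_new (made-up values)
small = {"car": np.array([0.1, 0.2], dtype="float32"),
         "vehicle": np.array([0.3, 0.4], dtype="float32")}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "glove_small")
    np.save(path, small)  # writes glove_small.npy
    # allow_pickle=True is required for object arrays; .item() returns the dict
    loaded = np.load(path + ".npy", allow_pickle=True).item()

print(sorted(loaded))
print(loaded["car"])
```

For the file written above, the corresponding call would be np.load('glove_new.npy', allow_pickle=True).item().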