# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. We're going to use **GloVe** embeddings, available from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). In this notebook we fetch the 100-dimensional file (`glove.6B.100d.txt`) through `kagglehub` instead.

In [2]:
!pip3 install gensim



In [3]:
!pip3 install kagglehub



In [4]:
import numpy as np

In [5]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("danielwillgeorge/glove6b100dtxt")

print("Path to dataset files:", path)


Path to dataset files: /Users/soehtetnaing/.cache/kagglehub/datasets/danielwillgeorge/glove6b100dtxt/versions/1
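
The download path above is specific to one machine, so it can help to locate the embedding file from whatever directory `kagglehub` returns instead of typing the path by hand. Below is a minimal sketch of that idea. The helper name `find_embedding_file` is hypothetical, and the demo uses a temporary directory as a stand-in for the real download folder.

```python
import os
import tempfile


def find_embedding_file(directory, suffix=".txt"):
    """Return the path of the first file in `directory` that ends with `suffix`.

    Hypothetical helper for illustration; raises FileNotFoundError if nothing matches.
    """
    for name in sorted(os.listdir(directory)):
        if name.endswith(suffix):
            return os.path.join(directory, name)
    raise FileNotFoundError(f"no {suffix} file found in {directory}")


# Demo on a throwaway directory standing in for the kagglehub download path.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "glove.6B.100d.txt"), "w").close()
    found = find_embedding_file(demo_dir)
    print(os.path.basename(found))
```

With the real download, the call would be `find_embedding_file(path)`, where `path` is the value returned by `kagglehub.dataset_download`.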


In [6]:
import os
from gensim.models import KeyedVectors

# Reuse the directory returned by kagglehub.dataset_download above
base_path = path
file_name = "glove.6B.100d.txt"  # Replace with the correct file name if different
glove_file = os.path.join(base_path, file_name)

# GloVe text files have no "<vocab_size> <dim>" header line, hence no_header=True
model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

print("GloVe model loaded successfully!")


GloVe model loaded successfully!
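
To see why `no_header=True` is needed, it helps to look at the file layout. A GloVe text file stores one token per line followed by its vector components, without the `<vocab_size> <dim>` first line that the word2vec text format has. The sketch below parses a few made-up lines in that layout. The numbers are invented and `parse_glove_text` is a hypothetical helper, not what Gensim does internally.

```python
import io

# Three made-up lines in the GloVe text layout: a token followed by its
# space-separated vector components. There is no header line.
sample = io.StringIO(
    "coffee 0.1 0.2 0.3\n"
    "tea 0.0 0.25 0.35\n"
    "car -0.4 0.9 0.0\n"
)


def parse_glove_text(handle):
    """Parse GloVe-style text lines into a dict of token -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


vectors = parse_glove_text(sample)
print(len(vectors), len(vectors["coffee"]))
```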


In [7]:
import os
print(os.listdir(base_path))


['glove.6B.100d.txt']


In [8]:
# Look up the vector for a word; each vector in this file has 100 dimensions
model['coffee'].shape

(100,)

## Testing

In [9]:
file_paths = {
    "semantic": "../word-testsemantic.v1.txt",
    "syntatic": "../word-testsyntatic.v1.txt"
}

def load_tests(file_path):
    with open(file_path, "r") as file:
        return [sent.strip() for sent in file]

# Load tests from files
semantic = load_tests(file_paths["semantic"])
syntatic = load_tests(file_paths["syntatic"])

## Semantic Accuracy

In [10]:
sem_total = len(semantic)
sem_correct = 0

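for sent in semantic:
    sent = sent.lower()
    words = sent.split(" ")

    # Each line is "a b c d": predict d as the word closest to b - a + c
    try:
        result = model.most_similar(positive=[words[1], words[2]], negative=[words[0]])[0][0]
    except KeyError:  # a word is missing from the GloVe vocabulary
        result = "<UNK>"

    if result == words[3]:
        sem_correct += 1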

In [11]:
sem_accuracy = sem_correct / sem_total
print(f"Semantic accuracy: {sem_accuracy:2.4f}")

Semantic accuracy: 0.5316
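
The call `most_similar(positive=[b, c], negative=[a])` answers the analogy "a is to b as c is to ?" by looking for the word whose vector is closest to `b - a + c`. Here is a rough pure-Python sketch of that arithmetic on invented 3-dimensional vectors. It assumes unit-normalized inputs and cosine ranking with the query words excluded, which is our understanding of the usual approach rather than a copy of Gensim's code. The name `solve_analogy` and the toy vectors are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def solve_analogy(vectors, a, b, c):
    """Return the word closest (by cosine) to b - a + c, excluding a, b and c."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are not valid answers
        score = sum(p * q for p, q in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Invented toy vectors: (royalty, gender, other)
toy = {
    "man": [0.1, 1.0, 0.2],
    "woman": [0.1, -1.0, 0.2],
    "king": [1.0, 1.0, 0.2],
    "queen": [1.0, -1.0, 0.2],
    "apple": [0.0, 0.0, 1.0],
}

print(solve_analogy(toy, "man", "king", "woman"))
```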


## Syntactic Accuracy

In [12]:
syn_total = len(syntatic)
syn_correct = 0

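for sent in syntatic:
    sent = sent.lower()
    words = sent.split(" ")

    # Each line is "a b c d": predict d as the word closest to b - a + c
    try:
        result = model.most_similar(positive=[words[1], words[2]], negative=[words[0]])[0][0]
    except KeyError:  # a word is missing from the GloVe vocabulary
        result = "<UNK>"

    if result == words[3]:
        syn_correct += 1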

In [13]:
syn_accuracy = syn_correct / syn_total
print(f"Syntactic accuracy: {syn_accuracy:2.4f}")

Syntactic accuracy: 0.5545
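
The semantic and syntactic cells run the same loop, so the logic could live in one helper. The sketch below is one possible shape for it. The name `analogy_accuracy` is hypothetical, and the predictor is passed in as a function so that the example runs without the GloVe model. Unlike the cells above, this version skips lines that do not contain exactly four words instead of counting them in the total, which is a choice made for this sketch.

```python
def analogy_accuracy(lines, predict):
    """Fraction of "a b c d" lines for which predict(a, b, c) returns d."""
    total = 0
    correct = 0
    for line in lines:
        words = line.lower().split()
        if len(words) != 4:
            continue  # skip blank or malformed lines (a choice of this sketch)
        total += 1
        try:
            guess = predict(words[0], words[1], words[2])
        except KeyError:
            guess = "<UNK>"  # out-of-vocabulary word
        if guess == words[3]:
            correct += 1
    return correct / total if total else 0.0


# Stand-in predictor backed by a lookup table instead of word vectors.
answers = {
    ("athens", "greece", "oslo"): "norway",
    ("man", "king", "woman"): "queen",
}


def lookup_predict(a, b, c):
    return answers[(a, b, c)]


lines = ["Athens Greece Oslo Norway", "man king woman queen", "dog dogs cat cats", ""]
acc = analogy_accuracy(lines, lookup_predict)
print(f"{acc:.4f}")
```

With the loaded model, the predictor could be `lambda a, b, c: model.most_similar(positive=[b, c], negative=[a])[0][0]`.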


## Similarity Accuracy

In [14]:
file_path = "../wordsim_similarity_goldstandard.txt"

with open(file_path, 'r') as file:
    sim_data = [sent.strip() for sent in file]

In [15]:
# Sanity check: fall back to a zero vector for out-of-vocabulary words
default_vector = np.zeros(model.vector_size)
try:
    result = model.get_vector('country')
except KeyError:
    result = default_vector


In [16]:
def compute_similarity(model, test_data):
    # Each line is "word1<TAB>word2<TAB>human_score"
    words = test_data.lower().split("\t")

    default_vector = np.zeros(model.vector_size)
    try:
        embed0 = model.get_vector(words[0].strip())
        embed1 = model.get_vector(words[1].strip())
    except KeyError:
        # If either word is out of vocabulary, the pair scores 0
        embed0 = default_vector
        embed1 = default_vector

    # Raw dot product (not length-normalized cosine similarity)
    similarity_model = embed1 @ embed0
    similarity_provided = float(words[2].strip())

    return similarity_provided, similarity_model
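
`compute_similarity` scores a word pair with the raw dot product, which grows with vector length as well as with how closely the directions agree. Cosine similarity divides the lengths out, and it is what Gensim's `model.similarity(w1, w2)` returns. The small pure-Python sketch below contrasts the two on invented 2-dimensional vectors. It is an illustration only, not a replacement for the cell above, and switching measures would change the reported correlation.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    """Dot product divided by both lengths; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


short = [1.0, 2.0]
longer = [10.0, 20.0]  # same direction as `short`, ten times the length
other = [2.0, -1.0]    # orthogonal to `short`

print(dot(short, longer), cosine(short, longer))
print(dot(short, other), cosine(short, other))
```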

In [17]:
ds_scores = []
model_scores = []
for sent in sim_data:
    ds_score, model_score = compute_similarity(model, sent)

    ds_scores.append(ds_score)
    model_scores.append(model_score)

In [18]:
from scipy.stats import spearmanr

corr = spearmanr(ds_scores, model_scores)[0]

print(f"Correlation score is {corr:2.2f}.")

Correlation score is 0.54.
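
Spearman's correlation compares the ranking of the pairs rather than the raw scores, so it only asks whether the model orders word pairs the same way the human annotators do. A minimal pure-Python sketch of the idea follows: rank both lists (ties get the average rank), then take the Pearson correlation of the ranks. The helper names and the toy scores are invented, and `scipy.stats.spearmanr` remains the function to use in practice.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [9.0, 7.5, 3.0, 1.0]         # invented human similarity ratings
toy_model = [40.0, 12.0, 15.0, -2.0]  # invented model scores on another scale

rho = spearman(human, toy_model)
print(f"{rho:.2f}")
```

Only the ordering matters here: the model scores are on a completely different scale from the human ratings, yet the correlation depends solely on how the two rankings line up.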


In [19]:
import os
import pickle

# Save the model (create the target directory first if it does not exist)
os.makedirs('../models', exist_ok=True)
with open('../models/gensim.model', 'wb') as f:
    pickle.dump(model, f)

In [None]:
with open('../models/gensim.model', 'rb') as f:
    load_model = pickle.load(f)
load_model.most_similar('country')

[('nation', 0.8758763670921326),
 ('now', 0.7629706263542175),
 ('well', 0.7466367483139038),
 ('countries', 0.7459027171134949),
 ('world', 0.7444546818733215),
 ('states', 0.7400078773498535),
 ('has', 0.7369707822799683),
 ('government', 0.7319549322128296),
 ('already', 0.7283994555473328),
 ('most', 0.7282326817512512)]