# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. We're going to use **GloVe** embeddings, available from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). In this notebook the 100-dimensional file, `glove.6B.100d.txt`, is fetched from a Kaggle dataset instead of the zip.

In [27]:
import numpy as np
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec


In [2]:
#connect to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# work from the project folder on Drive so the relative paths used later (data/, app/) resolve
import os

os.chdir('/content/drive/MyDrive/_NLP/NLP-A1-That-s-What-I-LIKE-st125553')

In [6]:
# download glove.6B.100d.txt
import kagglehub

# Download latest version
path = kagglehub.dataset_download("danielwillgeorge/glove6b100dtxt")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/danielwillgeorge/glove6b100dtxt?dataset_version_number=1...


100%|██████████| 131M/131M [00:01<00:00, 117MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/danielwillgeorge/glove6b100dtxt/versions/1


In [7]:
# datapath() is meant for Gensim's bundled test data, so build the path from the download location instead
glove_file = os.path.join(path, 'glove.6B.100d.txt')
# GloVe text files have no "<vocab_size> <dim>" header line, hence no_header=True (Gensim 4.0+)
model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)
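
Why `no_header=True`: a GloVe text file is just one token per line followed by its vector components, without the `<vocab_size> <dim>` first line that the word2vec text format carries. Below is a minimal sketch of reading that layout by hand. It uses two made-up lines rather than the real file, and `parse_glove_lines` is a hypothetical helper for illustration, not part of Gensim.

```python
import io

# Two made-up lines in GloVe's text layout: "<token> <v1> <v2> ... <vd>".
# The values are illustrative, not taken from glove.6B.100d.txt.
sample = io.StringIO(
    "coffee 0.1 -0.2 0.3\n"
    "tea 0.0 0.5 -0.4\n"
)

def parse_glove_lines(handle):
    """Return {token: [floats]} from an open GloVe-style text file."""
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

toy_vectors = parse_glove_lines(sample)
print(len(toy_vectors), len(toy_vectors["coffee"]))  # number of tokens, vector length
```

For the real 100-dimensional file, `KeyedVectors.load_word2vec_format` is the better choice, since it stores everything in one NumPy matrix instead of Python lists.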

In [12]:
#return the vectors
model['coffee'].shape

(100,)

## Testing

### Semantic Test

In [13]:
semantic_file = "data/word-test-semantic.txt"
# read the analogy questions, one "a b c d" line per question
with open(semantic_file, "r") as file:
    semantic = [line.strip() for line in file]

#semantic

In [14]:
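sem_count = len(semantic)
sem_correct = 0

for sent in semantic:
    # each line is "a b c d": a is to b as c is to d
    words = sent.lower().split(" ")

    try:
        # nearest neighbour of (b - a + c), excluding the query words
        result = model.most_similar(positive=[words[1], words[2]], negative=[words[0]])[0][0]
    except KeyError:
        # a query word is missing from the GloVe vocabulary
        result = "<UNK>"

    if result == words[3]:
        sem_correct += 1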

In [16]:
sem_accuracy = sem_correct / sem_count
print(f"Semantic test result: {sem_accuracy:2.4f}")

Semantic test result: 0.5316
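
To make the analogy test concrete: for a line `a b c d`, the query `most_similar(positive=[b, c], negative=[a])` looks for the word whose vector is closest, by cosine similarity, to `b - a + c`, and skips the three query words. The sketch below reproduces that idea in plain Python on a tiny made-up vocabulary. The `toy` vectors and the `analogy` helper are hypothetical, and this is my understanding of the approach rather than Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by cosine similarity to b - a + c."""
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = unit([y - x + z for x, y, z in zip(ua, ub, uc)])
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # never return one of the query words
        score = sum(p * q for p, q in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up 3-d vectors: dimension 1 ~ "gender", dimension 2 ~ "royalty".
toy = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [0.2, 0.0, 1.0],
    "queen": [0.2, 1.0, 1.0],
    "apple": [-1.0, 0.3, -0.5],
}
print(analogy(toy, "man", "woman", "king"))  # prints "queen"
```

A question counts as correct only when the top-ranked word equals `d` exactly, so near misses and out-of-vocabulary words both lower the accuracy.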


### Syntactic Test

In [17]:
syntatic_file = "data/word-test-syntatic.txt"
# read the analogy questions, one "a b c d" line per question
with open(syntatic_file, "r") as file:
    syntatic = [line.strip() for line in file]
#syntatic

In [18]:
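syn_count = len(syntatic)
syn_correct = 0

for sent in syntatic:
    # each line is "a b c d": a is to b as c is to d
    words = sent.lower().split(" ")

    try:
        # nearest neighbour of (b - a + c), excluding the query words
        result = model.most_similar(positive=[words[1], words[2]], negative=[words[0]])[0][0]
    except KeyError:
        # a query word is missing from the GloVe vocabulary
        result = "<UNK>"

    if result == words[3]:
        syn_correct += 1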

In [19]:
syn_accuracy = syn_correct / syn_count
print(f"Syntactic accuracy: {syn_accuracy:2.2f}")
print(f"Syntactic correct: {syn_correct}")
print(f"Syntactic count: {syn_count}")

Syntactic accuracy: 0.55
Syntactic correct: 865
Syntactic count: 1560


### Similarity Test


In [20]:
similarity_file = "data/wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"
# read the word pairs, one tab-separated "word1 word2 human_score" line per pair
with open(similarity_file, "r") as file:
    similarity = [line.strip() for line in file]
#similarity

In [29]:
# zero vector used as a fallback for out-of-vocabulary words; its length is the embedding size
default_vector = np.zeros(model.vector_size)
len(default_vector)

100

In [37]:
# def similarity_test(model, test_data):
#     words = test_data.split("\t")

#     embed0 = np.array(model.get_vector(words[0].strip()))
#     embed1 = np.array(model.get_vector(words[1].strip()))

#     model_result = embed1 @ embed0.T
#     sim_result = float(words[2].strip())

#     return sim_result, model_result

In [39]:
def similarity_test(model, test_data):
    # each line is "word1<TAB>word2<TAB>human_score"
    words = test_data.lower().split("\t")

    # fall back to zero vectors (dot product 0) when either word is out of vocabulary
    default_vector = np.zeros(model.vector_size)
    try:
        embed0 = model.get_vector(words[0].strip())
        embed1 = model.get_vector(words[1].strip())
    except KeyError:
        embed0 = default_vector
        embed1 = default_vector

    # the model score is the raw dot product, not the cosine similarity
    similarity_model = embed1 @ embed0
    similarity_provided = float(words[2].strip())

    return similarity_provided, similarity_model

In [40]:
sim_scores = []
model_scores = []
for sent in similarity:
    sim_result, model_result = similarity_test(model, sent)

    sim_scores.append(sim_result)
    model_scores.append(model_result)

In [41]:
from scipy.stats import spearmanr

corr = spearmanr(sim_scores, model_scores)[0]

print(f"The correlation result is {corr:2.2f}.")

The correlation result is 0.54.
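
The human scores and the raw dot products live on different scales, which is fine here because Spearman's correlation only compares the two rankings. A minimal pure-Python sketch of that idea follows (Pearson correlation computed on ranks). The `human` and `dots` lists are made-up numbers and the helper names are hypothetical; in the notebook itself `scipy.stats.spearmanr` does this work.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

human = [9.0, 7.5, 3.0, 1.2]     # hypothetical gold similarity scores
dots = [40.0, 55.0, 12.0, -3.0]  # hypothetical model dot products
rho = spearman(human, dots)
print(f"{rho:.2f}")  # 0.80: only the top two pairs are swapped
```

Any monotonic rescaling of the model scores leaves the result unchanged, so the scale of the dot products does not matter, only the order they induce.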


## Save the result

In [58]:
import pickle

# Save the model
with open('app/models/gensim.model', 'wb') as f:
    pickle.dump(model, f)

# Load it back and check that it still answers queries
with open('app/models/gensim.model', 'rb') as f:
    load_model = pickle.load(f)
load_model.most_similar('james')

[('james', 0.8570922017097473),
 ('george', 0.8181617259979248),
 ('thomas', 0.8109301328659058),
 ('william', 0.8084547519683838),
 ('paul', 0.8058123588562012),
 ('henry', 0.7886716723442078),
 ('edward', 0.7804422378540039),
 ('peter', 0.7743206024169922),
 ('richard', 0.7710520625114441),
 ('robert', 0.767145037651062)]