# GloVe (Gensim)

To explore word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. We're going to use **GloVe** embeddings, downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip).

In [1]:
from gensim.test.utils import datapath
from gensim.models import KeyedVectors

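# Load the GloVe vectors. The GloVe text file has no word2vec-style header line,
# so no_header=True (Gensim 4.0+) lets load_word2vec_format read it directly.
# Note: datapath() resolves names inside Gensim's own test-data directory,
# so this assumes glove.6B.100d.txt has been copied there.
glove_file = datapath('glove.6B.100d.txt')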
model = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

# Test
print("Vector shape for 'coffee':", model['coffee'].shape)

Vector shape for 'coffee': (100,)
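
GloVe's plain-text files hold one token per line followed by its vector components, without the `<vocab_size> <dimensions>` header line that the word2vec text format starts with; that is why `no_header=True` is passed above. Here is a minimal sketch of parsing one such line by hand. The sample line is made up for illustration and is not taken from the real file, and `parse_glove_line` is a hypothetical helper name.

```python
# Hypothetical sample in GloVe text layout: a token, then space-separated floats.
sample_line = "coffee 0.12 -0.50 0.33 0.07"


def parse_glove_line(line):
    """Split one GloVe-format line into (token, list of floats)."""
    token, *values = line.rstrip().split(" ")
    return token, [float(v) for v in values]


token, vector = parse_glove_line(sample_line)
print(token, len(vector))
```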


### Evaluate Analogies (Semantic and Syntactic Accuracy)

In [2]:
with open('word-test_semantic.txt', 'r') as file:
    data_sem = file.readlines()

with open('word-test_syntactic.txt', 'r') as file:
    data_syn = file.readlines()

In [3]:
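def clean_data(data):
    """Keep only lines with exactly four tokens (one analogy a : b :: c : d)."""
    cleaned_data = []
    for line in data:
        cleaned_line = line.strip().split()

        if len(cleaned_line) == 4:
            cleaned_data.append(cleaned_line)
        else:
            # Section headers such as ": gram8-plural" end up here.
            print(f"Skipping malformed line: {line.strip()}")
    return cleaned_data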

data_sem = clean_data(data_sem)
data_syn = clean_data(data_syn)

print("Semantic Analogies:", data_sem[:2])  
print("Syntactic Analogies:", data_syn[:2]) 

Skipping malformed line: : gram8-plural
Skipping malformed line: : gram9-plural-verbs
Semantic Analogies: [['Athens', 'Greece', 'Baghdad', 'Iraq'], ['Athens', 'Greece', 'Bangkok', 'Thailand']]
Syntactic Analogies: [['dancing', 'danced', 'decreasing', 'decreased'], ['dancing', 'danced', 'describing', 'described']]


In [4]:
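def predict_glove_word(model, analogy):
    """Predict the fourth word of an analogy word1 : word2 :: word3 : ?"""
    word1, word2, word3, _ = analogy  # The gold answer is not used for prediction
    try:
        # Nearest neighbour of (word3 + word2 - word1), excluding the query words
        predicted_word = model.most_similar(positive=[word3, word2], negative=[word1], topn=1)
        return predicted_word[0][0]  # most_similar returns (word, score) pairs
    except KeyError:
        return '<UNK>'  # Return <UNK> if any word is out of vocabulary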
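
The query above is the usual vector-offset analogy: for `a : b :: c : ?` it looks for the word whose vector is closest, by cosine similarity, to roughly `b - a + c`, leaving out the query words themselves. The sketch below reproduces that idea in plain Python on a tiny made-up vocabulary. The vectors and the names `toy_vectors`, `cosine`, and `solve_analogy` are illustrative assumptions; Gensim's own implementation differs in details, such as normalizing the input vectors first.

```python
import math

# Made-up 2-D vectors, chosen only so the offset arithmetic is easy to follow.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(solve_analogy(toy_vectors, "man", "woman", "king"))
```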

In [5]:
def calculate_glove_accuracy(analogies, model):
    correct = 0
    total = 0
    for analogy in analogies:
        total += 1
        predicted_word = predict_glove_word(model, analogy)
        if predicted_word.lower() == analogy[3].lower():
            correct += 1

    return correct / total if total > 0 else 0

# Calculate accuracies for semantic and syntactic analogies
semantic_accuracy = calculate_glove_accuracy(data_sem, model)
syntactic_accuracy = calculate_glove_accuracy(data_syn, model)

print(f"GloVe Semantic Accuracy: {semantic_accuracy * 100:.2f}%")
print(f"GloVe Syntactic Accuracy: {syntactic_accuracy * 100:.2f}%")


GloVe Semantic Accuracy: 0.00%
GloVe Syntactic Accuracy: 61.99%
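
A semantic accuracy of exactly 0.00% is suspicious. The `glove.6B` vectors are uncased, so capitalized tokens such as `Athens` or `Greece` are most likely missing from the vocabulary; every such query then raises `KeyError` and is scored as `<UNK>`. One possible variant, which lowercases the query words before the lookup, is sketched below. The function name `predict_glove_word_lower` is hypothetical, and the sketch has not been run against the real vectors, so no new accuracy figure is claimed here.

```python
def predict_glove_word_lower(model, analogy):
    """Like predict_glove_word, but lowercases the three query words first.

    `model` is assumed to expose a Gensim-style most_similar() method.
    """
    word1, word2, word3 = (w.lower() for w in analogy[:3])
    try:
        best = model.most_similar(positive=[word3, word2], negative=[word1], topn=1)
        return best[0][0]
    except KeyError:
        return '<UNK>'  # Still out of vocabulary after lowercasing
```

Because `calculate_glove_accuracy` already compares the prediction and the gold answer in lowercase, swapping this function in should be enough to rerun the evaluation.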


### Evaluate Correlation with Human Judgments

In [6]:
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error

def load_similarity_data(file_path):
    with open(file_path, 'r') as file:
        data = file.readlines()
    pairs = []
    scores = []
    for line in data:
        fields = line.strip().split()  # Split once and reuse the result
        if len(fields) == 3:
            word1, word2, score = fields
            pairs.append((word1, word2))
            scores.append(float(score))
        else:
            print(f"Skipping invalid line: {line.strip()}")
    return pairs, np.array(scores)
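
The output of the evaluation cell further down shows rows such as `t love sex 6.77` being skipped: some of the WordSim files carry a leading relation tag, so their rows have four fields instead of three. If those rows should be evaluated as well, one option is to accept three or four fields and keep the last three. A sketch, using the hypothetical helper name `parse_similarity_line`:

```python
def parse_similarity_line(line):
    """Return (word1, word2, score), or None for comments and malformed rows."""
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):
        return None  # Blank line or comment/header line
    fields = stripped.split()
    if len(fields) not in (3, 4):
        return None
    # With four fields, the first one is the relation tag and is dropped.
    word1, word2, score = fields[-3:]
    try:
        return word1, word2, float(score)
    except ValueError:
        return None  # Last field was not a number


print(parse_similarity_line("t\tlove\tsex\t6.77"))
print(parse_similarity_line("tiger\tcat\t7.35"))
print(parse_similarity_line("# i = identical tokens"))
```

Once the tagged rows are parsed, any pair that appears in more than one file is counted more than once in a pooled score, so it may be cleaner to evaluate each file separately.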


In [7]:
def compute_cosine_similarity(word1, word2, model):
    try:
        return model.similarity(word1, word2)
    except KeyError:
        return 0  # If the word is not in the model, return a similarity of 0

def calculate_mse_and_correlation(model, similarity_files):
    all_true_scores = []
    all_predicted_scores = []
    
    for file_path in similarity_files:
        pairs, true_scores = load_similarity_data(file_path)
        predicted_scores = []
        
        for word1, word2 in pairs:
            similarity = compute_cosine_similarity(word1, word2, model)
            predicted_scores.append(similarity)
        
        all_true_scores.extend(true_scores)
        all_predicted_scores.extend(predicted_scores)
    
    all_true_scores = np.array(all_true_scores)
    all_predicted_scores = np.array(all_predicted_scores)
    
    # Compute MSE
    mse = mean_squared_error(all_true_scores, all_predicted_scores)
    
    # Compute Spearman correlation
    spearman_corr, _ = spearmanr(all_true_scores, all_predicted_scores)
    
    return mse, spearman_corr
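
Two details of the function above affect how its numbers should be read. Human ratings in these files run from 0 to 10, whereas cosine similarity lies in [-1, 1], so the raw MSE mostly measures the gap between the two scales. Out-of-vocabulary pairs are also scored as 0 rather than dropped, which affects both the MSE and the rank correlation. A minimal sketch of one alternative, dividing the human scores by 10 and skipping pairs the model cannot score, is shown below. `rescaled_mse`, `similarity_fn`, `toy_table`, and `toy_sim` are illustrative names and values, not part of the original notebook.

```python
def rescaled_mse(pairs, human_scores, similarity_fn):
    """MSE between human scores scaled to [0, 1] and model similarities.

    Pairs for which similarity_fn raises KeyError are skipped, not scored as 0.
    Returns (mse, number_of_pairs_used).
    """
    squared_errors = []
    for (word1, word2), human in zip(pairs, human_scores):
        try:
            predicted = similarity_fn(word1, word2)
        except KeyError:
            continue  # Out-of-vocabulary pair: leave it out of the average
        squared_errors.append((human / 10.0 - predicted) ** 2)
    if not squared_errors:
        return float('nan'), 0
    return sum(squared_errors) / len(squared_errors), len(squared_errors)


# Toy stand-in for model.similarity, with made-up similarity values.
toy_table = {("tiger", "cat"): 0.7, ("book", "paper"): 0.6}


def toy_sim(word1, word2):
    return toy_table[(word1, word2)]  # Raises KeyError for unknown pairs


pairs = [("tiger", "cat"), ("book", "paper"), ("foo", "bar")]
scores = [7.35, 7.46, 5.0]
mse, used = rescaled_mse(pairs, scores, toy_sim)
print(f"MSE on {used} in-vocabulary pairs: {mse:.4f}")
```

With the real vectors, `model.similarity` could be passed as `similarity_fn`, since it raises `KeyError` for unknown words.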



In [8]:
similarity_files = [
    'wordsim_relatedness_goldstandard.txt',
    'wordsim_similarity_goldstandard.txt',
    'wordsim353_agreed.txt',
    'wordsim353_annotator1.txt',
    'wordsim353_annotator2.txt',
]
mse, spearman_corr = calculate_mse_and_correlation(model, similarity_files)

print(f'Mean Squared Error: {mse}')
print(f'Spearman Correlation: {spearman_corr}')

Skipping invalid line: #Word 1	Word 2	Human (mean)
Skipping invalid line: # i = identical tokens
Skipping invalid line: # s = synonym (at least in one meaning of each)
Skipping invalid line: # a = antonyms (at least in one meaning of each)
Skipping invalid line: # h = first is hyponym of second (at least in one meaning of each)
Skipping invalid line: # H = first is hyperonym of second (at least in one meaning of each)
Skipping invalid line: # S = sibling terms (terms with a common hyperonymy)
Skipping invalid line: # m = first is part of the second one (at least in one meaning of each)
Skipping invalid line: # M = second is part of the first one (at least in one meaning of each)
Skipping invalid line: # t = topically related, but none of the above
Skipping invalid line: #
Skipping invalid line: t	love	sex	6.77
Skipping invalid line: h	tiger	cat	7.35
Skipping invalid line: i	tiger	tiger	10.00
Skipping invalid line: t	book	paper	7.46
Skipping invalid line: M	computer	keyboard	7.62
Skippi