# Exercise 2.1 : Intrinsic Evaluation

## Acknowledgement

This notebook contains components of a notebook created by Pia Sommerauer. 
The original notebook (with more slides and examples, including code to create word embeddings) can be found here:

https://github.com/PiaSommerauer/distributional_semantics

## Introduction

This notebook walks you through Exercise 2.1 of Assignment 2. The goal of this exercise is to (1) get hands-on experience in comparing word embeddings, and (2) carry out an intrinsic evaluation and reflect on it.

The notebook illustrates:

* how to load a distributional semantic model in Python (with indications of where to obtain such models)
* how to calculate distances between vectors
* how to create a ranking between vector pairs based on distances and compare this to a gold ranking
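
As a minimal sketch of the second point: gensim's `similarity` method returns the cosine similarity between two word vectors. The snippet below computes the same measure by hand. The three-dimensional vectors are made up for illustration and are not real embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only)
king = [1.0, 2.0, 3.0]
queen = [2.0, 4.0, 6.0]    # points in the same direction as `king`
apple = [3.0, 0.0, -1.0]   # orthogonal to `king`

sim_same = cosine_similarity(king, queen)   # close to 1: same direction
sim_orth = cosine_similarity(king, apple)   # close to 0: unrelated directions
print(sim_same, sim_orth)
```

Cosine similarity ignores vector length and only compares direction, which is why `king` and `queen` come out as maximally similar here even though one vector is twice as long as the other.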

In [None]:
# We will be using gensim: https://pypi.org/project/gensim/
# It provides implementations of various forms of language modeling
# including functions to create and work with word embeddings

# You can install gensim by running `pip install gensim` on your command line
# You can run this from the notebook by uncommenting the line below:
# %pip install gensim

# Other packages you may want to install if you do not have them installed already
# %pip install pandas
# %pip install scipy


# You only need to install each package once, so if you run this notebook multiple times,
# skip this cell or comment the install lines out again once the packages are installed.



In [1]:
import gensim
# for loading a stored model 
from gensim.models import KeyedVectors

# pandas is a useful package for dealing with data structures
import pandas as pd
import numpy as np


## Downloading a distributional semantic model

There are many high-quality distributional semantic models available. They are created from large corpora and have broad vocabulary coverage. Creating such models requires a lot of data and computation, and models trained on more data generally perform better.
You are therefore generally best off using one of these pretrained models.

Sommerauer's notebook walks you through creating your own models: https://github.com/PiaSommerauer/distributional_semantics

For the exercises for this component, we will use existing models. 

***Note though: these models are big!***

The Google word embeddings created with word2vec, which we use in the examples, can be found here:

https://code.google.com/archive/p/word2vec/

GloVe also provides embeddings, which take up a bit less space:

https://nlp.stanford.edu/projects/glove/

Note, though, that you may have to apply a small conversion procedure before the gensim code works with GloVe embeddings (they are formatted slightly differently from the output of word2vec).

It is explained here: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
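
To illustrate the format difference: a word2vec text file starts with a header line giving the vocabulary size and the vector dimensionality, while a GloVe text file omits that line. The sketch below adds the header by hand. `add_word2vec_header` is a hypothetical helper written for this illustration, not a gensim function, and the two-word file is toy data.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending the 'count dim' header."""
    n_words, dim = 0, 0
    # First pass: count the words and read the dimensionality from line 1
    with open(glove_path, encoding='utf-8') as src:
        for line in src:
            if n_words == 0:
                dim = len(line.rstrip('\n').split(' ')) - 1
            n_words += 1
    # Second pass: write the header followed by the unchanged vectors
    with open(glove_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        dst.write(f'{n_words} {dim}\n')
        for line in src:
            dst.write(line)
    return n_words, dim

# Toy GloVe-style file with two words and three dimensions (made-up values)
with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, 'toy_glove.txt')
    w2v_file = os.path.join(tmp, 'toy_w2v.txt')
    with open(glove_file, 'w', encoding='utf-8') as f:
        f.write('cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n')
    counts = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, encoding='utf-8') as f:
        converted_lines = f.read().splitlines()

print(converted_lines[0])
```

A file converted this way is a text file, so it would be loaded with `binary=False`.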


In [2]:
# Loading a stored model

# Please make sure that the path below points to the location where you stored your word embeddings
# (for example `../models/GoogleNews-vectors-negative300.bin.gz`).
# If you are using a non-binary model, you will need to change binary=True to binary=False
ds_model = KeyedVectors.load_word2vec_format(
    '/Users/hernando/Desktop/NLP/Assignment2/NLP_tech_distributional_semantics/models/GoogleNews-vectors-negative300.bin.gz',
    binary=True,
)

# a first test with the model (you can replace "student" by other words)
ds_model.most_similar("student")

[('students', 0.7294867038726807),
 ('Student', 0.6706663370132446),
 ('teacher', 0.6301366090774536),
 ('stu_dent', 0.6240992546081543),
 ('faculty', 0.6087332963943481),
 ('school', 0.6055628061294556),
 ('undergraduate', 0.6020306348800659),
 ('university', 0.6005399823188782),
 ('undergraduates', 0.5755698680877686),
 ('semester', 0.573759913444519)]

In [3]:
ds_model.most_similar("cry")

[('crying', 0.6610245704650879),
 ('cries', 0.6551704406738281),
 ('weep', 0.5748156309127808),
 ('scream', 0.5718929767608643),
 ('bawling', 0.5450262427330017),
 ('sob', 0.5303637385368347),
 ('cried', 0.5280172228813171),
 ('sisters_nieces_nephews', 0.5045138597488403),
 ('bawl', 0.5011399388313293),
 ('yell', 0.5009252429008484)]

In [4]:
# Similarity: a small-scale experiment. Feel free to play with this and replace the terms

cos_man_woman = ds_model.similarity('man', 'woman')
cos_man_dog = ds_model.similarity('man', 'dog')

print('Man and woman should be more similar than man and dog:')
print('True!' if cos_man_woman > cos_man_dog else 'False')
# The two similarity scores are printed in either case
print('man-woman', cos_man_woman)
print('man-dog', cos_man_dog)

Man and woman should be more similar than man and dog:
True!
man-woman 0.76640123
man-dog 0.3088647


In [5]:
# Load the SimLex-999 gold standard (a tab-separated file)
simlex_data = pd.read_csv('/Users/hernando/Desktop/NLP/Assignment2/NLP_tech_distributional_semantics/SimLex-999/SimLex-999.txt', sep='\t')

In [6]:
simlex_data 

Unnamed: 0,word1,word2,POS,SimLex999,conc(w1),conc(w2),concQ,Assoc(USF),SimAssoc333,SD(SimLex)
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.20,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93
5,fast,rapid,A,8.75,3.32,3.07,2,5.66,1,1.68
6,happy,glad,A,9.17,2.56,2.36,1,5.49,1,1.59
7,short,long,A,1.23,3.61,3.18,2,5.36,1,1.58
8,stupid,dumb,A,9.58,1.75,2.36,1,5.26,1,1.48
9,weird,strange,A,8.93,1.59,1.86,1,4.26,1,1.30


In [7]:
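ds_scores = {}      # word pair -> cosine similarity according to the model
huma_scores = {}    # word pair -> rank in the human (gold) ranking, 1 = most similar
human_scores = []   # SimLex-999 scores, ordered from most to least similar
model_scores = []   # model similarities, in the same pair order as human_scores
k = 1
# Iterate over the pairs from the highest to the lowest human similarity score
for index, row in simlex_data.sort_values(by='SimLex999', ascending=False).iterrows():
    wordpair = row['word1'] + '-' + row['word2']
    human_scores.append(row['SimLex999'])
    ds_score = ds_model.similarity(row['word1'], row['word2'])
    model_scores.append(ds_score)
    ds_scores[wordpair] = ds_score
    huma_scores[wordpair] = k
    k += 1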

    
# Also save the model's ranked output to a file for inspection.
# The `with` block closes the file, so all lines are written before it is read again.
with open('/Users/hernando/Desktop/NLP/Assignment2/NLP_tech_distributional_semantics/SimLex-999/ds_output_simlex_pairs.txt', 'w') as ds_ranked_output:
    for index, word_pair in enumerate(sorted(ds_scores, key=ds_scores.get, reverse=True)):
        ds_ranked_output.write(str(index) + '\t' + word_pair + '\t' + str(ds_scores[word_pair]) + '\n')

In [8]:
# Calculate Spearman's rho between the human scores and the model scores

from scipy.stats import spearmanr

spearmanr(human_scores, model_scores)



SpearmanrResult(correlation=0.44196551091403796, pvalue=5.068221892023142e-49)
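
To make the measure less of a black box: Spearman's rho is the Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below is a hand-rolled illustration of that idea, not how scipy implements `spearmanr`. The helper names and the five toy scores are assumptions made for this example.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy scores for five word pairs (made-up values, not taken from SimLex-999)
toy_human = [9.2, 8.8, 5.0, 1.6, 0.9]
toy_model = [0.80, 0.60, 0.70, 0.30, 0.10]

rho = spearman_rho(toy_human, toy_model)
print(rho)
```

Because only the ranks matter, the different scales of the two score lists (0 to 10 for the human judgements, cosine values for the model) do not affect the result.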

In [9]:
model_scores

[0.9004227,
 0.49779078,
 0.26050574,
 0.81731385,
 0.73390424,
 0.5561479,
 0.38377383,
 0.6319464,
 0.568542,
 0.6589166,
 0.59902996,
 0.73785883,
 0.512326,
 0.8204511,
 0.6608624,
 0.32185915,
 0.6716382,
 0.52532834,
 0.6495278,
 0.75105584,
 0.6381518,
 0.69855225,
 0.32383794,
 0.74088913,
 0.60004014,
 0.51001453,
 0.39905614,
 0.38605824,
 0.33244228,
 0.5692449,
 0.7366201,
 0.41570213,
 0.51920044,
 0.7307427,
 0.8288328,
 0.48149034,
 0.6254826,
 0.8164579,
 0.61649424,
 0.7806021,
 0.6101885,
 0.38535285,
 0.867677,
 0.6705386,
 0.5083667,
 0.20618345,
 0.34495902,
 0.7076315,
 0.52110577,
 0.6972494,
 0.70388985,
 0.6025748,
 0.5661683,
 0.4766829,
 0.2910281,
 0.73593664,
 0.6248339,
 0.57660407,
 0.4441522,
 0.7009895,
 0.4846029,
 0.6574996,
 0.47935694,
 0.34695202,
 0.31333324,
 0.7420832,
 0.46401468,
 0.7061798,
 0.10783214,
 0.3124522,
 0.38943344,
 0.5354876,
 0.4997485,
 0.54384553,
 0.5695981,
 0.47361746,
 0.70720327,
 0.33942583,
 0.6539289,
 0.5527253,
 0.6

In [10]:
computer_rank = pd.read_csv('/Users/hernando/Desktop/NLP/Assignment2/NLP_tech_distributional_semantics/SimLex-999/ds_output_simlex_pairs.txt', sep='\t', header=None)
human_rank = pd.read_csv('/Users/hernando/Desktop/NLP/Assignment2/NLP_tech_distributional_semantics/SimLex-999.ordered.pairs.csv', sep='\t', header=None)
human_rank.columns=['rank','word','score']
computer_rank.columns=['rank','word','score']
# computer_rank['score']=computer_rank['score'].apply(lambda x : round(x,2))


In [11]:
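# Rank difference per word pair: human rank minus model rank (both 1-based).
# A large negative value means humans rank the pair as much more similar than the model does.
set_1 = {}
for i, pair in enumerate(computer_rank['word'].values):
    set_1[pair] = huma_scores[pair] - (i + 1)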
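
The sign of these rank differences is easy to misread, so here is a toy version of the same computation. The three word-pair labels and their ranks are made up for illustration.

```python
# Hypothetical human (gold) ranks: 1 = judged most similar
toy_human_rank = {'a-b': 1, 'c-d': 2, 'e-f': 3}
# Hypothetical model ordering, from most to least similar
toy_model_order = ['c-d', 'e-f', 'a-b']

# Human rank minus model rank, with the model rank counted from 1
toy_rank_diff = {
    pair: toy_human_rank[pair] - (i + 1)
    for i, pair in enumerate(toy_model_order)
}

# The most negative value marks the pair the model underrates most
most_underrated = min(toy_rank_diff, key=toy_rank_diff.get)
print(toy_rank_diff, most_underrated)
```

Here `a-b` is first in the human ranking but last in the model ranking, so it gets the most negative difference.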

In [19]:

sorted(set_1.items(), key=lambda d: d[1])

[('acquire-get', -773),
 ('keep-possess', -768),
 ('think-rationalize', -729),
 ('creator-maker', -724),
 ('satisfy-please', -723),
 ('reflection-image', -669),
 ('value-belief', -664),
 ('polite-proper', -660),
 ('make-construct', -634),
 ('crowd-bunch', -627),
 ('crime-violation', -626),
 ('attention-awareness', -616),
 ('money-salary', -616),
 ('recent-new', -615),
 ('god-spirit', -606),
 ('chair-bench', -602),
 ('come-attend', -598),
 ('appointment-engagement', -597),
 ('hallway-corridor', -586),
 ('log-timber', -585),
 ('proof-fact', -579),
 ('scarce-rare', -576),
 ('big-broad', -571),
 ('wood-log', -570),
 ('rabbi-minister', -569),
 ('wisdom-intelligence', -568),
 ('administration-management', -558),
 ('take-possess', -557),
 ('alcohol-gin', -555),
 ('mood-emotion', -554),
 ('crib-cradle', -552),
 ('cop-sheriff', -551),
 ('exotic-rare', -547),
 ('leader-manager', -544),
 ('take-obtain', -543),
 ('communication-language', -534),
 ('aisle-hall', -528),
 ('mob-crowd', -527),
 ('poli