# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient and scalable, and quite widely used. Pretrained **GloVe** embeddings can be downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/); they're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). In the cells below, though, we train a small Word2Vec model on NLTK's SemCor corpus and run the queries on that.

In [1]:
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

In [2]:
import nltk
from nltk.corpus import semcor

nltk.download('semcor')

[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\surya\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!


True

In [3]:
corpus = semcor.sents()

In [4]:
loaded_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)
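To make the `min_count=2` argument a bit more concrete, here is a minimal sketch of frequency-based vocabulary filtering on a toy corpus. The `toy_corpus` data and the `build_vocab` helper are hypothetical and only illustrate the idea; this is not Gensim's internal code.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times (hypothetical helper)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Toy stand-in for semcor.sents(): a list of tokenized sentences.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

# Words seen only once ("dog", "a", "ran") are dropped.
vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Words that get filtered out this way have no vector at all, so looking them up in the trained model raises a `KeyError`.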


## Export Model

In [None]:
# Save the trained vectors in word2vec text format
loaded_model.wv.save_word2vec_format("semcor_word2vec.txt", binary=False)

## Load the Model

In [None]:
# Load the saved vectors back as KeyedVectors
model = KeyedVectors.load_word2vec_format("semcor_word2vec.txt", binary=False)
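For reference, the word2vec text format is simple: a header line with the vocabulary size and the vector size, then one line per word with its components. The sketch below parses that layout by hand. The `read_word2vec_text` helper and the sample values are made up for illustration; in practice `KeyedVectors.load_word2vec_format` does this for you.

```python
import io
import numpy as np

def read_word2vec_text(handle):
    """Parse word2vec text format into a {word: vector} dict (illustrative helper)."""
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    # Sanity-check the parsed data against the header.
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    if any(v.shape != (vector_size,) for v in vectors.values()):
        raise ValueError("vector size does not match the header")
    return vectors

# Made-up 3-dimensional vectors; the file written above would have 100 columns.
sample = io.StringIO("2 3\ncoffee 0.1 0.2 0.3\ntea 0.4 0.5 0.6\n")
toy_vectors = read_word2vec_text(sample)
print(toy_vectors["coffee"].shape)
```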

In [5]:
# look up the vector for a word; its shape is the embedding size
model['coffee'].shape

(100,)

### Similarity

In [7]:
model.most_similar('pink')

[('Democratic', 0.9883460402488708),
 ('flat', 0.9872950911521912),
 ('silk', 0.9871722459793091),
 ('Hillsboro', 0.986665666103363),
 ('cracked', 0.9863787889480591),
 ('German', 0.9863481521606445),
 ('cotton', 0.9861451387405396),
 ('Green', 0.9861379861831665),
 ('impulses', 0.9859717488288879),
 ('Britain', 0.9852620959281921)]
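The scores returned by `most_similar` are cosine similarities. As a rough sketch of the ranking, assuming a plain dict of hand-made 3-dimensional vectors (the `rank_by_cosine` helper and the numbers are hypothetical, not the model's real values):

```python
import numpy as np

def rank_by_cosine(query, vectors, topn=2):
    """Rank all other words by cosine similarity to `query` (illustrative helper)."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize every row to unit length so a dot product is a cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]
    ranked = sorted(zip(words, sims), key=lambda pair: -pair[1])
    return [(w, float(s)) for w, s in ranked if w != query][:topn]

# Hand-made toy vectors, chosen so that colors point in similar directions.
toy_vectors = {
    "pink":  [1.0, 0.9, 0.0],
    "red":   [1.0, 1.0, 0.1],
    "blue":  [0.8, 0.2, 0.3],
    "train": [0.0, 0.1, 1.0],
}
print(rank_by_cosine("pink", toy_vectors))
```

When nearly every neighbour scores around 0.98, as in the output above, the vectors are all pointing in almost the same direction, which is a sign that the model has not learned much from such a small corpus.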

In [8]:
model.most_similar('coke')

[('imperial', 0.629427969455719),
 ('Anyhow', 0.6176542043685913),
 ('against', 0.6168307065963745),
 ('envy', 0.6141186952590942),
 ('slick', 0.6137019395828247),
 ('Beckstrom', 0.605615496635437),
 ('vegetable', 0.6040188074111938),
 ('balcony', 0.5982711911201477),
 ('under', 0.596778154373169),
 ('performer', 0.5959104299545288)]

In [9]:
model.most_similar('banana')

[('few', 0.7633732557296753),
 ('serene', 0.7134090065956116),
 ('given', 0.7069408297538757),
 ('impetus', 0.704617977142334),
 ('boxcar', 0.7006807923316956),
 ("bull's-eye", 0.7005929946899414),
 ('tact', 0.6964358687400818),
 ('industrialization', 0.6943269968032837),
 ('excite', 0.6821343302726746),
 ('pupil', 0.6783847808837891)]

In [10]:
model.most_similar('language')

[('character', 0.985231339931488),
 ('activity', 0.9825721383094788),
 ('popular', 0.9822302460670471),
 ('powerful', 0.9817714691162109),
 ('mutual', 0.9816532731056213),
 ('composition', 0.981643795967102),
 ('payment', 0.980437159538269),
 ('anxiety', 0.9799349308013916),
 ('comparative', 0.9798012971878052),
 ('emotional', 0.9792170524597168)]

In [11]:
# "plant" has multiple meanings (e.g. a factory or vegetation)
model.most_similar("plant")

[('particle', 0.9877092242240906),
 ('stockholders', 0.9855148196220398),
 ('load', 0.9850847721099854),
 ('transfer', 0.9849750399589539),
 ('thereby', 0.9849516749382019),
 ('blood', 0.9843006730079651),
 ('missile', 0.9841243624687195),
 ('safety', 0.9841083884239197),
 ('management', 0.9840800166130066),
 ('balance', 0.9838182926177979)]

In [12]:
model.most_similar(negative='banana')

[('Cowley', 0.392792671918869),
 ('bristle', 0.38694971799850464),
 ('dashes', 0.3561495244503021),
 ('Happened', 0.3278951346874237),
 ('sniff', 0.32683637738227844),
 ('forgit', 0.32491692900657654),
 ('asterisks', 0.3121803104877472),
 ('figger', 0.30364561080932617),
 ('photographing', 0.28511226177215576),
 ('pronouncements', 0.2850145995616913)]

In [13]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

offices: 0.9763


In [16]:
result = model.most_similar(positive=['boy', 'sun'], negative=['day'])
print("{}: {:.4f}".format(*result[0]))

clergyman: 0.9418
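The SemCor-trained model gives `offices` for woman + king - man, which is not the textbook answer. To show what the arithmetic is meant to do, here is a minimal sketch on hand-built toy vectors where the dimensions loosely stand for royalty, maleness, and femaleness. The `analogy_query` helper is hypothetical and only approximates what `most_similar(positive=..., negative=...)` computes (combine unit-length vectors, then rank by cosine similarity, skipping the input words).

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def analogy_query(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = {w: float(unit(v) @ target) for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: -item[1])[:topn]

# Hand-built toy vectors: [royalty, maleness, femaleness]. Not real embeddings.
toy_vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}
result = analogy_query(["woman", "king"], ["man"], toy_vectors)
print("{}: {:.4f}".format(*result[0]))
```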


### Cosine Similarity

We talked about this in the last class. Here we can conveniently use `distance` to get the cosine distance between two words. Note that distance = 1 - similarity, so a smaller distance means the two words are more similar.

In [17]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

# we'd expect dog to be closer to cat than to fruit,
# but this small model puts dog slightly closer to fruit
w1_w2_dist, w1_w3_dist

(0.10415136814117432, 0.06923520565032959)
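The relation distance = 1 - similarity can be written out directly with NumPy. This is a small sketch on made-up 2-dimensional vectors; `cosine_similarity` and `cosine_distance` are helper names chosen here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) between u and v: dot product divided by the product of norms."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

a = [1.0, 0.0]
b = [1.0, 1.0]  # 45 degrees away from a
c = [0.0, 2.0]  # orthogonal to a; length does not matter

print(cosine_distance(a, a), cosine_distance(a, b), cosine_distance(a, c))
```

Identical directions give a distance of 0, orthogonal vectors give 1, and opposite directions would give 2.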

In [18]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

# "happy" (w1) is closer to "sad" (w3) than to "cheerful" (w2)!
# cosine similarity does not handle antonyms well: they appear in similar contexts
w1_w2_dist, w1_w3_dist

(0.26817983388900757, 0.1116284728050232)

### Bias

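One very important thing to keep in mind: NLP models are biased. Word embeddings pick up whatever associations are present in their training text, including stereotypes about gender and occupation, and that can do real harm. A common way to probe this is with analogy queries like the ones below.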

In [19]:
import pprint

pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('roles', 0.9749366641044617),
 ('engagement', 0.9748738408088684),
 ('accompanying', 0.9724706411361694),
 ('quack', 0.9723111391067505),
 ('main', 0.9716476798057556),
 ('dirty', 0.9716293811798096),
 ('column', 0.9714378118515015),
 ('maintaining', 0.9712047576904297),
 ('original', 0.9711717367172241),
 ('excitability', 0.9701337218284607)]


In [20]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('person', 0.9113554358482361),
 ('foolish', 0.9037187099456787),
 ('stretch', 0.900959312915802),
 ('statement', 0.9004195928573608),
 ('sign', 0.899873673915863),
 ('move', 0.8957896828651428),
 ('child', 0.8938148617744446),
 ('word', 0.8913713693618774),
 ('explain', 0.8908658623695374),
 ('play', 0.8896649479866028)]


In [21]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('Palace', 0.9634449481964111),
 ('early', 0.9596941471099854),
 ('professor', 0.95484858751297),
 ('sentence', 0.9547505974769592),
 ('announced', 0.9539727568626404),
 ('following', 0.9528915882110596),
 ('reported', 0.9523118734359741),
 ('Class', 0.9515062570571899),
 ('morning', 0.9514616131782532),
 ('Nation', 0.9512467980384827)]


### Analogy

In [22]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [31]:
analogy('bat','ball','stick')

'protect'

In [35]:
analogy('protect', 'I', 'Tom')

'she'

In [38]:
analogy('tall', 'taller', 'long')

'far'

In [39]:
analogy('good', 'fantastic', 'bad')

'chiefly'

In [40]:
analogy('bird', 'fly', 'human')

'power'

In [41]:
#which word in the list does not belong
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
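As far as I understand it, `doesnt_match` picks the word whose vector is least similar to the average of all the words in the list. Here is a minimal sketch of that idea on made-up 2-dimensional vectors; the `odd_one_out` helper is hypothetical and may differ from Gensim's implementation in details.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word least similar (by cosine) to the mean of the group."""
    unit = np.array([
        np.asarray(vectors[w], dtype=float) / np.linalg.norm(vectors[w])
        for w in words
    ])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    scores = unit @ mean  # cosine similarity of each word to the group mean
    return words[int(np.argmin(scores))]

# Made-up toy vectors: the three sodas point one way, water another.
toy_vectors = {
    "coke":   [1.0, 0.1],
    "pepsi":  [0.9, 0.2],
    "sprite": [1.0, 0.3],
    "water":  [0.1, 1.0],
}
print(odd_one_out("coke pepsi sprite water".split(), toy_vectors))
```

With well-trained embeddings you would expect `water` here as well; the `coke` answer above is another symptom of the small training corpus.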


## Export Model

In [42]:
import pickle

In [43]:
filename = 'glove_gensim_model.pkl'
# use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)
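To get the model back later, read the pickle with `pickle.load`. The round trip below is a sketch that uses a plain dict as a stand-in for the `KeyedVectors` object and a temporary directory instead of the notebook's working directory, so both of those are assumptions. Gensim's own `model.save(...)` and `KeyedVectors.load(...)` are an alternative worth considering.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the KeyedVectors `model`.
toy_model = {"coffee": [0.1, 0.2, 0.3]}

# Write to a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and check that nothing was lost.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.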