# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. Pretrained **GloVe** embeddings can be downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/); they're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). In the cells below, though, we train a small Word2Vec model on NLTK's SemCor corpus and query it through Gensim's `KeyedVectors` interface, so the results reflect that small corpus rather than pretrained GloVe vectors.

In [1]:
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

In [2]:
import nltk
from nltk.corpus import semcor

nltk.download('semcor')

[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\surya\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!


True

In [3]:
corpus = semcor.sents()

In [5]:
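# Train Word2Vec on the SemCor sentences: 100-dimensional vectors, a context window of 5,
# and words that occur fewer than 2 times are dropped from the vocabulary
loaded_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)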
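To make the `min_count` and `window` arguments a bit more concrete, here is a minimal pure-Python sketch on a made-up two-sentence corpus. The helpers `prune_vocab` and `context_pairs` are hypothetical names, not Gensim functions, and the fixed-size window is a simplification of what the library does during training.

```python
from collections import Counter

# Toy corpus (made up for illustration), tokenized like semcor.sents()
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

def prune_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    """Return (center, context) pairs within a fixed window (simplified)."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

vocab = prune_vocab(toy_corpus, min_count=2)
pairs = context_pairs(["the", "dog", "sat"], window=1)
print(sorted(vocab))
print(pairs)
```

With `min_count=2`, only "the" and "sat" survive in this toy corpus, which is the same reason rare words are missing from the trained model's vocabulary.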


## Export Model

In [6]:
# Save the word vectors in word2vec text format
loaded_model.wv.save_word2vec_format("semcor_word2vec.txt", binary=False)

## Load the Model

In [7]:
# Load the saved vectors back as KeyedVectors
model = KeyedVectors.load_word2vec_format("semcor_word2vec.txt", binary=False)
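The word2vec text format written above is plain text: a header line with the vocabulary size and the vector size, followed by one line per word with its components. The sketch below parses a tiny made-up sample by hand to show the layout. `parse_word2vec_text` is a hypothetical helper for illustration; in practice `KeyedVectors.load_word2vec_format` does this for us.

```python
# Made-up file contents: header "<vocab_size> <vector_size>", then one word per line
sample = """2 3
coffee 0.1 0.2 0.3
tea 0.4 0.5 0.6
"""

def parse_word2vec_text(text):
    """Parse word2vec text-format content into a {word: [floats]} dict (illustrative)."""
    lines = text.strip().splitlines()
    vocab_size, vector_size = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.split()
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the number of rows")
    return vectors

toy_vectors = parse_word2vec_text(sample)
print(len(toy_vectors), len(toy_vectors["coffee"]))
```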

In [8]:
# Look up the vector for a word; its length matches vector_size=100
model['coffee'].shape

(100,)

### Similarity

In [9]:
model.most_similar('pink')

[('surface', 0.9990283846855164),
 ('area', 0.9990200400352478),
 ('tissue', 0.9989895224571228),
 ('dark', 0.9989870190620422),
 ('open', 0.9989863634109497),
 ('areas', 0.9989855289459229),
 ('study', 0.9989827871322632),
 ('black', 0.9989815354347229),
 ('following', 0.9989778399467468),
 ('control', 0.998947024345398)]

In [10]:
model.most_similar('coke')

[('lifters', 0.8624275922775269),
 ('congressmen', 0.8607659935951233),
 ('Webster', 0.8605730533599854),
 ('tagged', 0.8601006865501404),
 ('Molly', 0.8597769737243652),
 ('raft', 0.859660267829895),
 ('Quasimodo', 0.8573532104492188),
 ('enlarge', 0.8571567535400391),
 ('Coney', 0.8570910096168518),
 ('Recently', 0.8566099405288696)]

In [11]:
model.most_similar('banana')

[('links', 0.9171523451805115),
 ('Aureomycin', 0.9163317680358887),
 ('anecdote', 0.9154340028762817),
 ('mast', 0.914303719997406),
 ('sixteenth', 0.9126900434494019),
 ('equitable', 0.9125428199768066),
 ('grams', 0.9118291735649109),
 ('males', 0.9118131399154663),
 ('supervisors', 0.9114491939544678),
 ('per', 0.9113367795944214)]

In [12]:
model.most_similar('language')

[('gave', 0.9996885061264038),
 ('known', 0.9996610283851624),
 ('sort', 0.9996585249900818),
 ('added', 0.9996577501296997),
 ('work', 0.9996575117111206),
 ('whether', 0.9996544122695923),
 ('taking', 0.9996480941772461),
 ('already', 0.9996453523635864),
 ('job', 0.9996443390846252),
 ('family', 0.9996426701545715)]

In [13]:
# "plant" has multiple meanings, but it still gets only a single vector
model.most_similar("plant")

[('field', 0.9998185038566589),
 ('large', 0.9997922778129578),
 ('among', 0.9997918009757996),
 ('control', 0.9997820854187012),
 ('top', 0.9997804760932922),
 ('board', 0.9997727274894714),
 ('action', 0.9997720718383789),
 ('given', 0.9997695088386536),
 ('small', 0.9997676610946655),
 ('government', 0.999762237071991)]

In [14]:
model.most_similar(negative='banana')

[('Christiansen', 0.5917015075683594),
 ('providence', 0.442947655916214),
 ('Might', 0.3963949382305145),
 ('festivus', 0.3477632999420166),
 ('Flannagan', 0.3077352046966553),
 ('reactionary', 0.2959425747394562),
 ('Calenda', 0.29543668031692505),
 ('twinkle', 0.23313018679618835),
 ("Y'all", 0.19548602402210236),
 ('seedbed', 0.19411517679691315)]

In [15]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

m.: 0.9953


In [16]:
result = model.most_similar(positive=['boy', 'sun'], negative=['day'])
print("{}: {:.4f}".format(*result[0]))

world: 0.9993
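The `positive`/`negative` queries above boil down to vector arithmetic followed by a cosine ranking. Here is a minimal sketch of that idea on made-up 3-D vectors: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. `toy_most_similar` and the vectors are hypothetical, and this is a simplification rather than Gensim's actual implementation.

```python
import math

# Made-up vectors, chosen by hand so the analogy works out
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.2, 0.2, -1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def toy_most_similar(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # the query words themselves are skipped
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar(toy, positive=["woman", "king"], negative=["man"])
print(ranking[0][0])
```

On these hand-picked vectors the top result is "queen". The SemCor-trained model above gives much noisier answers because its vectors were learned from a small corpus.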


### Cosine Similarity

We talked about cosine similarity in the last class. Here we can conveniently use `distance` to get the cosine distance between two words. Note that distance = 1 - similarity, so a smaller distance means the two words are more similar.
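As a quick refresher, the relationship distance = 1 - similarity can be checked by hand. The sketch below uses made-up 2-D vectors and plain Python; the function names are our own, not Gensim's.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

u = [1.0, 0.0]
v = [1.0, 1.0]  # 45 degrees away from u
w = [0.0, 2.0]  # orthogonal to u

print(cosine_similarity(u, v))  # cos(45 degrees), about 0.7071
print(cosine_distance(u, w))    # orthogonal vectors: distance 1.0
```

Note that the length of a vector does not matter here: `w` is twice as long as a unit vector, but only its direction affects the result.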

In [17]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

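# We would expect dog to be much closer to cat than to fruit, but in this small
# SemCor model the dog-fruit distance is actually the smaller of the two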
w1_w2_dist, w1_w3_dist

(0.0065582990646362305, 0.0028557777404785156)

In [18]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

# "happy" is closer to "sad" (its antonym) than to "cheerful" (its synonym)!!
# Cosine similarity over these vectors does not tell antonyms apart from synonyms
w1_w2_dist, w1_w3_dist

(0.026732563972473145, 0.005693316459655762)

### Bias

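One very important thing to keep in mind: NLP models are biased. Word embeddings pick up whatever associations appear in their training text, including stereotypes, and that is a serious problem. The analogy queries below are the usual way to probe for this.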

In [19]:
import pprint

pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('two', 0.9975697994232178),
 ('p.', 0.997402548789978),
 ('50', 0.9970831871032715),
 ('2', 0.997016191482544),
 ('m.', 0.9970096349716187),
 ('three', 0.9968491196632385),
 ('marketing', 0.9968469142913818),
 ('number', 0.9968372583389282),
 ('per', 0.9968106150627136),
 ('months', 0.9967740178108215)]


In [20]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('thought', 0.998820424079895),
 ('voice', 0.9987970590591431),
 ('boy', 0.9986830353736877),
 ('always', 0.9986678957939148),
 ('ever', 0.9986604452133179),
 ('done', 0.9986584782600403),
 ('leave', 0.9986366629600525),
 ('way', 0.9986212253570557),
 ('hear', 0.9986149668693542),
 ('better', 0.9986088275909424)]


In [21]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('50', 0.9993188977241516),
 ('number', 0.9992559552192688),
 ('20', 0.9991830587387085),
 ('feed', 0.9991568326950073),
 ('total', 0.9991306662559509),
 ('construction', 0.9991219639778137),
 ('social', 0.9991145133972168),
 ('6', 0.9991064071655273),
 ('10', 0.9990988969802856),
 ('25', 0.9990975856781006)]


### Analogy

In [22]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [23]:
analogy('bat','ball','stick')

'famous'

In [25]:
analogy('tall', 'taller', 'long')

'bold'

In [26]:
analogy('good', 'fantastic', 'bad')

'm.'

In [27]:
analogy('bird', 'fly', 'human')

'programs'

In [28]:
# Which word in the list does not belong?
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
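Roughly speaking, `doesnt_match` looks for the word whose vector sits furthest from the centre of the group. The sketch below illustrates that idea on made-up 2-D vectors: average the unit vectors, then return the word least similar to that average. `toy_doesnt_match` and the vectors are hypothetical, and this is our reading of the idea, not Gensim's code.

```python
import math

# Made-up vectors: three sodas pointing one way, "water" pointing another
toy = {
    "coke": [1.0, 0.1],
    "pepsi": [0.9, 0.2],
    "sprite": [1.0, 0.0],
    "water": [0.1, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(vectors, words):
    """Return the word least similar (by cosine) to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

odd_one = toy_doesnt_match(toy, "coke pepsi sprite water".split())
print(odd_one)
```

With these hand-picked vectors the odd one out is "water". The SemCor model above picks "coke" instead, most likely because it saw too few soda-related sentences to place these words sensibly.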


## Export Model

In [29]:
import pickle

In [30]:
filename = 'glove_gensim_model.pkl'
# Use a context manager so the file is closed (and flushed) after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)
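To use the pickled model later, `pickle.load` reverses `pickle.dump`. The sketch below shows the round trip with a plain dictionary standing in for the model and a temporary directory instead of the real file; `toy_model` and the path are made up for illustration. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the real model: any picklable Python object works the same way
toy_model = {"coffee": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```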