# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. Pretrained **GloVe** embeddings can be downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/); they're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). In the cells below, we train a small Word2Vec model on NLTK's SemCor corpus and explore its vectors through the same Gensim interface.
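
To make the file format concrete, here is a minimal sketch of reading GloVe-style text, assuming each line holds a word followed by its space-separated float components and that there is no header line. The function name `read_glove_text` and the 3-dimensional sample are made up for illustration; a real file from the zip (for example `glove.6B.100d.txt`) is far larger. Recent Gensim 4 releases appear to accept `no_header=True` in `KeyedVectors.load_word2vec_format` for such files, but check that against your installed version.

```python
import io

def read_glove_text(handle):
    """Parse GloVe-style text: a word, then its float components, per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors

# Made-up 3-dimensional sample standing in for a real GloVe file (assumption)
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.25 0.0\n")
toy_vectors = read_glove_text(sample)
print(toy_vectors["cat"])
```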

In [1]:
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

In [2]:
import nltk
from nltk.corpus import semcor

nltk.download('semcor')

[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\surya\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!


True

In [3]:
corpus = semcor.sents()

In [4]:
loaded_model = Word2Vec(corpus, vector_size=100, window=5, min_count=2, workers=4)


## Export Model

In [5]:
# Save the trained word vectors in word2vec text format
loaded_model.wv.save_word2vec_format("semcor_word2vec.txt", binary=False)

## Load the Model

In [6]:
# Load the saved vectors back as KeyedVectors
model = KeyedVectors.load_word2vec_format("semcor_word2vec.txt", binary=False)

In [7]:
# look up the vector for a word (100 dimensions, matching vector_size above)
model['coffee'].shape

(100,)

### Similarity

In [8]:
model.most_similar('pink')

[('Democratic', 0.9884400367736816),
 ('flat', 0.9873430728912354),
 ('silk', 0.9872323870658875),
 ('Hillsboro', 0.9867461919784546),
 ('cracked', 0.9863520264625549),
 ('German', 0.9863277077674866),
 ('cotton', 0.9862394332885742),
 ('Green', 0.9862380623817444),
 ('impulses', 0.9859849214553833),
 ('Light', 0.9852690696716309)]
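
Roughly speaking, `most_similar` scores every other word in the vocabulary by cosine similarity to the query and returns the top hits. The sketch below reproduces that idea in plain Python on a tiny hand-made vocabulary; the vectors and the helper `toy_most_similar` are invented for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(vectors, word, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Hand-made 3-dimensional vectors (assumption, for illustration only)
toy = {
    "pink": [1.0, 0.9, 0.0],
    "red": [1.0, 1.0, 0.1],
    "blue": [0.8, 0.2, 0.1],
    "coffee": [0.0, 0.1, 1.0],
}
ranking = toy_most_similar(toy, "pink")
print([word for word, _ in ranking])
```

When nearly every neighbour scores close to 1, as in the output above, the vectors have probably not separated well, which is plausible for a model trained on a corpus as small as SemCor.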

In [9]:
model.most_similar('coke')

[('imperial', 0.6293160319328308),
 ('Anyhow', 0.6182845234870911),
 ('against', 0.6167333126068115),
 ('envy', 0.6139751672744751),
 ('slick', 0.6133934259414673),
 ('Beckstrom', 0.6051849126815796),
 ('vegetable', 0.6042554378509521),
 ('balcony', 0.5981535911560059),
 ('under', 0.5966364145278931),
 ('performer', 0.5958552956581116)]

In [10]:
model.most_similar('banana')

[('few', 0.7637943029403687),
 ('serene', 0.7127214670181274),
 ('given', 0.7069243788719177),
 ('impetus', 0.705389678478241),
 ("bull's-eye", 0.7024915814399719),
 ('boxcar', 0.7004460692405701),
 ('industrialization', 0.6974120736122131),
 ('tact', 0.6971642374992371),
 ('excite', 0.679382860660553),
 ('pupil', 0.6764470338821411)]

In [11]:
model.most_similar('language')

[('character', 0.9852513670921326),
 ('activity', 0.9826105237007141),
 ('popular', 0.9822123646736145),
 ('powerful', 0.9818190932273865),
 ('mutual', 0.9816463589668274),
 ('composition', 0.9816451668739319),
 ('payment', 0.9804369211196899),
 ('anxiety', 0.9799080491065979),
 ('comparative', 0.9798842072486877),
 ('emotional', 0.9792215824127197)]

In [12]:
# "plant" has multiple meanings (e.g. a factory or vegetation)
model.most_similar("plant")

[('particle', 0.987708330154419),
 ('stockholders', 0.9854705333709717),
 ('load', 0.9850552678108215),
 ('thereby', 0.9849420785903931),
 ('transfer', 0.9849200248718262),
 ('blood', 0.9842037558555603),
 ('missile', 0.9841148257255554),
 ('safety', 0.9840459823608398),
 ('management', 0.9840391874313354),
 ('sewage', 0.9838245511054993)]

In [13]:
# only a negative example: returns the words least similar to 'banana'
model.most_similar(negative=['banana'])

[('Cowley', 0.3950935900211334),
 ('bristle', 0.389311820268631),
 ('dashes', 0.3562523424625397),
 ('Happened', 0.3276137113571167),
 ('sniff', 0.3256993889808655),
 ('forgit', 0.32367298007011414),
 ('asterisks', 0.3117685914039612),
 ('figger', 0.30062246322631836),
 ('photographing', 0.2846120297908783),
 ('pronouncements', 0.2839200496673584)]

In [14]:
#woman + king - man
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

offices: 0.9763


In [15]:
result = model.most_similar(positive=['boy', 'sun'], negative=['day'])
print("{}: {:.4f}".format(*result[0]))

clergyman: 0.9420


### Cosine Similarity

We talked about cosine similarity in the last class. Here we can conveniently use `distance` to get the cosine distance between two words. Note that distance = 1 - similarity, so a smaller distance means the two words are closer.
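
As a reminder, for two word vectors $u$ and $v$ the standard definitions are given below; the symbols $u$ and $v$ are introduced here only for illustration.

$$
\text{similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\text{distance}(u, v) = 1 - \text{similarity}(u, v)
$$

For example, with $u = (1, 0)$ and $v = (1, 1)$ the similarity is $1 / \sqrt{2} \approx 0.707$, so the distance is about $0.293$.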

In [16]:
w1 = "dog"
w2 = "cat"
w3 = "fruit"
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

# we would expect dog to be closer to cat than to fruit,
# but in this model the dog-fruit distance is actually the smaller one
w1_w2_dist, w1_w3_dist

(0.10395544767379761, 0.0691865086555481)

In [17]:
w1 = "happy" # synonym 1
w2 = "cheerful" # synonym 2
w3 = "sad" # antonym
w1_w2_dist = model.distance(w1, w2)
w1_w3_dist = model.distance(w1, w3)

# "happy" is closer to "sad" (its antonym) than to "cheerful" (its synonym)!
# this kind of similarity does not handle antonyms, since they tend to appear in similar contexts
w1_w2_dist, w1_w3_dist

(0.2682180404663086, 0.11207568645477295)

### Bias

One very important thing to keep in mind: NLP models are biased. Embeddings pick up the biases present in the text they are trained on, and that can be very harmful.

In [18]:
import pprint

pprint.pprint(model.most_similar(positive=['woman', 'worker'], negative=['man']))

[('engagement', 0.975019097328186),
 ('roles', 0.9749571681022644),
 ('accompanying', 0.9724823236465454),
 ('quack', 0.9722273349761963),
 ('dirty', 0.9716994166374207),
 ('main', 0.9716706275939941),
 ('column', 0.9713506698608398),
 ('original', 0.9712011814117432),
 ('maintaining', 0.9711843729019165),
 ('excitability', 0.9700174927711487)]


In [19]:
pprint.pprint(model.most_similar(positive=['man', 'worker'], negative=['woman']))

[('person', 0.9114266037940979),
 ('foolish', 0.9037262797355652),
 ('stretch', 0.9010775089263916),
 ('statement', 0.9004921913146973),
 ('sign', 0.8998662233352661),
 ('move', 0.8957900404930115),
 ('child', 0.8938729763031006),
 ('word', 0.8913879990577698),
 ('explain', 0.8908553719520569),
 ('play', 0.889857292175293)]


In [20]:
pprint.pprint(model.most_similar(positive=["woman", "doctor"], negative=["man"]))

[('Palace', 0.9634796977043152),
 ('early', 0.9596909284591675),
 ('sentence', 0.9547947645187378),
 ('professor', 0.9547070860862732),
 ('announced', 0.9539380669593811),
 ('following', 0.9529681205749512),
 ('reported', 0.952288031578064),
 ('morning', 0.9515778422355652),
 ('Class', 0.9515630602836609),
 ('Nation', 0.9512291550636292)]


### Analogy

In [21]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [22]:
analogy('bat','ball','stick')

'protect'

In [23]:
analogy('tall', 'taller', 'long')

'far'

In [24]:
analogy('good', 'fantastic', 'bad')

'chiefly'

In [25]:
analogy('bird', 'fly', 'human')

'power'

In [26]:
#which word in the list does not belong
print(model.doesnt_match("coke pepsi sprite water".split()))

coke
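
As far as I know, `doesnt_match` works by averaging the unit-normalized vectors of the given words and returning the word least similar to that average. Here is a minimal sketch of that idea in plain Python; the 2-dimensional vectors and the helper `toy_doesnt_match` are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group's mean."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # dot product of unit vectors equals cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hand-made vectors: three soft drinks close together, water apart (assumption)
toy = {
    "coke": [1.0, 0.1],
    "pepsi": [0.9, 0.2],
    "sprite": [0.95, 0.05],
    "water": [0.2, 1.0],
}
odd_one = toy_doesnt_match(toy, "coke pepsi sprite water".split())
print(odd_one)
```

On these toy vectors the odd one out is `water`. The trained model answers `coke`, which again hints that its vectors are noisy.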


## Export Model

In [27]:
import pickle

In [28]:
filename = 'glove_gensim_model.pkl'
# use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model, f)
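
To get a pickled model back, open the file in binary read mode and call `pickle.load`. The sketch below shows the full round trip, with a plain dictionary standing in for the model and a temporary directory so nothing is left on disk; both are assumptions for illustration. Gensim's `KeyedVectors` also has its own `save` and `load` methods, which may be the more natural choice. Only unpickle files you trust.

```python
import os
import pickle
import tempfile

# Plain dict standing in for the KeyedVectors object (assumption)
toy_model = {"coffee": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # write the object to disk
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)

    # read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```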