In this notebook we learn how to use Word2Vec (imported below as `w2v`) and the pretrained vectors released by the GloVe project.

**Remark:** If you want to train your own GloVe model, you should either use one of the Python wrappers (I haven't worked with them, so I cannot tell you how bug-free they are) or train using the original implementation. See their [GitHub](https://github.com/stanfordnlp/GloVe) for details.

# Word2Vec

Word2Vec is part of the gensim package, and it is only one of many possible embeddings. You can find the documentation [here](https://radimrehurek.com/gensim/models/word2vec.html). We import the module that contains the class:

In [3]:
import gensim.models.word2vec as w2v

To create a model we need to feed it a list of sentences. The arguments in the signature that matter most for us are:

- sentences: a list of sentences, where each sentence is a list of tokens (words).
- size: the embedding dimension, default=100. (gensim 4.0 and later call this argument `vector_size`.)
- window: the number of words considered before and after the current word, default=5. (In my experience 8 works better.)
- min_count: ignores words that appear fewer than this number of times, default=5.
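
To get a feeling for what `window` and `min_count` do, here is a minimal pure-Python sketch of how (word, context word) training pairs could be generated. It is a simplification and not gensim's actual code: real implementations add further details (such as subsampling of frequent words) that are skipped here. The function name `context_pairs` and the toy sentences are made up for illustration.

```python
from collections import Counter


def context_pairs(sentences, window=5, min_count=5):
    """Return (center, context) pairs in the spirit of word2vec training data.

    Hypothetical helper for illustration only, not part of gensim.
    """
    # Count how often every token appears in the whole corpus
    counts = Counter(token for sentence in sentences for token in sentence)
    pairs = []
    for sentence in sentences:
        # Rare words are dropped before the window is applied
        kept = [token for token in sentence if counts[token] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


# Toy corpus, invented for this example
toy = [["the", "happy", "prince"], ["the", "prince", "wept"]]
pairs = context_pairs(toy, window=1, min_count=2)
print(pairs)
```

With `min_count=2` the words "happy" and "wept" are dropped (they appear only once), so "the" and "prince" become neighbours in both sentences. A larger `window` produces more pairs per word, and a larger `min_count` shrinks the vocabulary.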


We use the same data as last time:

In [7]:
books=[open('./data/dorian.txt','r'),open('./data/earnest.txt','r'),
       open('./data/essays.txt','r'),open('./data/ghost.txt','r'),
       open('./data/happy_prince.txt','r'),open('./data/house_pomegranates.txt','r'),
       open('./data/ideal_husband.txt','r'),open('./data/intentions.txt','r'),
       open('./data/lady_windermere.txt','r'),open('./data/profundis.txt','r'),
       open('./data/salome.txt','r'),open('./data/soul_of_man.txt','r'),
       open('./data/woman_of_no_importance.txt','r')]

Join the books into one large corpus:

In [8]:
corpus = " ".join([book.read() for book in books])

Then use `sent_tokenize` from the nltk library to split the corpus into sentences.

In [9]:
from nltk import sent_tokenize

In [10]:
raw_sentences = sent_tokenize(corpus)

We now turn each raw sentence into a list of tokens, using the word tokenizer from nltk.

In [16]:
from nltk import word_tokenize

In [17]:
# Tokenize every raw sentence into a list of words
sentences = [word_tokenize(sentence) for sentence in raw_sentences]

Now we can train our first model by feeding it the sentences:

In [20]:
model_1=w2v.Word2Vec(sentences=sentences,size=40,window=8)

In [23]:
model_1.most_similar([model_1['woman']-model_1['man']+model_1['Queen']])

[('PAGE', 0.7386135458946228),
 ('VOICE', 0.7363929152488708),
 ('PLACE', 0.7285578846931458),
 ('JOKANAAN', 0.7023457288742065),
 ('THE', 0.6995697021484375),
 ('CAPPADOCIAN', 0.6701943278312683),
 ('SYRIAN', 0.6563177108764648),
 ('YOUNG', 0.6562599539756775),
 ('HOUSE', 0.6462533473968506),
 ('OTHER', 0.6446676254272461)]

In [27]:
len(sentences)

32195

We now create another model, this time using only the first 15,000 sentences and the default size and window:

In [28]:
model_2=w2v.Word2Vec(sentences=sentences[:15000],size=100,window=5)

In [29]:
model_2.most_similar([model_2['woman']-model_2['man']+model_2['Queen']])

[('PGLAF', 0.7106172442436218),
 ('Brandon', 0.6801990866661072),
 ('ON', 0.6584423184394836),
 ('indicate', 0.6534081697463989),
 ('occur', 0.6410136222839355),
 ('OUT', 0.6010198593139648),
 ('HIM', 0.5976551175117493),
 ('Defects', 0.5816824436187744),
 ('linked', 0.5683529376983643),
 ('agree', 0.5644705891609192)]

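You can always keep training the model. Try running the following cell several times and note how the most similar words change. Keep in mind that the vocabulary was fixed when the model was created, so words that only appear in the additional sentences are ignored. (Recent gensim versions also require `total_examples` and `epochs` to be passed to `train`.)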

In [38]:
model_2.train(sentences)
model_2.most_similar([model_2['woman']-model_2['man']+model_2['Queen']])

[('horrid', 0.49167507886886597),
 ('pretty', 0.46648675203323364),
 ('evening', 0.4537237882614136),
 ('wore', 0.43760383129119873),
 ('inquired', 0.41685187816619873),
 ('Is', 0.38669198751449585),
 ('heard', 0.3799837827682495),
 ('sweet', 0.3729308247566223),
 ('sang', 0.3709564805030823),
 ('heavens', 0.36998653411865234)]

There are many pretrained word2vec models out there. In what follows we load the Google News one.

In [2]:
from gensim.models.keyedvectors import KeyedVectors

The following line may take a while, since it loads about 3 GB into memory.

In [41]:
model_3=KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [65]:
model_3.most_similar([model_3['Queen']-model_3['woman']+model_3['man']])

[('Queen', 0.8382776975631714),
 ('King', 0.5479298830032349),
 ('Mayfair_London_W1J', 0.5369688272476196),
 ('Queen_Elizabeth', 0.5142285823822021),
 ('Rockabilly_Wanda_Jackson', 0.5023638010025024),
 ('Mitzi_Sister', 0.4933725595474243),
 ('Lancashire_Regiment_QLR', 0.48999619483947754),
 ('Mean_Lisa_Lampanelli', 0.48064759373664856),
 ('Beehive_Hairdo', 0.48026183247566223),
 ('Osie_Ukwuoma', 0.47888004779815674)]

But we can ask for more complex analogies:

In [43]:
def a_b_c(a,b,c):
    return model_3.most_similar([model_3[a]-model_3[b]+model_3[c]])

In [51]:
a_b_c('light','lamp','water')

[('water', 0.6023102402687073),
 ('sewage', 0.4261908531188965),
 ('turbid_water', 0.42406660318374634),
 ('brackish_groundwater', 0.42278411984443665),
 ('Water', 0.4190317392349243),
 ('Floridan_aquifer', 0.4161131978034973),
 ('CRMWD', 0.4132806062698364),
 ('aquifers', 0.4094894528388977),
 ('light', 0.4086752235889435),
 ('groundwater', 0.40861862897872925)]

In [67]:
a_b_c('petal','flower','leaf')

[('petal', 0.7127969861030579),
 ('leaf', 0.7053060531616211),
 ('petals', 0.5032273530960083),
 ('petiole', 0.47272592782974243),
 ('circumferentially', 0.4649199843406677),
 ('lobed', 0.45781558752059937),
 ('nanorod', 0.4557533860206604),
 ('fernlike', 0.4547785222530365),
 ('sapwood', 0.45389315485954285),
 ('leaf_axils', 0.4537096619606018)]

In [58]:
model_3.most_similar([model_3['human']-model_3['animal']])

[('human', 0.3660902976989746),
 ('mankind', 0.2857176959514618),
 ('humankind', 0.27630895376205444),
 ('humanity', 0.27607446908950806),
 ('macrocosm', 0.2735801935195923),
 ('profound', 0.2719401717185974),
 ('constrains', 0.2704422175884247),
 ('crossword_puzzling', 0.2655083239078522),
 ('littleness', 0.2646711766719818),
 ('perfections', 0.26055628061294556)]
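
In several of the outputs above the query words themselves come back as top hits ('Queen', 'water', 'petal' and 'leaf'). That happens because we pass a raw vector, so `most_similar` has no way of knowing which words it was built from. Passing the words themselves, as in `model_3.most_similar(positive=[a, c], negative=[b])`, excludes them from the results (the scores differ slightly because gensim normalises the vectors first). The pure-Python sketch below illustrates the same idea on made-up 2-d vectors; `cosine`, `analogy` and `toy_vectors` are hypothetical names for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c, topn=3):
    """Rank words by similarity to vectors[a] - vectors[b] + vectors[c].

    The three query words are skipped so they cannot dominate the result.
    """
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scores = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors: first coordinate ~ "royalty", second ~ "maleness"
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.2, 0.8],
    "woman": [0.2, 0.2],
    "apple": [-0.7, 0.1],
}

result = analogy(toy_vectors, "king", "man", "woman")
print(result)
```

On these toy vectors 'queen' is ranked first and 'king' never appears, because the query words are filtered out before sorting.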

# GloVe

The pretrained GloVe vectors come as a plain text file: each line contains a word, followed by a space, followed by the components of its vector separated by spaces.

In [8]:
import numpy as np

In [9]:
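def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    counter=0
    # The with block closes the file even when we break out of the loop early
    with open(gloveFile,'r',encoding='utf-8') as f:
        for line in f:
            # Each line is: word value_1 value_2 ... value_d
            splitLine = line.split()
            word = splitLine[0]
            embedding = [float(val) for val in splitLine[1:]]
            model[word] = np.array(embedding)
            counter+=1
            # Stop after the first 200,001 lines to limit memory use
            if counter>200000:
                break
    print("Done.",len(model)," words loaded!")
    return model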

In [10]:
model_4=loadGloveModel('./data/glove.42B.300d.txt')

Loading Glove Model
Done. 200001  words loaded!


In [12]:
model_4['king'].shape

(300,)

Then you can use the usual techniques, such as cosine similarity, to find distances between words, although implementing them by hand is a bit of a pain.
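
As a rough idea of what such a hand-written search could look like, here is a minimal numpy sketch that ranks the words of a dictionary like `model_4` by cosine similarity to a query vector. The function name `nearest_words` and the toy vectors are assumptions made for this example, and the code assumes that no vector is all zeros (otherwise the normalisation divides by zero).

```python
import numpy as np


def nearest_words(model, query, topn=5, exclude=()):
    """Return the topn (word, cosine similarity) pairs closest to query.

    model is a dict mapping each word to a 1-d numpy array.
    Hypothetical helper for illustration only.
    """
    words = [word for word in model if word not in exclude]
    # Stack the vectors into one matrix and scale every row to unit length
    matrix = np.stack([model[word] for word in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    # For unit vectors the dot product is the cosine similarity
    sims = matrix @ query
    best = np.argsort(-sims)[:topn]
    return [(words[i], float(sims[i])) for i in best]


# Invented 2-d vectors standing in for the real 300-d GloVe vectors
toy_model = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}

result = nearest_words(toy_model, toy_model["cat"], topn=2, exclude=("cat",))
print(result)
```

With the real data a call would look like `nearest_words(model_4, model_4['king'] - model_4['man'] + model_4['woman'], exclude=('king', 'man', 'woman'))`. This sketch rebuilds the matrix on every call, which is wasteful for 200,000 words; in practice you would build the normalised matrix once and reuse it.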