## word2vec

##### Introduced in 2013, word2vec is an unsupervised neural network approach to representing words as vectors. Given a text corpus, the algorithm outputs a vector representation of each word.

##### It uses the context (surrounding words) of a word to generate its representation.
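
##### To make "context" concrete, here is a minimal sketch of how (word, surrounding word) training pairs can be collected with a fixed window. The function name `context_pairs` and the example sentence are made up for illustration; the real word2vec training loop also varies the window size at random and subsamples frequent words, which this sketch leaves out.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence, not taken from the Gutenberg data
sentence = ["the", "fruit", "of", "the", "vine"]
pairs = context_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

##### A larger window produces more pairs per word, so each word is described by a broader neighbourhood.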


##### We start by loading gensim, a widely used Python library for topic modelling and NLP applications. I have already downloaded the Gutenberg data through NLTK. It is a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books.

In [20]:
import gensim
import nltk; nltk.download("gutenberg")
nltk.download('punkt')
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/tobychappell/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tobychappell/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


##### Next we extract the sentences from the data and train a word2vec model on them. Training takes a long time, so I save the model and load it for future use. Notice how easy it is to run this model!

In [21]:
sentences = gutenberg.sents()
model = gensim.models.Word2Vec(sentences, min_count=1)
model.save('gutenberg_model')

##### I have saved the model as gutenberg_model. Now we can load it and look at the vector representations of different words.

In [22]:
model = gensim.models.Word2Vec.load('gutenberg_model')

##### By default, the model generates a 100-dimensional vector for each word. These values are the weights the neural network learned for that word. Because the representation is built from the contexts in which the word appears across all documents, we can also find which words it is similar to.

In [23]:
print(model.wv['fruit'])

[-0.6397712   0.9403682   0.16476774 -0.003929    0.6514387  -0.2389871
 -0.32440186  0.03787065  0.13923813 -0.12836191  0.25099906  0.50155777
 -0.9492648   0.05303491  0.69198805  0.08257672 -0.37582156  0.6737136
 -0.3205712   0.59403425  1.5550497  -1.3485568   0.62119     0.69504344
  0.8442935   0.9219864   0.57053435 -0.06519959 -0.54377174  0.27061206
 -0.12010787  0.5428703   0.4591472  -0.56457496 -0.09456757 -0.01166055
  0.04791519 -0.00829506 -0.6296058  -0.2345837  -0.73390454  0.48389575
 -0.21756802 -0.0083403  -0.6030183   0.936554    0.79389894  0.5300423
  0.15853697 -0.15191227 -0.16740079 -0.89003474  0.8811316   0.1436148
  0.22647502  0.6096216   0.4202352  -0.00232328  0.34038994  1.4338999
 -0.71681017  0.8050907  -0.15280347 -0.1471908  -0.01021012 -0.360948
  0.21517235 -0.8977882  -0.86317843 -0.30080226  0.18384501 -0.07696274
  0.742148    0.38857204  0.4275953  -1.0878211   0.8870503  -0.15645728
 -0.34996212  0.9724159  -0.60275555 -0.11190093  0.546531

In [24]:
model.wv.most_similar("fruit")

[('vine', 0.7773963212966919),
 ('bread', 0.7728155255317688),
 ('corn', 0.769756555557251),
 ('field', 0.7633217573165894),
 ('blood', 0.7580446004867554),
 ('meat', 0.7436333298683167),
 ('skin', 0.742735743522644),
 ('light', 0.7320427298545837),
 ('oil', 0.7297345399856567),
 ('flesh', 0.7287917137145996)]
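
##### The second value in each pair is a cosine similarity between two word vectors. The sketch below computes that measure by hand. The 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (assumption: NOT the model's real embeddings)
fruit = [0.9, 0.1, 0.3]
vine = [0.8, 0.2, 0.4]
stone = [-0.5, 0.9, -0.2]

print(cosine_similarity(fruit, vine))   # vectors pointing the same way score near 1
print(cosine_similarity(fruit, stone))  # vectors pointing apart score lower
```

##### The score depends only on the direction of the vectors, not on their length, and ranges from -1 to 1.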

##### Notice how fruit is similar to vine, corn, bread, etc., given the context of several books in Project Gutenberg. We can also add or subtract the vectors of words to surface other words. In the following example, we look for words similar to fruit and corn while moving away from blood, which pushes the results away from the flesh and meat sense.

In [25]:
model.wv.most_similar(positive=['fruit',"corn"],negative=["blood"])

[('mulberry', 0.7614861726760864),
 ('forest', 0.7536096572875977),
 ('plum', 0.7491604089736938),
 ('salt', 0.7467324733734131),
 ('vine', 0.7464162707328796),
 ('flower', 0.7348028421401978),
 ('fir', 0.7268661260604858),
 ('desert', 0.7267394065856934),
 ('wild', 0.7252177000045776),
 ('soil', 0.7245904207229614)]
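
##### A rough sketch of the idea behind a positive/negative query: add the unit-length vectors of the positive words, subtract those of the negative words, then rank every other word by cosine similarity to the result. This is a simplified stand-in, not gensim's actual code; `toy_most_similar` and the 2-dimensional vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative, topn=3):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:                      # positive words pull the query towards them
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:                      # negative words push it away
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)    # never return the query words themselves
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: axis 0 is roughly "plant-like", axis 1 roughly "body-like"
toy = {
    "fruit": [0.7, 0.7],
    "corn": [0.9, 0.3],
    "blood": [0.1, 1.0],
    "plum": [1.0, 0.0],
    "flesh": [0.2, 0.9],
}
result = toy_most_similar(toy, positive=["fruit", "corn"], negative=["blood"])
print(result)
```

##### In this toy setup, subtracting blood cancels most of the "body-like" component, so plum ranks above flesh.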

##### Here's a model trained on a very small data set: 2 sentences with 3 words each. Notice how I can change the length of the output vector using size (renamed vector_size in gensim 4.0 and later), and how many surrounding words count as context using window. The resulting similarities are not meaningful, which shows that word2vec needs a lot of text to work.

In [26]:
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
# size is the gensim 3.x name; on gensim 4.0 and later, pass vector_size=3 instead
model2 = Word2Vec(sentences, min_count=1, size=3, window=3)

In [27]:
model2.wv['meow']

array([-0.00670908, -0.07993935, -0.13842449], dtype=float32)

In [28]:
model2.wv.most_similar("meow")

[('woof', 0.3258308172225952),
 ('dog', -0.43794459104537964),
 ('cat', -0.7971633672714233),
 ('say', -0.7983071208000183)]