<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width="500" height="450">
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (FPMI), MIPT</b></h3>

---

<h2 style="text-align: center;"><b>Word embeddings</b></h2>

Libraries we will need today (you will also need the first two of them later in the course):

In [None]:
!pip install --upgrade nltk gensim bokeh

**NLTK** (Natural Language Toolkit) -- a library with many tools for working with natural language (lemmatization, tokenization, etc.)

https://www.nltk.org

For now we will use it for tokenization.

In [2]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

quora = list(open("./quora.txt"))
# look at the question at index 10 of the dataset
print(tokenizer.tokenize(quora[10]))

['Which', 'brand', 'should', 'go', 'with', 'the', 'GTX', '960', 'graphic', 'card', ',', 'MSI', ',', 'Zotac', 'or', 'ASUS', '?']
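
To make the tokenizer's behaviour concrete: `WordPunctTokenizer` splits text into runs of alphanumeric characters and runs of punctuation. The sketch below reproduces that rule with the standard `re` module. `simple_word_punct_tokenize` is a hypothetical helper written for this illustration, not part of NLTK.

```python
import re

# Either a run of word characters, or a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_word_punct_tokenize(text):
    """Return alphanumeric tokens and punctuation tokens, in order."""
    return _TOKEN_PATTERN.findall(text)


tokens = simple_word_punct_tokenize(
    "Which brand should go with the GTX 960 graphic card, MSI, Zotac or ASUS?"
)
print(tokens)
```

Note that this rule also splits contractions: `can't` becomes `can`, `'`, `t`.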


Let's tokenize all the texts from the quora.txt file and get a list of token lists:

In [0]:
quora_tokenized = [tokenizer.tokenize(line.lower()) for line in quora]

There are many different models that can be trained to get word embeddings (Word2Vec, GloVe, FastText, etc.). For now let's train Word2Vec.

The **gensim** library will help us with that -- it provides tools for training Word2Vec models and for exploring the resulting vectors.

In [0]:
from gensim.models import Word2Vec
# define a model (just like a torch model) with parameters
# and train it on our data
# note: gensim 4 renamed the `size` argument to `vector_size`
model = Word2Vec(quora_tokenized, # data for the model to train on
                 vector_size=32,  # embedding vector size
                 min_count=5,     # consider only words that occurred at least 5 times
                 window=3).wv     # define context as a 3-word window around the target word
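
To illustrate what `window=3` means, the sketch below lists the (target, context) pairs that a fixed window of three tokens on each side yields for one sentence. It is a simplified illustration with a hypothetical helper `context_pairs`; actual Word2Vec training in gensim also subsamples frequent words and randomly shrinks the window, which is omitted here.

```python
def context_pairs(tokens, window=3):
    """Return (target, context) pairs for a symmetric window around each token."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)                # clip at the sentence start
        end = min(len(tokens), i + window + 1)    # clip at the sentence end
        for j in range(start, end):
            if j != i:                            # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["which", "brand", "should", "go", "with", "the", "gtx"]
pairs = context_pairs(sentence, window=3)

# context words seen for the target "go": up to three tokens on either side
go_contexts = [context for target, context in pairs if target == "go"]
print(go_contexts)
```

Words near the sentence boundary simply get fewer context words.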

Now we have a trained model and can play with it:

In [0]:
# we can get a vector for any word in our vocabulary
model.get_vector('tv')

In [0]:
# and get words that have most similar vectors to vector of the given word
model.most_similar('tv')
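
`most_similar` ranks words by the cosine similarity between their vectors. Here is a minimal sketch of that ranking on toy 3-dimensional vectors; the words, the values, and the helper `most_similar_toy` are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical toy embeddings
toy_vectors = {
    "tv": [0.9, 0.1, 0.0],
    "television": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar_toy("tv", toy_vectors)
print(ranking)
```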

Nice!

But for bigger tasks such as text classification we would like to have better vectors. Training them ourselves would require a LARGE text corpus and a LARGE amount of memory and time. As we are poor students who need to finish our homework by next week, we don't have any of that.

That is why pre-trained models, and even collections of pre-computed word vectors, exist!

In [0]:
import gensim.downloader as api
model = api.load('glove-twitter-100') # list of available models: https://github.com/RaRe-Technologies/gensim-data#models

In [0]:
# who is a person who has money and related to coding, but is really stupid?
# (it's how a model trained on text corpus thinks, not me)
model.most_similar(positive=["coder", "money"], negative=["brain"])
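
Conceptually, a query with `positive` and `negative` lists combines the word vectors into a single query vector and then ranks the vocabulary by cosine similarity to it. The sketch below shows the idea on toy 2-dimensional vectors: it adds unit-normalized positive vectors, subtracts unit-normalized negative ones, and skips the query words themselves. The vectors and the helper `analogy_toy` are assumptions made for this example, and gensim's implementation may differ in details.

```python
import numpy as np

# hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king": np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man": np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.7, 0.1]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def analogy_toy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(vectors[w]) for w in positive)
    query = query - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    exclude = set(positive) | set(negative)   # do not return the query words
    scores = [
        (word, float(unit(vec) @ query))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy_toy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)
```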

With gensim we can also load pre-trained vectors from a vectors file. For example, we can load the pre-trained Google word2vec vectors:

http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

In [0]:
# from gensim.models import KeyedVectors
# model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True) 

## Visualization

Now let's visualize our embeddings in 2-d space.

What we need is:
1. get embeddings for a batch of words from the gensim model
2. reduce dimensionality of those embeddings from 100-dim to 2-dim vectors
3. normalize those vectors
4. draw it on a plane!

So let's do that:

In [0]:
# let's take the most frequent words from the model:
n_words = 1000
# gensim 4 removed `model.vocab`; `index_to_key` lists the vocabulary, which for
# these pre-trained vectors is assumed to be ordered from most to least frequent
words = model.index_to_key[:n_words]

print(words[::10])

In [0]:
# get embeddings for those words
word_vectors = [model.get_vector(word) for word in words]

To visualize the vectors we need to reduce their dimensionality to 2 (or 3). We'll use **PCA** for that:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(word_vectors)
word_vectors_pca = pca.transform(word_vectors)
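
For intuition, PCA can be sketched in a few lines of NumPy: center the data, take its singular value decomposition, and project onto the top right-singular vectors. This is a simplified illustration on random data (the array `X` and the helper `pca_project` are assumptions for the example), not a replacement for the sklearn class; the projections should agree with sklearn's up to the sign of each component.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal directions."""
    X_centered = X - X.mean(axis=0)                  # PCA works on centered data
    # rows of Vt are principal directions, sorted by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    return X_centered @ components.T


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))    # 50 hypothetical "words", 100-dim vectors
X_2d = pca_project(X, n_components=2)
print(X_2d.shape)
```

The first output column carries the most variance and the second the next most, which is why the 2-d picture is the best linear view of the data, yet can still discard a lot of structure.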

Let's also normalize the vectors we got from PCA so that the plot is easier to read.

We already implemented normalization by hand in one of the first homeworks of this course, but this time let's use the sklearn API:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [0]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler().fit(word_vectors_pca)
word_vectors_pca = ss.transform(word_vectors_pca)
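
`StandardScaler` shifts each column to zero mean and scales it to unit variance, that is `z = (x - mean) / std`, using the population standard deviation. Here is a standard-library sketch for a single column; the values and the helper `standardize` are made up for illustration.

```python
import statistics


def standardize(column):
    """Shift a list of numbers to zero mean and scale to unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)   # population std (divides by n)
    return [(x - mean) / std for x in column]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, std 2
scaled = standardize(column)
print(scaled)
```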

Finally, let's draw an (interactive!) embedding space using the function below:

In [0]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig


draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

Well, as we can see, this space does not look perfect: sometimes words that seem unrelated end up grouped together.

That happens mostly because our dimensionality reduction method (PCA) is not perfect: it is a linear projection, and a lot of structure is lost when going from 100 dimensions to 2. Let's try a different algorithm: t-SNE.

In a sense, t-SNE builds low-dimensional embeddings of its own, so we may hope that it will work better.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

https://distill.pub/2016/misread-tsne/

In [0]:
import numpy as np
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=50)
# sklearn's TSNE has no separate transform() method: use fit_transform()
word_vectors_tsne = tsne.fit_transform(np.array(word_vectors))

ss = StandardScaler().fit(word_vectors_tsne)
word_vectors_tsne = ss.transform(word_vectors_tsne)

In [0]:
output_notebook()
draw_vectors(word_vectors_tsne[:, 0], word_vectors_tsne[:, 1], token=words)