**Seminar 1: Fun with Word Embeddings**
Today we are gonna play with word embeddings: train our own little embeddings, load a pre-trained one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is gonna happen on top of a dataset of Quora questions.

Requirements (only if you're running locally): pip install --upgrade nltk gensim bokeh

In [None]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt # questions on quora (people ask)
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

In [14]:
with open('quora.txt', encoding="utf-8") as f:
    data = list(f)

data[10]

'Which brand should go with the GTX 960 graphic card, MSI, Zotac or ASUS?\n'

In [17]:
len(data) # number of lines (questions) in our data

537272

**Tokenization**: a typical first step for an NLP task is to split raw data into words. The text we're working with is in raw format, with all the punctuation and emoticons attached to some words, so a simple str.split won't do.

Let's use nltk - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.
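To see why plain str.split falls short, here is a small sketch that compares it with a regular expression splitting runs of word characters from runs of punctuation. The pattern below is, as far as I know, the one WordPunctTokenizer is built on, but treat that as an assumption; the helper name `word_punct_tokenize` is made up for this example.

```python
import re

# Assumed to mirror nltk's WordPunctTokenizer: runs of word characters
# or runs of non-space punctuation. The helper name is hypothetical.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)

line = "What is the ZIP code of India?"
split_tokens = line.split()               # punctuation stays glued: 'India?'
regex_tokens = word_punct_tokenize(line)  # punctuation becomes its own token

print(split_tokens)
print(regex_tokens)
```

With str.split the last token is 'India?', which would count as a different word from 'India'; the regex version keeps the question mark separate.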



In [67]:
import numpy as np
from nltk.tokenize import  WordPunctTokenizer

tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[11]))

['What', 'is', 'the', 'ZIP', 'code', 'of', 'India', '?']


In [27]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(i.lower()) for i in data]

In [32]:
data_tok[11]

['what', 'is', 'the', 'zip', 'code', 'of', 'india', '?']

In [29]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [40]:
print([' '.join(row) for row in data_tok[:1]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?"]


**Word vectors**: as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which builds word embeddings from character n-grams.

The choice is huge, so let's start someplace small: gensim is another NLP library that features many vector-based models, including word2vec.

In [42]:
from gensim.models import Word2Vec

model = Word2Vec(
    data_tok,
    vector_size=32, # embedding vector size
    window=5, # define context as a 5-word window around the target word
    min_count=5 # keep only words that occurred at least 5 times (rarer words are dropped)
).wv
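
To make the effect of min_count concrete, here is a rough sketch of the frequency filter on a toy corpus. It only imitates the idea (count tokens, drop the rare ones); it is not gensim's actual vocabulary code, and `filter_vocab` and `toy_corpus` are names invented for the illustration.

```python
from collections import Counter

def filter_vocab(tokenized_lines, min_count):
    """Keep tokens that occur at least `min_count` times across all lines."""
    counts = Counter(tok for line in tokenized_lines for tok in line)
    return {tok for tok, n in counts.items() if n >= min_count}

# Tiny made-up corpus, already lowercased and tokenized.
toy_corpus = [
    ["what", "is", "the", "zip", "code", "?"],
    ["what", "is", "the", "best", "code", "editor", "?"],
    ["is", "the", "zip", "format", "safe", "?"],
]

vocab = filter_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
```

Words seen only once ('best', 'editor', 'format', 'safe') fall out of the vocabulary, so the model never learns a vector for them.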

In [43]:
# now you can get word vectors !

model.get_vector('everything')

array([-1.5262333 ,  2.1145294 ,  0.81044066,  3.0935519 ,  1.2803497 ,
        1.9093763 , -1.45781   , -5.0202804 , -0.02520465,  0.9337871 ,
        1.4796574 ,  0.30828473,  1.1404164 ,  0.33315268,  2.5724516 ,
       -2.7664094 ,  0.8886677 , -1.3391675 , -0.2556421 , -0.67231786,
       -1.3501204 , -0.29044598, -0.43982   , -1.4151559 , -0.18784814,
       -0.94874924,  0.08696299,  1.6951393 ,  1.0397223 ,  0.2070252 ,
       -0.6495235 ,  0.25012478], dtype=float32)

In [44]:
# or query similar words directly. Go play with it!

model.most_similar('bread')

[('rice', 0.9542890787124634),
 ('sauce', 0.9335342645645142),
 ('butter', 0.9222909212112427),
 ('cheese', 0.9192749857902527),
 ('beans', 0.9166175127029419),
 ('fruit', 0.909808874130249),
 ('pasta', 0.9089231491088867),
 ('potato', 0.899134635925293),
 ('wine', 0.8983882665634155),
 ('vodka', 0.8936192989349365)]
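
The scores above are cosine similarities. As a minimal sketch of what such a query boils down to, the snippet below ranks words in a tiny hand-made table; the 2-D vectors and the helper name `nearest_words` are assumptions for illustration, not anything taken from the trained model.

```python
import numpy as np

def nearest_words(word, words, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    order = np.argsort(-sims)  # highest similarity first
    return [(words[i], float(sims[i])) for i in order if words[i] != word][:topn]

# Hand-made 2-D vectors, purely illustrative.
words = ["bread", "butter", "rice", "laptop"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.4], [0.0, 1.0]])

neighbours = nearest_words("bread", words, vectors)
print(neighbours)
```

Because only the angle between vectors matters, a long and a short vector pointing the same way count as identical.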

In [45]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300') # 300-dim vectors for each token



In [None]:
vec_everything = wv['everything']
#vec_everything

In [47]:
wv.most_similar('bread')

[('butter', 0.6417260766029358),
 ('rye_sourdough', 0.6290417313575745),
 ('breads', 0.6243128180503845),
 ('loaf', 0.6184971332550049),
 ('flour', 0.615212619304657),
 ('baladi_bread', 0.6061378121376038),
 ('loaves', 0.6045446991920471),
 ('raisin_bread', 0.5843341946601868),
 ('stale_bread', 0.5802395343780518),
 ('wheaten_flour', 0.5785929560661316)]

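As we can see, the model trained on the much larger Google News corpus returns neighbours that are more specific to bread: **the larger the training corpus, the better**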

**Using a pre-trained model**
That took a while, huh? Now imagine training life-sized (100~300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [49]:
import gensim.downloader as api

model = api.load('glove-twitter-100')



In [None]:
#api.info() # lists all corpora and models available through gensim's downloader

In [66]:
model.most_similar(positive=["queen", 'man'], negative=["woman"])

[('king', 0.6708807945251465),
 ('aka', 0.6319987177848816),
 ('fan', 0.6134569048881531),
 ('rock', 0.6048598289489746),
 ('sorry', 0.5993564128875732),
 ('song', 0.5918704271316528),
 ('jessie', 0.5864764451980591),
 ('boy', 0.5861239433288574),
 ('punk', 0.5848831534385681),
 ("'s", 0.5833798050880432)]
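
The positive/negative query is vector arithmetic: add the positive words, subtract the negative ones, and look for the nearest remaining word. Below is a rough sketch of that idea on hand-made 3-D vectors. My understanding is that gensim also normalizes the input vectors and leaves the query words out of the results, but treat those details, the toy vectors, and the helper name `analogy` as assumptions.

```python
import numpy as np

def analogy(positive, negative, words, vectors, topn=1):
    """Return the words closest (by cosine) to sum(positive) - sum(negative)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    idx = {w: i for i, w in enumerate(words)}
    query = sum(unit[idx[w]] for w in positive) - sum(unit[idx[w]] for w in negative)
    sims = unit @ (query / np.linalg.norm(query))
    exclude = set(positive) | set(negative)  # do not return the query words
    ranked = [words[i] for i in np.argsort(-sims) if words[i] not in exclude]
    return ranked[:topn]

# Toy axes: (royalty, maleness, personhood). Purely illustrative.
words = ["king", "queen", "man", "woman", "apple"]
vectors = np.array([
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, -1.0, 1.0],
    [0.1, 0.0, -1.0],
])

result = analogy(["queen", "man"], ["woman"], words, vectors)
print(result)
```

In real embeddings the directions are nowhere near this clean, which is why the list above contains plenty of noise after 'king'.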

**Visualizing phrases**
Word embeddings can also be used to represent short phrases. The simplest way is to take **an average of the vectors** of all tokens in the phrase, optionally with per-token weights.

This trick is useful for getting a feel for the data you are working with: checking whether there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!

In [132]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. add up word vectors for all words in tokenized phrase
    #    (note: this is a sum, i.e. the average times the number of known words;
    #    cosine similarity ignores that scale factor)
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    vector = np.zeros([model.vector_size], dtype='float32')
    tokens = tokenizer.tokenize(phrase.lower())
    for word in tokens:
        if word in model.key_to_index:  # skip out-of-vocabulary tokens
            vector += model.get_vector(word)

    return vector

In [134]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")
vector[::10]

array([  3.8168845 ,  -0.3069805 ,   1.1199516 ,  -1.2026184 ,
       -12.334427  ,  -1.994626  ,   0.61000896,   2.1587763 ,
        16.44223   ,   1.038716  ], dtype=float32)
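
The function above adds the word vectors up without dividing by their count, so it returns a sum rather than a mean. That makes no difference for cosine similarity, but if you need a true average (for plotting or distance-based clustering, say), a variant could look like the sketch below. It uses a tiny made-up word table and a regex tokenizer in place of the real model and nltk, so `toy_vectors` and `mean_phrase_embedding` are hypothetical names.

```python
import re
import numpy as np

# Hypothetical stand-in for the pre-trained model: a tiny word -> vector table.
toy_vectors = {
    "best": np.array([1.0, 0.0], dtype="float32"),
    "programming": np.array([0.0, 2.0], dtype="float32"),
    "language": np.array([2.0, 4.0], dtype="float32"),
}

def mean_phrase_embedding(phrase, vectors, dim=2):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    tokens = re.findall(r"\w+|[^\w\s]+", phrase.lower())
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    return np.mean(known, axis=0)

print(mean_phrase_embedding("Best programming language?", toy_vectors))
print(mean_phrase_embedding("AALKFLKSAJFJL", toy_vectors))
```

The '?' token is not in the table, so it is skipped and the average is taken over the three known words only.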

In [142]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = [get_phrase_embedding(phrase) for phrase in chosen_phrases]
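
A quick sanity check on how many phrases the slice keeps: it takes every (len(data) // 1000)-th line, so the count comes out at roughly a thousand. The sketch below just redoes that arithmetic with the line count reported earlier.

```python
n_lines = 537272          # len(data) reported earlier in the notebook
step = n_lines // 1000    # stride used in data[::step]
n_chosen = len(range(0, n_lines, step))  # same length as data[::step]

print(step, n_chosen)  # prints: 537 1001
```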

In [149]:
query = get_phrase_embedding('What is the best programming language?')

In [153]:
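# without np.nan_to_num, phrases with no in-vocabulary tokens get a zero vector and therefore a NaN cosine:
#cosines = [vec @ query / np.linalg.norm(vec) / np.linalg.norm(query) for vec in phrase_vectors]
cosines = [np.nan_to_num(vec @ query / np.linalg.norm(vec) / np.linalg.norm(query)) for vec in phrase_vectors]
# np.argsort puts NaN last, so junk lines like 'AALKFLKSAJFJL' would show up as top matches; nan_to_num maps them to 0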
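
Here is a small sketch of why the np.nan_to_num wrapper matters. A phrase with no known tokens gets a zero vector, its cosine becomes 0/0 = NaN, and np.argsort places NaN at the end, which is exactly where the best matches are read from. The three toy vectors are made up for the demonstration.

```python
import numpy as np

query = np.array([1.0, 0.0])
# The middle entry plays the role of a phrase with no in-vocabulary tokens.
toy_phrase_vectors = [np.array([0.5, 0.5]), np.zeros(2), np.array([2.0, 0.1])]

# Silence the 0/0 warning; the NaN itself is what we want to look at.
with np.errstate(invalid="ignore", divide="ignore"):
    raw = np.array([v @ query / np.linalg.norm(v) / np.linalg.norm(query)
                    for v in toy_phrase_vectors])
cleaned = np.nan_to_num(raw)  # NaN -> 0.0

best_raw = int(np.argsort(raw)[-1])          # the NaN entry wins
best_cleaned = int(np.argsort(cleaned)[-1])  # the truly closest vector wins
print(best_raw, best_cleaned)
```

An alternative would be to drop zero vectors before ranking; mapping NaN to 0 is simply the least intrusive fix for the cell above.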

In [154]:
for i in np.argsort(cosines)[-5:]:
    print(chosen_phrases[i])

What is the best modern programming language?

What is the best programming language in 2016?

Which is the best programming language?

What is the best IoT programming language?

What is the best programming language?

