## Seminar 1: Fun with Word Embeddings (3 points)

Today we're going to play with word embeddings: train our own little embedding, load one from the gensim model zoo and use it to visualize text corpora.

This whole thing is going to happen on top of an embedding dataset: a collection of Quora questions (`quora.txt`).

__Requirements:__ `pip install --upgrade nltk gensim bokeh`, but only if you're running locally.

In [1]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

/bin/sh: wget: command not found
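If `wget` is not available on your machine (as in the output above), the same file can be fetched from Python. This is a minimal sketch using only the standard library; the helper name `download` is made up for this example, and the URL is the one from the cell above.

```python
import shutil
import urllib.request
from pathlib import Path

QUORA_URL = "https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1"


def download(url, destination):
    """Fetch `url` into `destination`; skip the request if the file already exists."""
    destination = Path(destination)
    if destination.exists():
        return destination
    # stream the response to disk instead of holding it all in memory
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Usage (needs network access):
# download(QUORA_URL, "./quora.txt")
```

The download call is left commented out so the cell does not hit the network unless you ask it to.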


In [2]:
import numpy as np

data = list(open("/Users/aleksandr/Desktop/nlp sem1/quora.txt"))
data[50]

"What TV shows or books help you read people's body language?\n"

In [26]:
fun = list(open("/Users/aleksandr/Desktop/nlp sem1/comments.tsv"))

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with all the punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.
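To see why `str.split` falls short, here is a small standard-library sketch. The regular expression is an approximation of what a word/punctuation tokenizer does (runs of word characters, or runs of punctuation); it is an illustration, not a replacement for `nltk`.

```python
import re

text = "What TV shows or books help you read people's body language?"

# plain whitespace splitting leaves punctuation glued to the words
naive_tokens = text.split()

# word/punctuation rule: runs of word characters, or runs of non-space punctuation
punct_aware_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(naive_tokens[-3:])        # the last word still carries the question mark
print(punct_aware_tokens[-6:])  # punctuation becomes separate tokens
```

With the second rule the question mark and the apostrophe become tokens of their own, so `language` and `language?` are no longer treated as different words.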

In [3]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']


In [27]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(fun[50]))

['The', 'linked', 'article', 'is', 'incorrect', '-', 'the', '45', '%', 'figure', 'refers', 'to', 'European', 'popular', '(', 'EU', 'citizens', ')', 'support', 'for', 'Romania', "'", 's', 'accession', '.', 'The', '70', '%', 'figure', 'refers', 'to', 'Romanian', 'citizens', 'support', 'for', 'EU', 'accession', '.', '22', ':', '49', ',', '18', 'June', '"']


In [4]:
# TASK: lowercase everything and extract tokens with tokenizer. 
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(stroka.lower())
           for stroka in data
           ]

In [28]:
# Same as above, but for the comments data:
# fun_tok should be a list of lists of tokens for each line in fun.

fun_tok = [tokenizer.tokenize(stroka.lower())
           for stroka in fun
           ]

In [5]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [29]:
assert all(isinstance(row, (list, tuple)) for row in fun_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in fun_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, fun_tok))), "please make sure to lowercase the data"

In [6]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


In [30]:
print([' '.join(row) for row in fun_tok[:2]])

['should_ban comment_text', "0 the picture on the article is not of the actor t . r . knight who is the subject of the article . even a basic google search turns up pictures of the t . r . knight about which this article is written . the photo continually being added to the article is , again , not a photo of this article ' s subject . because this photo is not relevant to the subject of the article , it should and will be removed . before my edit or those of others who have made the same edit are reverted , please provide a legitimate reason for this image ' s inclusion in this article . if a fair - use image of t . r . knight , the actor and subject of this article , can be found , then by all means it can be added , but random images should not be used as placeholders ."]


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models including word2vec.
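Word2Vec learns from pairs of words that occur close to each other. The sketch below shows how such (target, context) pairs can be extracted with a sliding window. It is a simplified illustration: the function name `context_pairs` is made up here, and real implementations add details such as subsampling of frequent words and negative sampling.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token and its neighbours within `window`."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


tokens = ["what", "are", "word", "embeddings", "?"]
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

A larger `window` produces more pairs per word and tends to capture topical rather than syntactic similarity.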

In [7]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok, 
                 size=32,      # embedding vector size (this argument is called `vector_size` in gensim >= 4.0)
                 min_count=5,  # consider words that occurred at least 5 times
                 window=5).wv  # define context as a 5-word window around the target word

In [32]:
model2 = Word2Vec(fun_tok, 
                 size=32,      # embedding vector size
                 min_count=5,  # consider words that occurred at least 5 times
                 window=5).wv 

In [8]:
# now you can get word vectors !
model.get_vector('anything')

array([ 0.24363032, -1.4633673 ,  0.7739472 ,  1.5586306 , -1.9316609 ,
       -2.515978  ,  0.56748325,  0.7523742 , -1.8805869 ,  1.4899508 ,
       -3.0239196 ,  0.93387145, -0.3821936 ,  2.3217874 , -1.7319866 ,
        1.225801  , -2.482453  , -0.21388696, -0.00845222, -0.79191124,
        1.0161442 , -5.375405  , -2.6267896 , -1.7173827 , -0.19293964,
       -0.16364563, -2.2288933 ,  3.124374  , -0.66984534,  1.5793252 ,
       -3.9895086 , -1.8568345 ], dtype=float32)

In [33]:
model2.get_vector('anything')

array([-0.59749156, -0.32966417, -0.12685   , -0.37127084, -0.12289326,
        0.04762203, -0.522953  , -0.14657001, -0.71218586, -0.10315451,
       -0.06352486, -0.10757888, -0.07487528, -0.4138363 , -0.11967427,
        0.2936894 ,  0.1740208 ,  0.35964048,  0.7251248 , -0.23498173,
       -0.13719594,  0.029267  , -0.02109293, -0.09748347, -0.02970869,
        0.00824126,  0.2603816 ,  0.2184948 , -0.42382818, -0.20892261,
        0.23178776, -0.11588732], dtype=float32)

In [10]:
# or query similar words directly. Go play with it!
model.most_similar('something')

[('anything', 0.9187126159667969),
 ('nothing', 0.8796188831329346),
 ('everything', 0.8601042032241821),
 ('everyone', 0.8070036172866821),
 ('it', 0.76922607421875),
 ('nobody', 0.7472575902938843),
 ('everybody', 0.7231194972991943),
 ('someone', 0.7135562896728516),
 ('things', 0.712706446647644),
 ('him', 0.7047159671783447)]
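The scores in lists like this are cosine similarities between word vectors. Here is a minimal sketch of that ranking on made-up 3D vectors (the numbers in `toy_vectors` are invented for illustration and have nothing to do with the trained model).

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# hypothetical toy vectors, not taken from any real model
toy_vectors = {
    "something": np.array([1.0, 2.0, 0.0]),
    "anything": np.array([2.0, 4.0, 0.1]),
    "percentile": np.array([-1.0, 0.5, 3.0]),
}

query = toy_vectors["something"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "something"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length, so a word that points in the same direction as the query scores close to 1 regardless of its norm.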

In [37]:
model2.most_similar('lol')

[('big', 0.9969120621681213),
 ('*', 0.9964557886123657),
 ('getting', 0.9964392185211182),
 ('facts', 0.9964371919631958),
 ('asshole', 0.9963735342025757),
 ('yes', 0.9962925910949707),
 ('ம', 0.9962581396102905),
 ('five', 0.9962255358695984),
 ('ு', 0.9959668517112732),
 ('everyone', 0.9959409236907959)]

In [12]:
model.most_similar('cat')

[('bitsat', 0.7998226881027222),
 ('xat', 0.7979891300201416),
 ('sat', 0.7926539182662964),
 ('gmat', 0.78383469581604),
 ('percentile', 0.7837237119674683),
 ('gre', 0.7825124263763428),
 ('aipmt', 0.7782830595970154),
 ('toefl', 0.7765294313430786),
 ('ielts', 0.7764476537704468),
 ('jee', 0.7487885355949402)]

In [13]:
model.most_similar('love')

[('feelings', 0.8214847445487976),
 ('happy', 0.7961887121200562),
 ('friends', 0.761274516582489),
 ('dream', 0.7451550960540771),
 ('talk', 0.7391836643218994),
 ('girlfriend', 0.7274563312530518),
 ('friendship', 0.7273074388504028),
 ('talking', 0.725829005241394),
 ('crush', 0.7251476049423218),
 ('loved', 0.7251184582710266)]

In [41]:
model2.most_similar(positive=["love"], negative=["she"])

[('slap', 0.5344714522361755),
 ('.""""""""', 0.5006606578826904),
 ('dick', 0.4522987902164459),
 ('fggt', 0.4338076114654541),
 ('bitch', 0.4170183539390564),
 ('thank', 0.3874306082725525),
 ('before', 0.3706950843334198),
 ('suck', 0.36481615900993347),
 ('you', 0.3410276174545288),
 ('racist', 0.31269142031669617)]

### Using pre-trained model

That took a while, huh? Now imagine training life-sized (100~300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no SMS required, promise).

In [11]:
import gensim.downloader as api
model = api.load('glove-twitter-100')

In [15]:
model.most_similar(positive=["sex"], negative=["love"])

[('consensual', 0.49876511096954346),
 ('freeporn', 0.48580604791641235),
 ('اسلحة', 0.4827759861946106),
 ('ميدانية', 0.48087844252586365),
 ('обязать', 0.47639912366867065),
 ('นอนกลางวัน', 0.470897376537323),
 ('jtrouveça', 0.4651491940021515),
 ('السداد', 0.4637960195541382),
 ('انشقاقات', 0.46149832010269165),
 ('hesabinda', 0.46128562092781067)]

In [17]:
model.most_similar('relation')

[('relations', 0.6948665380477905),
 ('discrete', 0.6438723802566528),
 ('communication', 0.6397368907928467),
 ('regard', 0.6220911741256714),
 ('dispute', 0.6104023456573486),
 ('vie', 0.6092134714126587),
 ("d'une", 0.6085295677185059),
 ('relationship', 0.6016569137573242),
 ('longue', 0.6004472970962524),
 ('discrète', 0.5978078246116638)]

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors live in a 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [18]:
words = sorted(model.vocab.keys(), 
               key=lambda word: model.vocab[word].count,
               reverse=True)[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [19]:
# for each word, compute its vector with model
word_vectors = np.array([model.get_vector(word) for word in words])

In [20]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
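One way to obtain such a pair of matrices is through the singular value decomposition: take the top $d$ right singular vectors of the centered $X$ as the columns of $W$ and use $\hat W = W^T$. The sketch below does this on random toy data (the data and variable names are made up for illustration; sklearn's `PCA` handles centering and more for you).

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: n=200 samples, m=5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # PCA assumes centered data

d = 2
# rows of Vt are the principal directions, sorted by explained variance
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T      # (m, d) direct transformation
W_hat = Vt[:d]    # (d, m) reverse transformation

X_2d = X @ W
reconstruction_error = np.mean((X_2d @ W_hat - X) ** 2)
print(X_2d.shape, reconstruction_error)
```

Increasing `d` can only lower the reconstruction error, and with `d == m` the data is recovered exactly.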



In [21]:
from sklearn.decomposition import PCA

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)

# rescale each axis to zero mean and unit variance
word_vectors_pca = (word_vectors_pca - word_vectors_pca.mean(axis=0)) / word_vectors_pca.std(axis=0)

In [22]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [23]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxilirary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [24]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points close to each other, we could use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [31]:
from sklearn.manifold import TSNE

# map word vectors onto 2d plane with TSNE. hint: use verbose=100 to see what it's doing.
# normalize them just like with pca

word_tsne = TSNE().fit_transform(word_vectors)
word_tsne = (word_tsne - word_tsne.mean(axis=0)) / word_tsne.std(axis=0)

In [32]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase, possibly with some weights.

This trick is useful to identify what data you are working with: find out if there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!
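Before applying it to real data, here is what a weighted average looks like on toy inputs. Everything in this sketch is hypothetical: the 2D vectors, the per-word weights (imagine something like inverse frequency) and the helper name `weighted_phrase_vector`.

```python
import numpy as np

# hypothetical toy embeddings and weights, for illustration only
toy_vectors = {
    "the": np.array([1.0, 1.0], dtype="float32"),
    "deep": np.array([1.0, 0.0], dtype="float32"),
    "learning": np.array([0.0, 1.0], dtype="float32"),
}
toy_weights = {"the": 0.5, "deep": 2.0, "learning": 2.0}


def weighted_phrase_vector(tokens, vectors, weights, dim):
    """Weighted mean of the vectors of known tokens; zeros if none are known."""
    known = [tok for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim, dtype="float32")
    w = np.array([weights.get(tok, 1.0) for tok in known], dtype="float32")
    stacked = np.stack([vectors[tok] for tok in known])
    return (w[:, None] * stacked).sum(axis=0) / w.sum()


phrase_vector = weighted_phrase_vector(
    ["the", "deep", "learning", "zzz"], toy_vectors, toy_weights, dim=2
)
print(phrase_vector)
```

Setting all weights to 1 gives the plain average; down-weighting frequent words like "the" keeps them from dominating the phrase vector.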


In [None]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    vector = np.zeros([model.vector_size], dtype='float32')

    tokens = tokenizer.tokenize(phrase.lower())
    known_vectors = [model.get_vector(tok) for tok in tokens if tok in model]
    if known_vectors:
        vector = np.mean(known_vectors, axis=0)

    return vector

In [33]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.31807372, -0.02558171,  0.0933293 , -0.1002182 , -1.0278689 ,
                             -0.16621883,  0.05083408,  0.17989802,  1.3701859 ,  0.08655966],
                              dtype=np.float32))

In [None]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = np.array([get_phrase_embedding(phrase) for phrase in chosen_phrases])

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = TSNE(verbose=1000).fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

Finally, let's build a simple "similar question" engine with phrase embeddings we've built.

In [None]:
# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(l) for l in data])

In [None]:
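def find_nearest(query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    hint: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    """
    query_vector = get_phrase_embedding(query)

    # cosine similarity; the small epsilon avoids division by zero for all-zero vectors
    norms = np.linalg.norm(data_vectors, axis=1) * np.linalg.norm(query_vector)
    similarities = data_vectors.dot(query_vector) / (norms + 1e-9)

    # indices of the k highest similarities, most similar first
    top_k = np.argsort(-similarities)[:k]
    return [data[i] for i in top_k]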

In [None]:
results = find_nearest(query="How do i enter the matrix?", k=10)

print(''.join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == 'How do I get to the dark web?\n'
assert results[3] == 'What can I do to save the world?\n'

In [None]:
find_nearest(query="How does Trump?", k=10)

In [None]:
find_nearest(query="Why don't i ask a question myself?", k=10)

__Now what?__
* Try running TSNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Speed up `find_nearest` with faster nearest-neighbor search: locality-sensitive hashing via [nearpy](https://github.com/pixelogik/NearPy), or the search structures in `sklearn.neighbors`.
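
As a starting point for the last item, here is a sketch of one classic locality-sensitive hashing scheme for cosine similarity: random hyperplanes. It assumes random stand-in vectors instead of the real `data_vectors`, and the names `signature` and `approximate_nearest` are made up for this example; a library such as nearpy adds multiple hash tables and other refinements.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16)).astype("float32")  # stand-in for data_vectors

n_bits = 8
hyperplanes = rng.normal(size=(16, n_bits))  # one random hyperplane per hash bit


def signature(v):
    """Bit pattern telling on which side of each hyperplane `v` lies."""
    return tuple((v @ hyperplanes > 0).astype(int))


# index: vectors with the same signature land in the same bucket
buckets = defaultdict(list)
for index, vec in enumerate(vectors):
    buckets[signature(vec)].append(index)


def approximate_nearest(query, k=5):
    """Rank only the vectors from the query's bucket by cosine similarity."""
    candidates = np.array(buckets.get(signature(query), []), dtype=int)
    if len(candidates) == 0:
        return []
    cand_vecs = vectors[candidates]
    norms = np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query)
    sims = cand_vecs @ query / (norms + 1e-9)
    return candidates[np.argsort(-sims)[:k]].tolist()


print(approximate_nearest(vectors[42], k=5))
```

Vectors with a small angle between them are likely to share a signature, so each query scans one bucket instead of the whole dataset. The price is that true neighbors in other buckets are missed, which is why practical setups query several independent hash tables.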