## Fun with Word Embeddings

Today we are going to play with word embeddings: train our own little embedding, load one from a `gensim` model zoo and use it to visualize text corpora.

All of this is going to happen on top of a dataset of Quora questions.

__Requirements:__ `pip install --upgrade nltk gensim bokeh`, but only if you're running locally. All of these are preinstalled in Colab.

In [1]:
from pathlib import Path
import numpy as np

data_path = Path('./quora.txt')

if not data_path.exists():
    !wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O $data_path
    # alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw


Let's have a look at the data:

In [2]:
!head -n 20 $data_path

Can I get back with my ex even though she is pregnant with another guy's baby?
What are some ways to overcome a fast food addiction?
Who were the great Chinese soldiers and leaders who fought in WW2?
What are ZIP codes in the Bay Area?
Why was George RR Martin critical of JK Rowling after losing the Hugo award?
What can I do to improve my immune system?
How is your relationship with your mother in law?
How does one get Free PSN codes/Vita Codes?
What is your review of osquery?
How can I look smart and act smart?
Which brand should go with the GTX 960 graphic card, MSI, Zotac or ASUS?
What is the ZIP code of India?
As an Indian doctor practicing in my own private clinic. How can I earn more? What other sources of income can I use?
Is Windows the only OS that is not based on UNIX or Linux?
How can I avoid using Facebook?
How do you get a Norton Security renewal code?
Does EMC plan a lightweight web client for Documentum?
Where is the strangest place you've ever masturbated?
Tamil Nadu, I

In [8]:
!cat $data_path | wc -l

537272


In [3]:
with data_path.open(encoding='utf-8') as fp:
    data = fp.readlines()

data[50]

"What TV shows or books help you read people's body language?\n"

In [9]:
len(data)

537272

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and smileys still attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.

In [4]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
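
To see why whitespace splitting falls short, here is a minimal sketch that uses only the standard library. The pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation) is assumed here to mirror what `WordPunctTokenizer` does, so treat it as an illustration rather than a drop-in replacement.

```python
import re

text = "What TV shows or books help you read people's body language?"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Regex splitting: either a run of word characters or a run of punctuation.
# Assumption: this pattern approximates WordPunctTokenizer's behaviour.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens[-3:])
print(regex_tokens[-6:])
```

With whitespace splitting, `language?` and `language` would end up as two unrelated vocabulary entries, which is exactly what we want to avoid before training embeddings.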


In [5]:
# TASK: lowercase everything and extract tokens with tokenizer. 
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [
    tokenizer.tokenize(line.lower()) for line in data
]

In [6]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), 'please convert each line into a list of tokens (strings)'
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), 'please convert each line into a list of tokens (strings)'
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), 'please make sure to lowercase the data'

In [7]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __`gensim`__ is another NLP library that features many vector-based models, including word2vec.

In [10]:
%%time

from gensim.models import Word2Vec
model = Word2Vec(
    data_tok, 
    size=32,      # embedding vector size (this argument is named `vector_size` in gensim >= 4.0)
    min_count=5,  # consider words that occurred at least 5 times
    window=5).wv  # define context as a 5-word window around the target word

CPU times: user 1min 13s, sys: 527 ms, total: 1min 13s
Wall time: 43.7 s
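
The `window` argument decides which neighbours count as context for a target word. As a rough sketch of the idea, the hypothetical helper `skipgram_pairs` below lists the (target, context) pairs a skip-gram style model would learn from. It is not how `gensim` trains internally: `Word2Vec` defaults to the CBOW variant and adds further sampling tricks, but both variants slide the same kind of window over the text.

```python
def skipgram_pairs(tokens, window=2):
    """Hypothetical helper: list (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # window start, clipped at sentence start
        hi = min(len(tokens), i + window + 1)  # window end (exclusive), clipped at sentence end
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(['what', 'are', 'zip', 'codes'], window=1)
print(pairs)
```

Words that keep showing up in each other's windows get pulled towards similar vectors, which is why a larger window tends to capture more topical similarity.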


In [11]:
# now you can get word vectors!
model.get_vector('anything')

array([ 1.0561653 , -1.8358444 , -1.8009949 ,  3.0314925 ,  1.4732652 ,
       -1.3237181 ,  0.9063304 ,  0.38110554,  2.0166068 ,  0.10219495,
       -3.2527244 , -0.28727165,  1.8308543 ,  1.8197738 , -2.892157  ,
        1.7162379 , -1.22475   ,  1.8028291 , -2.0988555 ,  1.249832  ,
        0.48865122, -1.85176   ,  3.4207127 ,  2.3630931 ,  3.985942  ,
        0.2967367 , -0.23675859,  2.3710923 ,  2.1950922 ,  1.1745669 ,
        2.8915288 , -0.13079987], dtype=float32)

In [12]:
# or query similar words directly. Go play with it!
model.most_similar('bread')

[('rice', 0.9634268283843994),
 ('sauce', 0.9260879755020142),
 ('fruit', 0.9256283044815063),
 ('butter', 0.9219865202903748),
 ('corn', 0.9189150333404541),
 ('banana', 0.9187854528427124),
 ('beans', 0.9159899950027466),
 ('cheese', 0.9119134545326233),
 ('potatoes', 0.9089402556419373),
 ('sugar', 0.905427098274231)]
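
The scores above are cosine similarities between word vectors. A minimal NumPy sketch of such a lookup follows; the four 3D vectors are made up for illustration and `toy_most_similar` is a hypothetical helper, not the `gensim` implementation.

```python
import numpy as np

# Made-up toy vectors, for illustration only.
vocab = ['bread', 'rice', 'butter', 'laptop']
vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.8, 0.7, 0.2],
    [0.0, 0.1, 1.0],
], dtype=np.float32)

def toy_most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]  # cosine = dot product of unit vectors
    order = np.argsort(-sims)              # indices from most to least similar
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

neighbours = toy_most_similar('bread')
print(neighbours)
```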

### Using pre-trained model

That took a while, huh? Now imagine training life-sized (100-300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no SMS required, promise).

In [13]:
import gensim.downloader

# Embeddings take a while to download, so we fetch a small (25D) model
# List of available models: https://github.com/RaRe-Technologies/gensim-data
# or simply try gensim.downloader.info()
model = gensim.downloader.load('glove-twitter-25')



In [14]:
model.most_similar(positive=['coder', 'money'], negative=['brain'])

[('realtor', 0.8265186548233032),
 ('gfx', 0.8249695897102356),
 ('caterers', 0.798202395439148),
 ('beatmaker', 0.7936854362487793),
 ('recruiter', 0.7892400622367859),
 ('sfi', 0.784467339515686),
 ('sosh', 0.7840631604194641),
 ('promoter', 0.7838252186775208),
 ('smallbusiness', 0.7786215543746948),
 ('promoters', 0.7764680981636047)]
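
Queries with `positive` and `negative` words boil down to vector arithmetic: add the positive vectors, subtract the negative ones, and look for the nearest remaining word. The sketch below shows this on made-up 2D vectors (first axis roughly "royalty", second roughly "gender"). The `analogy` helper is hypothetical and only approximates what `most_similar` does.

```python
import numpy as np

# Made-up 2D toy vectors, for illustration only.
toy = {
    'king':   np.array([1.0, 1.0]),
    'queen':  np.array([1.0, -1.0]),
    'man':    np.array([0.1, 1.0]),
    'woman':  np.array([0.1, -1.0]),
    'prince': np.array([0.9, 0.8]),
    'apple':  np.array([-1.0, 0.0]),
}

def unit(v):
    return v / np.linalg.norm(v)

def analogy(positive, negative, topn=1):
    """Average unit vectors (+1 for positive, -1 for negative), then rank by cosine."""
    signed = [unit(toy[w]) for w in positive] + [-unit(toy[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    used = set(positive) | set(negative)   # query words are excluded from the answer
    scores = {w: float(unit(v) @ query) for w, v in toy.items() if w not in used}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

result = analogy(positive=['king', 'woman'], negative=['man'])
print(result)
```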

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. The thing is, those vectors live in a 20D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words:

In [15]:
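# note: `.vocab` is the gensim 3.x API; gensim >= 4.0 replaces it with
# `model.key_to_index` and `model.get_vecattr(word, 'count')`
words = sorted(
    model.vocab.keys(), 
    key=lambda word: model.vocab[word].count,
    reverse=True)[:1000]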

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [17]:
# For each word, compute its vector with the model
word_vectors = np.vstack([model.get_vector(word) for word in words])

In [18]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), model.vector_size)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src='https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png' style='width:30%'>


Under the hood, it attempts to decompose the object-feature matrix $X$ into two smaller matrices, $W$ and $\hat W$, minimizing the _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to \min_{W, \hat{W}}$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
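
The objective above can be checked numerically. The sketch below builds $W$ from the top-$d$ right singular vectors of a centered toy matrix and takes $\hat W = W^T$, which is one standard minimizer of the reconstruction error. The random data and the helper names `pca_matrices` and `reconstruction_error` are assumptions made for illustration; `sklearn`'s `PCA` may differ in details such as sign conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data: n=200, m=5
X = X - X.mean(axis=0)                                    # PCA expects a centered X

def pca_matrices(X, d):
    """Return W (m x d) and W_hat (d x m) from the top-d right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:d].T   # direct transformation
    W_hat = W.T    # reverse transformation
    return W, W_hat

def reconstruction_error(X, d):
    """Squared (Frobenius) norm of (X W) W_hat - X."""
    W, W_hat = pca_matrices(X, d)
    return np.linalg.norm((X @ W) @ W_hat - X) ** 2

W, W_hat = pca_matrices(X, d=2)
X_2d = X @ W
print(X_2d.shape)
print([round(reconstruction_error(X, d), 3) for d in range(1, 6)])
```

The error shrinks as $d$ grows and vanishes at $d = m$, where nothing is thrown away.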



In [26]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
word_vectors_pca = PCA(2).fit_transform(word_vectors)
word_vectors_pca = StandardScaler().fit_transform(word_vectors_pca)

In [22]:
assert word_vectors_pca.shape == (len(word_vectors), 2), 'there must be a 2d vector for each word'
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, 'points must be zero-centered'
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, 'points must have unit variance'
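
The zero-mean and unit-variance checks above describe per-column standardization. As a rough sketch of what that step computes, here is a NumPy version on random toy points; the data and the helper name `standardize` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2D points with very different spreads along the two axes.
points = rng.normal(loc=3.0, scale=[5.0, 0.5], size=(1000, 2))

def standardize(points):
    """Shift each column to zero mean and rescale it to unit variance."""
    mean = points.mean(axis=0)
    std = points.std(axis=0)  # population standard deviation (ddof=0)
    return (points - mean) / std

points_std = standardize(points)
print(points_std.mean(axis=0), points_std.std(axis=0))
```

After this step both axes have the same scale, so neither dominates the plot.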

#### Let's draw it!

In [23]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    ''' draws an interactive plot for data points with auxiliary info on hover '''
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, '@' + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [27]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice, but it's strictly linear and thus only able to capture the coarse, high-level structure of the data.

If we instead want to focus on keeping neighboring points close together, we can use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [28]:
from sklearn.manifold import TSNE

# map word vectors onto 2d plane with TSNE.
# hint: don't panic, it may take a minute or two to fit.
# normalize them just like with PCA
word_tsne = TSNE(2).fit_transform(word_vectors)

In [29]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take a (possibly weighted) __average__ of the vectors of all tokens in the phrase.

This trick is useful for getting a feel for the data you are working with: finding out whether there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!


In [30]:
def get_phrase_embedding(phrase):
    '''
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    '''
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros
    
    vector = np.zeros([model.vector_size], dtype=np.float32)

    phrase_words = [
        word for word in tokenizer.tokenize(phrase.lower())
        if word in model.vocab
    ]

    if phrase_words:
        vector = np.vstack([model.get_vector(word) for word in phrase_words]).mean(axis=0)
    
    return vector
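
`get_phrase_embedding` weights every token equally. A weighted variant of the average could look like the sketch below. The toy vectors, the weights (imagine something like inverse word frequency) and the helper name `weighted_phrase_embedding` are all hypothetical and only serve to show the arithmetic.

```python
import numpy as np

# Made-up toy vectors and weights, for illustration only.
toy_vectors = {
    'fast': np.array([1.0, 0.0]),
    'food': np.array([0.0, 1.0]),
    'the':  np.array([1.0, 1.0]),
}
toy_weights = {'fast': 1.0, 'food': 1.0, 'the': 0.1}  # frequent words get small weights

def weighted_phrase_embedding(tokens, vectors, weights, dim=2):
    """Weighted average of known token vectors; zeros if no token is known."""
    known = [tok for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    w = np.array([weights.get(tok, 1.0) for tok in known])
    stacked = np.vstack([vectors[tok] for tok in known])
    return (w[:, None] * stacked).sum(axis=0) / w.sum()

phrase_vec = weighted_phrase_embedding(['the', 'fast', 'food', 'unknownword'], toy_vectors, toy_weights)
print(phrase_vec)
```

Setting all weights to 1 recovers the plain average.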

In [31]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(
    vector[::5],
    np.array([ 0.1663555 ,  0.0201319 , -0.34534982,  0.1303215 ,  0.12880965],
              dtype=np.float32))

In [32]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = np.vstack([get_phrase_embedding(phrase) for phrase in chosen_phrases])

In [33]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [34]:
# Map vectors into 2D space with PCA, T-SNE, or your other method of choice.
# Don't forget to normalize.

phrase_vectors_2d = TSNE(2).fit_transform(phrase_vectors)

In [35]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

Finally, let's build a simple 'similar question' engine with the phrase embeddings we've built.

In [36]:
# compute vector embedding for all lines in data
data_vectors = np.vstack([get_phrase_embedding(line) for line in data])

In [37]:
assert isinstance(data_vectors, np.ndarray)
assert np.isfinite(data_vectors).all()
assert data_vectors.shape == (len(data), model.vector_size)

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

def find_nearest(query, k=10):
    '''
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    Hint 1: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    Hint 2: it might be a good idea to normalize data_vectors only once rather than for each query
    Hint 3: you will need to handle zero embedding vectors somehow
    '''

    query_embedding = get_phrase_embedding(query)
    
    sims = cosine_similarity(data_vectors, query_embedding[np.newaxis]).squeeze(axis=1)
    indices = np.argsort(sims)[-k:][::-1]
    
    return [data[idx] for idx in indices]  # top-k lines starting from most similar
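
The hints in the docstring point at a faster variant: normalize the data vectors once, guard against zero vectors, and use `np.argpartition` so that only the top `k` candidates get sorted. A minimal sketch on toy vectors follows; the helper names `normalize_rows` and `top_k_cosine` are hypothetical.

```python
import numpy as np

def normalize_rows(matrix, eps=1e-12):
    """Scale each row to unit length; all-zero rows stay zero instead of becoming NaN."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

def top_k_cosine(query_vector, unit_vectors, k):
    """Indices of the k rows most similar to the query, most similar first."""
    query_unit = query_vector / max(np.linalg.norm(query_vector), 1e-12)
    sims = unit_vectors @ query_unit         # cosine similarity for every row
    k = min(k, len(sims))
    top = np.argpartition(-sims, k - 1)[:k]  # unordered top-k in linear time
    return top[np.argsort(-sims[top])]       # sort only those k candidates

toy_data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
unit_vectors = normalize_rows(toy_data)      # done once, reused for every query
best = top_k_cosine(np.array([1.0, 0.1]), unit_vectors, k=3)
print(best)
```

The zero row ends up with similarity 0 to every query, so it never outranks a genuinely similar line.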

In [40]:
results = find_nearest(query='How do I enter the Matrix?', k=10)

print(''.join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == 'How do I get to the dark web?\n'
assert results[3] == 'How do you print the gridlines in Excel 2007?\n'

How do I get to the dark web?
What universal remote do I need and how do I set it up to a Blaupunkt TV?
How do I connect the ASUS_T00Q to my PC?
How do you print the gridlines in Excel 2007?
How do you print the gridlines in Excel 2010?
How do you print the gridlines in Excel 2003?
I would like to create a new website. What do I have to do?
How do I get the new Neko Atsume wallpapers? How do they work?
I want to experience the 4G network. Do I need to change my SIM card from 3G to 4G?
What do I have to do to sell my photography?



In [41]:
find_nearest(query='How does Trump?', k=10)

['What does Donald Trump think about Israel?\n',
 'Who or what is Donald Trump, really?\n',
 'Donald Trump: Why are there are so many questions about Donald Trump on Quora?\n',
 'Does anyone like Trump and Clinton?\n',
 'What does Cortana mean?\n',
 'Did Bill Gates outcompete and outsmart IBM? Why? How?\n',
 'Why and how is Bill Gates so rich?\n',
 'What does Donald Trump think about Pakistan?\n',
 'What do you think about Trump and Obama?\n',
 'Who and what is Quora?\n']

In [42]:
find_nearest(query="Why don't I ask a question myself?", k=10)

["Why don't my parents listen to me?\n",
 "Why don't people appreciate me?\n",
 "Why she don't interact with me?\n",
 "Why don't I get a date?\n",
 "Why don't I get a girlfriend?\n",
 "Why don't I have a girlfriend?\n",
 "Why don't I have a boyfriend?\n",
 "Why don't I like people touching me?\n",
 "Why can't I ask a question anonymously?\n",
 "Why don't you use Facebook much?\n"]

__Now what?__
* Try running t-SNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Optimize find_nearest with locality-sensitive hashing: use [nearpy](https://github.com/pixelogik/NearPy) or `sklearn.neighbors`.
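
For the last item, here is a rough sketch of one locality-sensitive hashing scheme, random hyperplanes, written in plain NumPy rather than with the libraries linked above. The data is random, and the names `planes`, `lsh_key` and `candidates` are made up for illustration; a practical setup would use several hash tables and re-rank the candidates by exact cosine similarity.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 25, 8
planes = rng.normal(size=(n_planes, dim))  # random hyperplanes through the origin

def lsh_key(vector):
    """Hash = which side of each hyperplane the vector falls on."""
    return tuple((planes @ vector > 0).tolist())

# Index toy vectors: vectors pointing in similar directions tend to share a bucket.
vectors = rng.normal(size=(1000, dim))
buckets = defaultdict(list)
for idx, vec in enumerate(vectors):
    buckets[lsh_key(vec)].append(idx)

def candidates(query):
    """Only vectors in the query's bucket need an exact similarity check."""
    return buckets.get(lsh_key(query), [])

print(len(candidates(vectors[42])), 'candidates instead of', len(vectors))
```

The hash depends only on direction, not length, which matches the cosine similarity used in `find_nearest`.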

## Acknowledgements

This notebook is based on the [notebook for Seminar 1](https://github.com/yandexdataschool/nlp_course/blob/28a92e376f5229fe57f6e704c9f927909265b1e2/week01_embeddings/seminar.ipynb) in the YSDA NLP course.