## Seminar 1: Fun with Word Embeddings (3 points)

Today we're going to play with word embeddings: train our own little embeddings, load a pre-trained model from the gensim model zoo, and use it to visualize text corpora.

All of this happens on top of a dataset of Quora questions (`quora.txt`).

__Requirements:__ `pip install --upgrade nltk gensim bokeh`, but only if you're running locally.

In [33]:
!pip install --upgrade nltk gensim bokeh



In [34]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw


In [35]:
import numpy as np

with open("./quora.txt", encoding="utf-8") as file:
    data = list(file)

data[50]

"What TV shows or books help you read people's body language?\n"

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and emoticons attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.

In [36]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
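To see why `str.split` falls short, here is a small sketch comparing it with a regular expression. The pattern `\w+|[^\w\s]+` is, as far as I know, close to what `WordPunctTokenizer` does internally, but treat that as an assumption rather than a description of nltk's code.

```python
import re

text = "What TV shows or books help you read people's body language?"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()

# Runs of word characters OR runs of non-word, non-space characters.
# Assumption: this approximates WordPunctTokenizer's behaviour.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(naive_tokens[-3:])   # ["people's", 'body', 'language?']
print(regex_tokens[-6:])   # ['people', "'", 's', 'body', 'language', '?']
```

With whitespace splitting, `language?` and `language` would become two unrelated vocabulary entries, which is exactly what we want to avoid before training embeddings.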


In [37]:
data[:2]

["Can I get back with my ex even though she is pregnant with another guy's baby?\n",
 'What are some ways to overcome a fast food addiction?\n']

In [38]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(q.lower()) for q in data]
data_tok[:2]

[['can',
  'i',
  'get',
  'back',
  'with',
  'my',
  'ex',
  'even',
  'though',
  'she',
  'is',
  'pregnant',
  'with',
  'another',
  'guy',
  "'",
  's',
  'baby',
  '?'],
 ['what',
  'are',
  'some',
  'ways',
  'to',
  'overcome',
  'a',
  'fast',
  'food',
  'addiction',
  '?']]

In [39]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [40]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models, including word2vec.

In [41]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok,
                 vector_size=32,  # embedding vector size
                 min_count=5,     # only consider words that occurred at least 5 times
                 window=5).wv     # define context as a 5-word window around the target word

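`wv` stands for word vectors: it is the `KeyedVectors` object that stores the trained embeddings. From here on we only need its lookup and similarity methods, so we keep just that part of the model.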
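The `min_count` and `window` arguments passed to `Word2Vec` above can be illustrated with a tiny pure-Python sketch. The helper names and the toy corpus are made up for this example, and this is a simplified picture of the idea, not gensim's actual training code (which, as far as I know, also does things like subsampling frequent words and randomly shrinking the window).

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def context_pairs(sentence, vocab, window):
    """Yield (target, context) pairs within `window` positions on each side."""
    kept = [tok for tok in sentence if tok in vocab]  # rare words are dropped first
    for i, target in enumerate(kept):
        lo, hi = max(0, i - window), min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, kept[j]

# Hypothetical toy corpus, already lowercased and tokenized.
toy_corpus = [
    ["what", "is", "the", "best", "way", "to", "learn", "python"],
    ["what", "is", "the", "best", "way", "to", "learn", "java"],
]
vocab = build_vocab(toy_corpus, min_count=2)
pairs = list(context_pairs(toy_corpus[0], vocab, window=2))
best_contexts = [ctx for tgt, ctx in pairs if tgt == "best"]

print(sorted(vocab))     # 'python' and 'java' occur once, so they are dropped
print(best_contexts)     # ['is', 'the', 'way', 'to']
```

Words that share many contexts end up with similar vectors, which is why the similarity queries below return sensible neighbors.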

In [42]:
# now you can get word vectors !
model.get_vector('anything')

array([-3.2422278 ,  0.36424428,  0.57361   ,  3.762077  ,  0.46178004,
        3.1642017 ,  0.7178854 , -5.134465  ,  1.8177751 ,  2.011897  ,
        2.1127439 ,  2.2284112 ,  3.5102804 ,  0.349143  ,  2.014945  ,
       -1.0164939 , -0.1573654 ,  0.11654413, -0.3244354 , -2.7073488 ,
       -0.37938133, -0.65363765, -1.7557209 , -1.0690461 ,  2.628148  ,
       -3.083215  ,  0.54682606,  1.5426819 ,  0.9555216 , -0.48629424,
       -0.48633316,  1.0262762 ], dtype=float32)

In [43]:
# or query similar words directly. Go play with it!
model.most_similar('bread', topn=5)

[('rice', 0.9530435800552368),
 ('cheese', 0.9325336813926697),
 ('pasta', 0.9229212999343872),
 ('flour', 0.9171299934387207),
 ('potatoes', 0.9155970215797424)]

In [44]:
model.most_similar('python', topn=5)

[('javascript', 0.9597877264022827),
 ('java', 0.9563407897949219),
 ('php', 0.942493736743927),
 ('angularjs', 0.8980391025543213),
 ('programming', 0.8978450298309326)]

In [45]:
model.most_similar('gta', topn=5)

[('playstation', 0.8977976441383362),
 ('4k', 0.8182986974716187),
 ('vr', 0.8149808645248413),
 ('fifa', 0.8147022724151611),
 ('dota', 0.8077454566955566)]
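The scores returned by `most_similar` are cosine similarities. Here is a minimal numpy sketch of that ranking on a handful of invented 3-d vectors; the function name and the vectors are hypothetical, and the real gensim implementation is more elaborate.

```python
import numpy as np

# Hypothetical 3-d vectors, made up purely for illustration.
toy_vectors = {
    "bread":  np.array([0.9, 0.1, 0.0]),
    "rice":   np.array([0.8, 0.2, 0.1]),
    "python": np.array([0.0, 0.9, 0.4]),
    "java":   np.array([0.1, 0.8, 0.5]),
}

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    keys = [k for k in vectors if k != word]
    matrix = np.stack([vectors[k] for k in keys])
    # Normalize rows and the query so that a dot product equals the cosine.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = vectors[word] / np.linalg.norm(vectors[word])
    sims = matrix @ query
    order = np.argsort(-sims)[:topn]
    return [(keys[i], float(sims[i])) for i in order]

print(most_similar_sketch("bread", toy_vectors))
print(most_similar_sketch("python", toy_vectors))
```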

### Using pre-trained model

That took a while, huh? Now imagine training life-sized (100~300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [46]:
import gensim.downloader as api
model = api.load('glove-twitter-100')

In [47]:
# I guess it's coder + money - brain
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820155739784241),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385112762451172),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233338832855),
 ('treet', 0.4922019839286804),
 ('shopper', 0.4920562207698822),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311905860901),
 ('aupair', 0.4796452522277832)]

In [48]:
# Ladies and gentlemen, we got it!  (king-man+woman)
model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)

[('queen', 0.7052316069602966)]
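The `positive`/`negative` query is just vector arithmetic followed by a nearest-neighbor lookup. Below is a toy sketch with invented 2-d vectors (axis 0 loosely "royalty", axis 1 loosely "gender"). As far as I understand, gensim likewise normalizes the vectors and excludes the query words from the candidates, but this is an illustration under those assumptions, not its implementation.

```python
import numpy as np

# Hypothetical 2-d vectors, made up for illustration.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy_sketch(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    excluded = set(positive) | set(negative)   # query words cannot be the answer
    scores = {w: float(unit(v) @ unit(target))
              for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)

answer = analogy_sketch(positive=["king", "woman"], negative=["man"], vectors=toy)
print(answer)
```

Excluding the query words matters: without it, the nearest vector to `king - man + woman` is often `king` itself.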

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [49]:
words = model.index_to_key[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [50]:
# for each word, compute its vector with model
word_vectors = [model.get_vector(word) for word in words]
word_vectors = np.array(word_vectors)
word_vectors[:2]

array([[ 0.63006  ,  0.65177  ,  0.25545  ,  0.018593 ,  0.043094 ,
         0.047194 ,  0.23218  ,  0.11613  ,  0.17371  ,  0.40487  ,
         0.022524 , -0.076731 , -2.2911   ,  0.094127 ,  0.43293  ,
         0.041801 ,  0.063175 , -0.64486  , -0.43657  ,  0.024114 ,
        -0.082989 ,  0.21686  , -0.13462  , -0.22336  ,  0.39436  ,
        -2.1724   , -0.39544  ,  0.16536  ,  0.39438  , -0.35182  ,
        -0.14996  ,  0.10502  , -0.45937  ,  0.27729  ,  0.8924   ,
        -0.042313 , -0.009345 ,  0.55017  ,  0.095521 ,  0.070504 ,
        -1.1781   ,  0.013723 ,  0.17742  ,  0.74142  ,  0.17716  ,
         0.038468 , -0.31684  ,  0.08941  ,  0.20557  , -0.34328  ,
        -0.64303  , -0.878    , -0.16293  , -0.055925 ,  0.33898  ,
         0.60664  , -0.2774   ,  0.33626  ,  0.21603  , -0.11051  ,
         0.0058673, -0.64757  , -0.068222 , -0.77414  ,  0.13911  ,
        -0.15851  , -0.61885  , -0.10192  , -0.47     ,  0.19787  ,
         0.42175  , -0.18458  ,  0.080581 , -0.2

In [51]:
word_vectors.shape

(1000, 100)

In [52]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is **P**rincipal **C**omponent **A**nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
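One minimizer of this objective can be read off the singular value decomposition of the centered $X$: take $W$ to be the top $d$ right singular vectors and $\hat{W} = W^T$. The sketch below checks this with plain numpy on synthetic data; the data and variable names are assumptions made for the example, and sklearn's `PCA` may differ in details such as the sign of the components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples in 5 dimensions with correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)          # PCA assumes a centered matrix

d = 2
# Rows of Vt are the principal directions, sorted by decreasing singular value.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T                    # direct transformation, shape (m, d)
W_hat = W.T                     # reverse transformation, shape (d, m)

X_low = X @ W                   # projected data, shape (n, d)
X_rec = X_low @ W_hat           # reconstruction back in the original space
mse = np.mean((X_rec - X) ** 2)

print(X_low.shape)              # (200, 2)
print(mse)
```

The squared reconstruction error equals the sum of the discarded squared singular values, so keeping the directions with the largest singular values is what makes the error minimal.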



In [53]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# map word vectors onto a 2d plane with PCA, using the good old sklearn API (fit, transform)
pca_model = PCA(n_components=2)
word_vectors_pca = pca_model.fit_transform(word_vectors)

# then standardize each component to zero mean and unit variance
scaler = StandardScaler()
word_vectors_pca = scaler.fit_transform(word_vectors_pca)

In [54]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [55]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [56]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice, but it's strictly linear and thus only able to capture the coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points near each other, we could use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [57]:
from sklearn.manifold import TSNE

# map word vectors onto a 2d plane with t-SNE. hint: don't panic, it may take a minute or two to fit.
# then standardize them just like with PCA

tsne_model = TSNE()
word_tsne = tsne_model.fit_transform(word_vectors)
print(word_tsne.shape)
word_tsne = StandardScaler().fit_transform(word_tsne)

(1000, 2)


In [58]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of the vectors for all tokens in the phrase, optionally with some weights.

This trick is useful for understanding what data you are working with: you can check for outliers, clusters or other artefacts.

Let's try this new hammer on our data!
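Before applying it to the real model, here is a toy sketch of the averaging step, including the optional weights. The embeddings, the weights and the function name are all invented for illustration; a real weighting scheme (for example, down-weighting very frequent words) would have to be chosen for the task at hand.

```python
import numpy as np

# Hypothetical 3-d embeddings and per-token weights, invented for this example.
toy_emb = {
    "fast": np.array([1.0, 0.0, 0.0]),
    "food": np.array([0.0, 1.0, 0.0]),
    "the":  np.array([0.0, 0.0, 1.0]),
}
toy_weight = {"fast": 1.0, "food": 1.0, "the": 0.1}  # e.g. play down a stop word

def phrase_vector(tokens, emb, weight=None, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [t for t in tokens if t in emb]
    if not known:
        return np.zeros(dim)  # nothing to average: fall back to a zero vector
    vectors = np.stack([emb[t] for t in known])
    weights = None if weight is None else [weight[t] for t in known]
    return np.average(vectors, axis=0, weights=weights)

tokens = ["the", "fast", "food", "addiction"]   # "addiction" is out of vocabulary
plain = phrase_vector(tokens, toy_emb)
weighted = phrase_vector(tokens, toy_emb, weight=toy_weight)
print(plain)
print(weighted)
```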


In [62]:
def get_phrase_embedding(phrase):
    """
    Convert a phrase to a vector by averaging its word embeddings. See description above.
    """
    # 1. lowercase the phrase
    # 2. tokenize the phrase
    # 3. average word vectors for all words in the tokenized phrase,
    #    skipping words that are not in the model's vocabulary
    # if all words are missing from the vocabulary, return zeros

    vectors = [model.get_vector(token)
               for token in tokenizer.tokenize(phrase.lower())
               if token in model.key_to_index]

    if not vectors:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

In [109]:
def get_close_phrases(query, questions, top_n=10):
    """Print the top_n questions closest to the query by cosine similarity."""
    query_emb = get_phrase_embedding(query)
    cosines = []
    for question in questions:
        q_vec = get_phrase_embedding(question)
        denom = np.linalg.norm(query_emb) * np.linalg.norm(q_vec)
        # a phrase with no in-vocabulary tokens has a zero vector: score it as 0
        cosines.append(float(np.dot(q_vec, query_emb) / denom) if denom > 0 else 0.0)

    print("===DUMB SEARCH V 1.0===")
    print(f"Query: {query}")
    print(f"Related Questions:\n{'-'*60}")
    # indices of the top_n largest cosines, best match first
    for i in np.argsort(cosines)[-top_n:][::-1]:
        print(f"{questions[i].rstrip()} | {cosines[i]:.4f}", end="\n---\n")

In [110]:
chosen_questions = data[::len(data) // 1000]
get_close_phrases("How do i enter the matrix?", chosen_questions)

===DUMB SEARCH V 1.0===
Query: How do i enter the matrix?
Related Questions:
------------------------------------------------------------
How do I learn Calculus on my own? | 0.9672
---
If I wanted to learn about the Roman Empire,what would be the best books to read? | 0.9668
---
How do I listen a song from you? | 0.9657
---
My WhatsApp chat backup got deleted from Google, I need to switch from one Android to another, the chat is there only on the phone. What should I do? | 0.9619
---
How do I run a shell script from Java code? | 0.9615
---
How do I learn to enter journal entries online in 2 weeks or so? | 0.9614
---
How do I find out if I have Siri on my phone? | 0.9609
---
R2I - How did you plan R2I from US if you own the house, i mean job search, timeline etc ? | 0.9609
---
How do you choose your first bank? | 0.9608
---
What is the best way to read a fictional book? Do you take notes when you are reading? Do you read again these notes later? | 0.9608
---


In [111]:
get_close_phrases("python or javascript?", data[:5000])

===DUMB SEARCH V 1.0===
Query: python or javascript?
Related Questions:
------------------------------------------------------------
Is java a technology or programming language? | 0.8897
---
Which language has the best future prospects: Python, Java, or JavaScript? | 0.8698
---
What are Java and Android? | 0.8624
---
Which Python module is better, envoy or subprocess module? Why? | 0.8596
---
Can I learn algorithms with JavaScript? | 0.8588
---
What is the difference between SQL Server and SQL? | 0.8533
---
I'm 15. Is there any benefit to learning HTML and CSS before learning Python? Should I learn Python after instead of JavaScript for example? | 0.8530
---
What is modular programming? | 0.8506
---
How do I build .fxml file using gradle for JAVAFX application with gradle Java plugin and application plugin by applying gradle standard source map? | 0.8499
---
What are the different programming platforms for Java? | 0.8466
---


__Now what?__
* Try running t-SNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Speed up `get_close_phrases` with faster nearest-neighbor search: try locality-sensitive hashing via [nearpy](https://github.com/pixelogik/NearPy), or the tree-based indexes in `sklearn.neighbors`.
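A cheaper first step than hashing is to keep the search exact but vectorize it: embed all questions once, normalize the rows, and answer each query with a single matrix-vector product. The sketch below uses random vectors in place of real phrase embeddings, and the function names are hypothetical; it is still brute force, not locality-sensitive hashing.

```python
import numpy as np

def build_index(phrase_vectors):
    """Normalize rows once so that cosine similarity becomes a plain dot product."""
    matrix = np.asarray(phrase_vectors, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # keep all-zero rows at zero instead of dividing by 0
    return matrix / norms

def nearest_rows(query_vector, index, top_n=10):
    """Return indices of the top_n rows most similar to the query, best first."""
    query = np.asarray(query_vector, dtype=np.float32)
    norm = np.linalg.norm(query)
    if norm == 0 or len(index) == 0:
        return np.array([], dtype=int)
    sims = index @ (query / norm)
    top_n = min(top_n, len(sims))
    top = np.argpartition(-sims, top_n - 1)[:top_n]   # unordered top_n candidates
    return top[np.argsort(-sims[top])]                # sort only those candidates

# Stand-in for real phrase embeddings: 50 random 8-d vectors.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 8))
index = build_index(fake_embeddings)
result = nearest_rows(fake_embeddings[7], index, top_n=5)
print(result[0])   # 7: every vector is its own nearest neighbor
```

The index is built once and reused across queries, whereas `get_close_phrases` re-embeds every question on each call.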