## Seminar 1: Fun with Word Embeddings (3 points)

Today we're going to play with word embeddings: train our own small embeddings, load a pre-trained one from the gensim model zoo, and use it to visualize text corpora.

All of this runs on top of a text dataset of questions (`quora.txt`, downloaded below).

__Requirements:__ `pip install --upgrade nltk gensim bokeh`, but only if you're running locally.

In [1]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

In [None]:
import numpy as np

with open("./quora.txt", encoding="utf-8") as file:
    data = list(file)

data[50]

__Tokenization:__ a typical first step in an NLP task is to split raw text into words.

The text we're working with is raw: punctuation and emoticons are still attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__, a library that handles many NLP tasks such as tokenization, stemming, and part-of-speech tagging.
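
To make the difference concrete, here is a minimal sketch that compares `str.split` with a regex-based split. It assumes that the pattern below (runs of word characters, or runs of non-space punctuation) behaves like `WordPunctTokenizer`; it is an illustration using only the standard library, not a replacement for `nltk`.

```python
import re

# Assumption: this pattern approximates what WordPunctTokenizer does
# (runs of word characters, or runs of non-space punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

text = "What TV shows or books help you read people's body language?"

naive_tokens = text.split()              # splits on whitespace only
regex_tokens = WORD_PUNCT.findall(text)  # also splits off punctuation

print(naive_tokens[-3:])  # punctuation stays glued to the words
print(regex_tokens[-6:])  # punctuation becomes separate tokens
```

With whitespace splitting, `language?` and `people's` stay single tokens, so they would not match the vocabulary entries `language` and `people`.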

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']


In [None]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

# Task 1

data_tok = [tokenizer.tokenize(line.lower()) for line in data]

In [None]:
# Tests for the task

assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [None]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ there is more than one way to train word embeddings. Word2Vec and GloVe use different objective functions, and fastText uses character-level models to train word embeddings.

The choice is huge, so let's start small: __gensim__ is another NLP library that features many vector-based models, including word2vec.
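
Word2Vec-style models learn from (center word, context word) pairs taken from a sliding window over each sentence. The sketch below only illustrates how a window size defines those pairs. The helper `context_pairs` is hypothetical and not part of gensim, and real implementations add further details (for example negative sampling) that are omitted here.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for each token and its neighbours within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


tokens = ["what", "are", "some", "ways", "to", "overcome"]
pairs = list(context_pairs(tokens, window=2))

print(len(pairs))
print(pairs[:4])
```

A larger window yields more pairs per word and tends to capture more topical similarity, while a smaller one focuses on immediate neighbours.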

In [None]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok,
                 vector_size=32,   # embedding vector size
                 min_count=5,      # consider only words that occurred at least 5 times
                 window=5).wv      # define context as a 5-word window around the target word

In [None]:
# now you can get word vectors !
model.get_vector('anything')

array([-3.836635  ,  2.2468832 ,  1.7001163 ,  2.8661318 ,  2.1633523 ,
        2.915897  ,  0.14385152, -4.315439  , -0.41966793,  1.4093474 ,
       -1.8929191 ,  1.6111836 ,  3.117166  ,  1.7736423 ,  2.716526  ,
       -1.0913929 ,  0.10203376, -1.6084105 ,  1.089419  , -1.0488383 ,
       -1.6469859 ,  1.2414336 , -1.2632091 , -2.5443711 ,  1.2786397 ,
       -2.2200696 ,  1.6575338 ,  1.5933559 ,  1.1623503 , -0.8816072 ,
        1.1318763 ,  0.8764352 ], dtype=float32)

In [None]:
# or query similar words directly. Go play with it!

model.most_similar('bread')

[('rice', 0.9530147910118103),
 ('sauce', 0.9279759526252747),
 ('beans', 0.9240584373474121),
 ('butter', 0.9197137951850891),
 ('cheese', 0.9140901565551758),
 ('corn', 0.9037295579910278),
 ('banana', 0.9030702710151672),
 ('fruit', 0.8974342942237854),
 ('grass', 0.8942432999610901),
 ('pasta', 0.8937650322914124)]

### Using a pre-trained model

Took a while, huh? Now imagine training full-size (100-300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no SMS required, promise).

In [None]:
import gensim.downloader as api
model = api.load('glove-twitter-100')



In [None]:
# # You can list all available models
# api.info()

{'corpora': {'semeval-2016-2017-task3-subtaskBC': {'num_records': -1,
   'record_format': 'dict',
   'file_size': 6344358,
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py',
   'license': 'All files released for the task are free for general research use',
   'fields': {'2016-train': ['...'],
    '2016-dev': ['...'],
    '2017-test': ['...'],
    '2016-test': ['...']},
   'description': 'SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.',
   'checksum': '701ea67acd82e75f95e1d8e62fb0ad29',
   'file_name': 'se

In [None]:
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820155739784241),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385112762451172),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233338832855),
 ('treet', 0.4922019839286804),
 ('shopper', 0.4920562207698822),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311905860901),
 ('aupair', 0.4796452522277832)]
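
Queries with `positive` and `negative` words boil down to vector arithmetic followed by a cosine-similarity ranking. The sketch below reproduces that idea on made-up 2-D vectors. The `toy` dictionary and the `analogy` helper are hypothetical, and this is a simplified illustration rather than gensim's actual implementation.

```python
import numpy as np

# Hypothetical 2-D toy embeddings, made up for illustration only.
toy = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "queen": np.array([0.1, 0.8]),
    "apple": np.array([-0.7, -0.6]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def analogy(positive, negative, topn=1):
    # Add unit-normalized positive vectors and subtract negative ones.
    query = sum(unit(toy[w]) for w in positive) - sum(unit(toy[w]) for w in negative)
    query = unit(query)
    # Rank all remaining words by cosine similarity to the query.
    exclude = set(positive) | set(negative)
    scores = {w: float(unit(v) @ query) for w, v in toy.items() if w not in exclude}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


print(analogy(positive=["king", "woman"], negative=["man"]))
```

The query words themselves are excluded from the ranking, otherwise they would usually come out on top.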

### Visualizing word vectors

One way to check whether our vectors are any good is to plot them. The catch is that these vectors live in a 30D+ space, while we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [None]:
words = model.index_to_key[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [None]:
# for each word, compute its vector with the model

# Don't forget to convert the result to a NumPy array

word_vectors = np.array([model.get_vector(word) for word in words])

In [None]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find the axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">

Under the hood, it attempts to decompose the object-feature matrix $X$ into two smaller matrices, $W$ and $\hat W$, minimizing the _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
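
The objective above can be checked numerically. The sketch below is a minimal NumPy illustration on synthetic data (not the word vectors from this notebook): it takes $W$ from the top $d$ right singular vectors of the centered $X$ and sets $\hat W = W^T$, which is one standard way to obtain the PCA solution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: n=200 samples, m=5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # PCA assumes centered data

d = 2
# Rows of Vt are the principal directions, ordered by explained variance.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T      # (m, d) direct transformation
W_hat = Vt[:d]    # (d, m) reverse transformation

X_low = X @ W                     # (n, d) projection
X_rec = X_low @ W_hat             # (n, m) reconstruction
error = np.sum((X_rec - X) ** 2)  # squared reconstruction error

print(X_low.shape, error)
```

The reconstruction error equals the sum of the squared singular values that were dropped, so keeping the directions with the largest singular values minimizes it.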



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: apply PCA
pca = PCA(n_components=2)  # project to 2D
word_vectors_pca = pca.fit_transform(word_vectors)

# Step 2: standardize to zero mean and unit variance
scaler = StandardScaler()
word_vectors_pca = scaler.fit_transform(word_vectors_pca)

In [None]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice, but it's strictly linear and thus only able to capture the coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points close together, we can use t-SNE, which is itself an embedding method. You can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [None]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Step 1: apply t-SNE
tsne = TSNE()
word_tsne = tsne.fit_transform(word_vectors)

# Step 2: standardize the vectors
scaler = StandardScaler()
word_tsne = scaler.fit_transform(word_tsne)

In [None]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of the vectors for all tokens in the phrase, optionally with some weights.

This trick is useful for getting a feel for the data you're working with: check whether there are any outliers, clusters, or other artifacts.

Let's try this new hammer on our data!
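
As a small illustration of averaging "with some weights", the sketch below computes a weighted mean over toy vectors in plain Python. The helper `weighted_phrase_vector`, the toy vectors and the weights are all hypothetical; in practice the weights might come from something like inverse word frequency, which is an assumption and not something this notebook prescribes.

```python
def weighted_phrase_vector(tokens, vectors, weights=None, dim=3):
    """Weighted mean of the vectors of known tokens; zeros if none are known."""
    weights = weights or {}
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = weights.get(tok, 1.0)  # tokens without an explicit weight count as 1
        total = [t + w * x for t, x in zip(total, vectors[tok])]
        weight_sum += w
    if weight_sum == 0:
        return total  # nothing usable: return the zero vector
    return [t / weight_sum for t in total]


# Hypothetical toy vectors and weights, for illustration only.
toy_vectors = {"fast": [1.0, 0.0, 0.0], "food": [0.0, 1.0, 0.0], "the": [0.0, 0.0, 1.0]}
toy_weights = {"the": 0.0}  # e.g. ignore a very frequent word entirely

print(weighted_phrase_vector(["the", "fast", "food", "zzz"], toy_vectors, toy_weights))
```

With all weights equal to 1 this reduces to the plain average used in the exercise below.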


In [None]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    vector = np.zeros([model.vector_size], dtype='float32')

    # Step 1: lowercase
    phrase = phrase.lower()

    # Step 2: tokenize the phrase
    tokens = tokenizer.tokenize(phrase)

    # Step 3: collect the vectors of in-vocabulary tokens
    valid_vectors = []

    for token in tokens:
        # check that the word is in the model's vocabulary
        if token in model:
            valid_vectors.append(model.get_vector(token))

    # if no word was found, return the zero vector
    if not valid_vectors:
        return vector

    # Step 4: average over the collected vectors
    vector = np.mean(valid_vectors, axis=0)

    return vector

In [None]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.31807372, -0.02558171,  0.0933293 , -0.1002182 , -1.0278689 ,
                             -0.16621883,  0.05083408,  0.17989802,  1.3701859 ,  0.08655966],
                              dtype=np.float32))

In [None]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = np.array([get_phrase_embedding(item) for item in chosen_phrases])

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

# Apply t-SNE
phrase_vectors_2d = TSNE().fit_transform(phrase_vectors)
# Standardize the data to zero mean and unit variance
phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

Finally, let's build a simple "similar question" engine with the phrase embeddings we've built.

In [None]:
# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(l) for l in data])

In [None]:
from tqdm import tqdm

def cosine_similarity_manual(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.
    """
    # dot product of the vectors
    dot_product = np.dot(vec1, vec2)

    # vector lengths (L2 norms)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)

    # cosine similarity; treated as 0 if either vector is all zeros
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0
    else:
        return dot_product / (norm_vec1 * norm_vec2)


def find_nearest(query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    hint: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    """

    # Step 1: convert the query to a vector
    query_vector = np.array(get_phrase_embedding(query))

    # Step 2: compute cosine similarity between the query vector and every data vector
    similarity = [cosine_similarity_manual(query_vector, vec) for vec in tqdm(data_vectors)]
    print()

    # Step 3: pick the indices of the k most similar phrases
    # argsort returns indices in ascending order of similarity:
    # take the last k, then reverse them so the best match comes first
    top_k = np.argsort(similarity)[-k:][::-1]

    return [data[i] for i in top_k]
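
The docstring hint mentions `np.argpartition`. A vectorized variant of the same search might look like the sketch below, which computes all similarities with one matrix product and fully sorts only the top k. The helper name `top_k_cosine` and the toy matrix are hypothetical, and the function returns indices rather than lines of `data`.

```python
import numpy as np


def top_k_cosine(query_vector, matrix, k=10):
    """Return indices of the k rows of `matrix` most cosine-similar to `query_vector`, best first."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector)
    dots = matrix @ query_vector
    # Zero vectors get similarity 0 instead of a division-by-zero NaN.
    sims = np.divide(dots, norms, out=np.zeros_like(dots, dtype=float), where=norms > 0)
    k = min(k, len(sims))
    # argpartition selects the top k without sorting everything;
    # only those k candidates are then sorted.
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]


# Toy data standing in for data_vectors.
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
print(top_k_cosine(np.array([1.0, 0.2]), toy_matrix, k=3))
```

On a few hundred thousand vectors this avoids the Python-level loop entirely, at the cost of no progress bar.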

In [None]:
results = find_nearest(query="How do i enter the matrix?", k=10)

print(''.join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == 'How do I get to the dark web?\n'
assert results[3] == 'What can I do to save the world?\n'

100%|██████████| 537272/537272 [00:08<00:00, 63653.68it/s]



How do I get to the dark web?
What should I do to enter hollywood?
How do I use the Greenify app?
What can I do to save the world?
How do I win this?
How do I think out of the box? How do I learn to think out of the box?
How do I find the 5th dimension?
How do I use the pad in MMA?
How do I estimate the competition?
What do I do to enter the line of event management?



In [None]:
# find_nearest(query="How does Trump?", k=10)

print(''.join(find_nearest(query="How does Trump?", k=10)))

100%|██████████| 537272/537272 [00:09<00:00, 57575.12it/s]



What does Donald Trump think about Israel?
What books does Donald Trump like?
What does Donald Trump think of India?
What does India think of Donald Trump?
What does Donald Trump think of China?
What does Donald Trump think about Pakistan?
What companies does Donald Trump own?
What does Dushka Zapata think about Donald Trump?
How does it feel to date Ivanka Trump?
What does salesforce mean?



In [None]:
# find_nearest(query="Why don't i ask a question myself?", k=10)

print(''.join(find_nearest(query="Why don't i ask a question myself?", k=10)))

100%|██████████| 537272/537272 [00:12<00:00, 44620.73it/s]



Why don't I get a date?
Why do you always answer a question with a question? I don't, or do I?
Why can't I ask a question anonymously?
Why don't I get a girlfriend?
Why don't I have a boyfriend?
I don't have no question?
Why can't I take a joke?
Why don't I ever get a girl?
Can I ask a girl out that I don't know?
Why don't I have a girlfriend?

