## Bonus assignment: word2vec in PyTorch

As you may have noticed, the idea behind [word2vec](https://arxiv.org/pdf/1310.4546) is quite general. In this assignment you will implement it yourself.

Disclaimer: do not be surprised that the `gensim` implementation (or similar ones) trains faster and gives more accurate results. It relies on many refinements and speed-ups, as well as fairly efficient code. Your goal is to reach intermediate results in a reasonable amount of time.

P.S. Oddly enough, we will not need a GPU for this assignment.

__Requirements:__ if you're running locally, in the selected environment run the following command:

```pip install --upgrade nltk bokeh umap-learn```


In [None]:
#!pip install --upgrade nltk bokeh umap-learn


In [1]:
import os
import itertools
import random
import string
from collections import Counter
from itertools import chain

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import umap
from IPython.display import clear_output
from nltk.tokenize import WordPunctTokenizer
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR
from tqdm.auto import tqdm as tqdma



In [2]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt -nc
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

import urllib.request
if not os.path.exists('./quora.txt'):
    urllib.request.urlretrieve('https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1', './quora.txt')

'wget' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
data = list(open("./quora.txt", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

Tokenization is the first step.
The texts we work with contain punctuation, emoticons and other non-standard tokens, so a plain `str.split` will not do.

Let's turn to `nltk`, a library that is widely used in NLP.

In [4]:
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
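
To make the difference concrete, here is a small standard-library sketch that compares `str.split` with a regular expression. The pattern `\w+|[^\w\s]+` is assumed to mirror what `WordPunctTokenizer` does (runs of word characters, or runs of punctuation); the helper name `word_punct_tokenize` is made up for this illustration and is not part of `nltk`.

```python
import re

# Assumed equivalent of WordPunctTokenizer: word runs or punctuation runs
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return WORD_PUNCT_PATTERN.findall(text)


sample = "What TV shows or books help you read people's body language?\n"

# str.split keeps punctuation glued to the neighbouring word
print(sample.split())
# the regex separates the apostrophe and the question mark into their own tokens
print(word_punct_tokenize(sample))
```

With `str.split`, `language?` and `language` would end up as two different vocabulary entries, which is exactly what the tokenizer avoids.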


In [5]:
string.punctuation
str.maketrans("", "", string.punctuation)

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

In [6]:
# list of lines in tokens
data_tok = [
    tokenizer.tokenize(
        line.translate(str.maketrans("", "", string.punctuation)).lower()
    )
    for line in data
]
data_tok = [x for x in data_tok if len(x) >= 3]


In [7]:

print(f'{data_tok[:5]=}')

data_tok[:5]=[['can', 'i', 'get', 'back', 'with', 'my', 'ex', 'even', 'though', 'she', 'is', 'pregnant', 'with', 'another', 'guys', 'baby'], ['what', 'are', 'some', 'ways', 'to', 'overcome', 'a', 'fast', 'food', 'addiction'], ['who', 'were', 'the', 'great', 'chinese', 'soldiers', 'and', 'leaders', 'who', 'fought', 'in', 'ww2'], ['what', 'are', 'zip', 'codes', 'in', 'the', 'bay', 'area'], ['why', 'was', 'george', 'rr', 'martin', 'critical', 'of', 'jk', 'rowling', 'after', 'losing', 'the', 'hugo', 'award']]


A few sanity checks:

In [8]:
assert all(
    isinstance(row, (list, tuple)) for row in data_tok
), "please convert each line into a list of tokens (strings)"
assert all(
    all(isinstance(tok, str) for tok in row) for row in data_tok
), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all("a" <= x.lower() <= "z" for x in tok)
assert all(
    map(lambda l: not is_latin(l) or l.islower(), map(" ".join, data_tok))
), "please make sure to lowercase the data"

The cells below set the minimum word count and the context window radius, and run the preprocessing needed to build the skip-gram model.

In [9]:
min_count = 5
window_radius = 5

In [10]:
vocabulary_with_counter = Counter(chain.from_iterable(data_tok))
word_count_dict = dict()
for word, counter in vocabulary_with_counter.items():
    if counter >= min_count:
        word_count_dict[word] = counter

vocabulary = set(word_count_dict.keys())
del vocabulary_with_counter

In [11]:
word_to_index = {word: index for index, word in enumerate(vocabulary)}
index_to_word = {index: word for word, index in word_to_index.items()}
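
The vocabulary step above can be sketched on a toy corpus. Everything named `toy_*` is invented for illustration. One deliberate difference from the cell above: the sketch sorts the kept words before numbering them. Iterating over a `set` of strings can give a different order in a new interpreter session (string hashing is randomized by default), so sorting is one possible way to keep indices reproducible if you ever save embeddings and rebuild the mapping later.

```python
from collections import Counter
from itertools import chain

# Toy data, not taken from the Quora dataset
toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
toy_min_count = 2

# Count every token across all lines
toy_counts = Counter(chain.from_iterable(toy_corpus))

# Keep only words that occur at least toy_min_count times
toy_word_count_dict = {w: c for w, c in toy_counts.items() if c >= toy_min_count}

# Sorting gives a deterministic word -> index mapping
toy_word_to_index = {w: i for i, w in enumerate(sorted(toy_word_count_dict))}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

print(toy_word_count_dict)
print(toy_word_to_index)
```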

The `(word, context)` pairs are generated from the available dataset below.

In [12]:
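# List of (center word index, context word index) pairs
context_pairs = []

for text in data_tok:
    for i, central_word in enumerate(text):
        # NOTE: the right bound of range() is exclusive, so this window covers
        # window_radius words to the left but only window_radius - 1 to the right.
        # Use min(i + window_radius + 1, len(text)) for a symmetric window.
        context_indices = range(
            max(0, i - window_radius), min(i + window_radius, len(text))
        )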
        for j in context_indices:
            if j == i:
                continue
            context_word = text[j]
            if central_word in vocabulary and context_word in vocabulary:
                context_pairs.append(
                    (word_to_index[central_word], word_to_index[context_word])
                )

print(f"Generated {len(context_pairs)} pairs of target and context words.")
print(f"{context_pairs[:10]=}")

Generated 40220313 pairs of target and context words.
context_pairs[:10]=[(22347, 16517), (22347, 1995), (22347, 9791), (22347, 24510), (16517, 22347), (16517, 1995), (16517, 9791), (16517, 24510), (16517, 9001), (1995, 22347)]
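
For intuition, here is a minimal sketch of skip-gram pair generation on a toy sentence. The function name `skipgram_pairs` and the toy data are hypothetical. Unlike the cell above, whose exclusive right bound is `min(i + window_radius, len(text))`, this sketch uses `i + radius + 1` so that the window is symmetric; treat that as an assumption about the intended behaviour, not as the reference solution.

```python
def skipgram_pairs(tokens, radius):
    """Hypothetical helper: (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        left = max(0, i - radius)
        right = min(len(tokens), i + radius + 1)  # +1 because range() is exclusive
        for j in range(left, right):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


toy_sentence = ["what", "are", "zip", "codes"]
print(skipgram_pairs(toy_sentence, radius=1))
```

With `radius=1`, each inner word contributes two pairs and each edge word contributes one.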


In [13]:
print(context_pairs[:10])

[(22347, 16517), (22347, 1995), (22347, 9791), (22347, 24510), (16517, 22347), (16517, 1995), (16517, 9791), (16517, 24510), (16517, 9001), (1995, 22347)]


#### Subtask 1: subsampling
To smooth out the differences in word frequencies, you need to implement a subsampling mechanism by completing the function below.

The probability of **dropping** a word from training (at a given step) is computed as
$$
P_\text{drop}(w_i)=1 - \sqrt{\frac{t}{f(w_i)}},
$$
where $f(w_i)$ is the normalized frequency of the word and $t$ is a given threshold. The function returns the complementary probability of **keeping** the word, $\sqrt{t / f(w_i)}$, clipped to at most $1$.

In [14]:
from typing import Dict

def subsample_frequent_words(word_counts: Dict[str, int], t: float = 1e-5) -> Dict[str, float]:
    """
    Computes the probability of keeping each word under frequent-word subsampling.

    Args:
        word_counts (Dict[str, int]): Mapping {word: number of occurrences}.
        t (float): Frequency threshold.

    Returns:
        Dict[str, float]: Mapping {word: probability of keeping the word}.
    """
    # Total number of word occurrences
    total_count = sum(word_counts.values())

    # Keep probability for every word
    keep_probabilities = {}
    for word, count in word_counts.items():
        # Normalized frequency of the word
        freq = count / total_count
        # Keep probability: 1 - P_drop = sqrt(t / freq), clipped to at most 1
        keep_prob = min(1.0, (t / freq) ** 0.5)
        # keep_prob = min(1, (freq/t + 1)**0.5 * t/freq)
        # keep_probabilities[word] = round(keep_prob, 3)
        keep_probabilities[word] = keep_prob

    return keep_probabilities

# Usage example
word_counts = {'the': 5000, 'is': 1000, 'apple': 50}
print(subsample_frequent_words(word_counts))

{'the': 0.0034785054261852176, 'is': 0.007778174593052023, 'apple': 0.034785054261852175}
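
The relationship between the drop probability from the formula and the keep probability returned by the function can be checked with a few lines of arithmetic. The helper `drop_probability` below is a hypothetical name used only for this sketch; the frequencies are made-up values chosen to show a very frequent and a very rare word.

```python
def drop_probability(freq, t=1e-5):
    """Hypothetical helper: P_drop = 1 - sqrt(t / freq), floored at 0."""
    return max(0.0, 1.0 - (t / freq) ** 0.5)


frequent = drop_probability(1e-2)  # the word makes up 1% of all tokens
rare = drop_probability(1e-6)      # the word is rarer than the threshold t

print(f"frequent word: P_drop = {frequent:.4f}")
print(f"rare word:     P_drop = {rare:.4f}")
```

A word whose frequency is below `t` is never dropped, which is why the keep probability in the function is clipped with `min(1.0, ...)`.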


#### Subtask 2: negative sampling
For more efficient training, the model should not only predict high probabilities for words from the context but also low probabilities for words that did not occur in it. To do this, compute the probability of using a word as a negative sample by implementing the function below.

The original paper proposes drawing negative samples according to the distribution $P_n(w)$
$$
P_n(w) = \frac{U(w)^{3/4}}{Z},
$$

where $U(w)$ is the distribution of words by frequency (also known as the unigram distribution) and $Z$ is a normalizing constant that makes the total probability equal to $1$.

In [15]:
def get_negative_sampling_prob(word_count_dict: Dict[str, int]):
    """
    Calculates the negative sampling probabilities for words based on their frequencies.

    This function adjusts the frequency of each word raised to the power of 0.75, which is
    commonly used in algorithms like Word2Vec to moderate the influence of very frequent words.
    It then normalizes these adjusted frequencies to ensure they sum to 1, forming a probability
    distribution used for negative sampling.

    Parameters:
    - word_count_dict (dict): A dictionary where keys are words and values are the counts of those words.

    Returns:
    - dict: A dictionary where keys are words and values are the probabilities of selecting each word
            for negative sampling.

    Example:
    >>> word_counts = {'the': 5000, 'is': 1000, 'apple': 50}
    >>> get_negative_sampling_prob(word_counts)
    {'the': 0.751488398196177, 'is': 0.2247474520689081, 'apple': 0.023764149734914898}
    """

    neg_sample_probs = {word: count**(3/4) for word, count in word_count_dict.items() }
    # print(f'{neg_sample_probs=}')
    Z = sum(neg_sample_probs.values())
    neg_sample_probs = {word: freq/Z for word, freq in neg_sample_probs.items()}

    return neg_sample_probs

word_counts = {'the': 5000, 'is': 1000, 'apple': 50}

print(get_negative_sampling_prob(word_counts))
print(sum(get_negative_sampling_prob(word_counts).values()))

{'the': 0.751488398196177, 'is': 0.2247474520689081, 'apple': 0.023764149734914898}
1.0
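
To see what the $3/4$ power does, the sketch below compares the plain unigram distribution with the smoothed one on the same toy counts. All variable names are chosen for this illustration only.

```python
toy_counts = {"the": 5000, "is": 1000, "apple": 50}
total = sum(toy_counts.values())

# Plain unigram distribution U(w)
unigram = {w: c / total for w, c in toy_counts.items()}

# Smoothed distribution P_n(w) proportional to count^(3/4)
powered = {w: c ** 0.75 for w, c in toy_counts.items()}
z = sum(powered.values())
smoothed = {w: v / z for w, v in powered.items()}

for w in toy_counts:
    print(f"{w:>6}: unigram={unigram[w]:.4f}  smoothed={smoothed[w]:.4f}")
```

The frequent word loses some probability mass and the rare word gains some, so rare words are drawn as negatives more often than their raw frequency would suggest.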


In [16]:
keep_prob_dict = subsample_frequent_words(word_count_dict)
assert keep_prob_dict.keys() == word_count_dict.keys()

In [17]:
negative_sampling_prob_dict = get_negative_sampling_prob(word_count_dict)
assert negative_sampling_prob_dict.keys() == word_count_dict.keys()
assert np.allclose(sum(negative_sampling_prob_dict.values()), 1)

For convenience, let's convert the resulting dictionaries into arrays (all words are already numbered anyway).

In [18]:
keep_prob_array = np.array(
    [keep_prob_dict[index_to_word[idx]] for idx in range(len(word_to_index))]
)
negative_sampling_prob_array = np.array(
    [
        negative_sampling_prob_dict[index_to_word[idx]]
        for idx in range(len(word_to_index))
    ]
)

If everything went well, the function below will help you generate batches.

In [19]:
from typing import List


def generate_batch_with_neg_samples(
    context_pairs: List[tuple[int, int]],
    batch_size: int,
    keep_prob_array: np.ndarray,
    word_to_index: Dict[str, int],
    num_negatives: int,
    negative_sampling_prob_array: np.ndarray,
):
    """Generates a batch of context pairs by sampling randomly from context_pairs.
    For each pair in the batch, draws num_negatives negative words (word indices).

    Args:
        context_pairs (List[tuple[int, int]]): (center index, context index) pairs.
        batch_size (int): number of pairs in the batch.
        keep_prob_array (np.ndarray): probability of keeping each word, indexed by word index.
        word_to_index (Dict[str, int]): mapping from word to index (not used in the body).
        num_negatives (int): number of negative samples per pair.
        negative_sampling_prob_array (np.ndarray): negative-sampling probability of each word, indexed by word index.

    Returns:
        tuple[np.ndarray, np.ndarray]: the batch of shape (batch_size, 2) and the
        negative samples of shape (batch_size, num_negatives).
    """
    batch = []
    neg_samples = []

    while len(batch) < batch_size:
        center, context = random.choice(context_pairs)
        if random.random() < keep_prob_array[center]:
            batch.append((center, context))
            neg_sample = np.random.choice(
                len(negative_sampling_prob_array),  # an int n means "sample from np.arange(n)"
                size=num_negatives,
                p=negative_sampling_prob_array,
            )
            neg_samples.append(neg_sample)
    batch = np.array(batch)
    neg_samples = np.vstack(neg_samples)
    return batch, neg_samples

In [20]:
batch_size = 4
num_negatives = 15
batch, neg_samples = generate_batch_with_neg_samples(
    context_pairs,
    batch_size,
    keep_prob_array,
    word_to_index,
    num_negatives,
    negative_sampling_prob_array,
)

In [21]:
print(f'{batch=}')
print(f'{neg_samples=}')

batch=array([[ 7757, 18989],
       [ 4856, 22846],
       [25345, 22497],
       [12509, 22849]])
neg_samples=array([[25727, 11273, 14447, 27423,  6615, 26594, 10131, 28022, 12265,
         1172, 18079, 16191, 19917, 25727, 27503],
       [15978,  9463, 24715, 16405, 11322,  4526, 21118,  8911, 15929,
         6323,  8430,  5031,  2977,  6495, 26703],
       [ 8191, 12537, 16776, 26722, 13346,  8788,  1207, 22849, 15466,
        15891,  6270, 15504,  6404, 16517, 18066],
       [26810, 19259,  2987,  5278, 22690,  3690, 25613,  1720, 25759,
        23129, 12924, 17290, 13633, 11257,  6509]])


Finally, it is time to implement the model. Note that using linear layers (`nn.Linear`) is far from always justified!

Recall that with negative sampling we maximize the following objective:

$$
\mathcal{L} = \log \sigma({\mathbf{v}'_{w_O}}^\top \mathbf{v}_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma({-\mathbf{v}'_{w_i}}^\top \mathbf{v}_{w_I}) \right],
$$

where:
- $\mathbf{v}_{w_I}$ is the vector of the center word $w_I$,
- $\mathbf{v}'_{w_O}$ is the vector of the context word $w_O$,
- $k$ is the number of negative samples,
- $P_n(w)$ is the negative-sampling distribution defined above,
- $\sigma$ is the sigmoid function.
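
The objective can be evaluated by hand for a single training example. The sketch below uses only the standard library and plain Python lists as vectors; the function names and the toy vectors are assumptions made for illustration, and the expectation over $P_n(w)$ is replaced by a sum over negatives that have already been drawn.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_objective(v_center, v_context, v_negatives):
    """Hypothetical helper: objective for one (center, context) pair."""
    positive_term = log_sigmoid(dot(v_context, v_center))
    # Note the minus sign: negatives are rewarded for a LOW dot product
    negative_term = sum(log_sigmoid(-dot(v_neg, v_center)) for v_neg in v_negatives)
    return positive_term + negative_term


center = [1.0, 0.0]
good = neg_sampling_objective(center, [2.0, 0.0], [[-2.0, 0.0]])
bad = neg_sampling_objective(center, [-2.0, 0.0], [[2.0, 0.0]])
print(good, bad)
```

The objective is always non-positive. It approaches zero when the context vector aligns with the center vector and the negative vectors point away from it. In terms of `nn.BCEWithLogitsLoss`, minimizing $-\log\sigma(x)$ corresponds to logit $x$ with target $1$, and minimizing $-\log\sigma(-x)$ corresponds to logit $x$ with target $0$.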

In [35]:
from torch import Tensor

class SkipGramModelWithNegSampling(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        # Two separate embedding tables: one for center words, one for context words
        self.center_embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.context_embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    def forward(self, center_words: Tensor, pos_context_words: Tensor, neg_context_words: Tensor):
        # YOUR CODE HERE
        # (batch_size, emb_dim)
        center_embs = self.center_embeddings(center_words)
        # (batch_size, emb_dim)
        pos_embs = self.context_embeddings(pos_context_words)
        # (batch_size, num_neg, emb_dim)
        neg_embs = self.context_embeddings(neg_context_words)

        # (batch_size, 1, 1)
        pos_scores = torch.bmm(center_embs.unsqueeze(1), pos_embs.unsqueeze(2))
        pos_scores = pos_scores.squeeze((1,2))

        # (batch_size, 1, num_neg)
        neg_scores = -torch.bmm(center_embs.unsqueeze(1), neg_embs.transpose(1,2))
        neg_scores = neg_scores.squeeze(1)

        return pos_scores, neg_scores

In [36]:
device = torch.device("cpu")

In [37]:
vocab_size = len(word_to_index)
embedding_dim = 32
num_negatives = 15

model = SkipGramModelWithNegSampling(vocab_size, embedding_dim).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.05)
lr_scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=150)
criterion = nn.BCEWithLogitsLoss()

In [38]:
params_counter = 0
for weights in model.parameters():
    params_counter += weights.shape.numel()
assert params_counter == len(word_to_index) * embedding_dim * 2

In [None]:
def train_skipgram_with_neg_sampling(
    model,
    context_pairs,
    keep_prob_array,
    word_to_index,
    batch_size,
    num_negatives,
    negative_sampling_prob_array,
    steps,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    device=device,
):
    pos_labels = torch.ones(batch_size).to(device)
    neg_labels = torch.zeros(batch_size, num_negatives).to(device)
    loss_history = []
    for step in tqdma(range(steps)):
        batch, neg_samples = generate_batch_with_neg_samples(
            context_pairs,
            batch_size,
            keep_prob_array,
            word_to_index,
            num_negatives,
            negative_sampling_prob_array,
        )
        center_words = torch.tensor([pair[0] for pair in batch], dtype=torch.long).to(
            device
        )
        pos_context_words = torch.tensor(
            [pair[1] for pair in batch], dtype=torch.long
        ).to(device)
        neg_context_words = torch.tensor(neg_samples, dtype=torch.long).to(device)

        optimizer.zero_grad()
        pos_scores, neg_scores = model(
            center_words, pos_context_words, neg_context_words
        )

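        loss_pos = criterion(pos_scores, pos_labels)
        # NOTE: forward() already negates neg_scores, so with all-zero targets this
        # term equals -log(sigmoid(+v'_neg . v_center)), the opposite sign of the
        # negative term in the objective above. To match the objective, either drop
        # the negation in forward() or use all-ones targets here.
        # The logged run below was produced with the code as written.
        loss_neg = criterion(neg_scores, neg_labels)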

        loss = loss_pos + loss_neg
        loss.backward()
        optimizer.step()

        loss_history.append(loss.item())
        lr_scheduler.step(loss_history[-1])

        if step % 100 == 0:
            print(
                f"Step {step}, Loss: {np.mean(loss_history[-100:])}, learning rate: {lr_scheduler._last_lr}\n",
                # f"{pos_scores=}"
            )

    # Return only after all steps have run (not from inside the loop)
    return loss_history


In [None]:
steps = 2500
batch_size = 512
loss_history = train_skipgram_with_neg_sampling(
    model,
    context_pairs,
    keep_prob_array,
    word_to_index,
    batch_size,
    num_negatives,
    negative_sampling_prob_array,
    steps,
)

Step 0, Loss: 1.5857232809066772, learning rate: [0.05]
Step 100, Loss: 1.0355812364816666, learning rate: [0.05]
Step 200, Loss: 0.41092193230986596, learning rate: [0.05]
Step 300, Loss: 0.21098363049328328, learning rate: [0.05]
Step 400, Loss: 0.13311262771487237, learning rate: [0.05]
Step 500, Loss: 0.08764302264899015, learning rate: [0.05]
Step 600, Loss: 0.06620079077780247, learning rate: [0.05]
Step 700, Loss: 0.046445772387087345, learning rate: [0.05]
Step 800, Loss: 0.03721361530944705, learning rate: [0.05]
Step 900, Loss: 0.026280565350316466, learning rate: [0.05]
Step 1000, Loss: 0.020522646834142507, learning rate: [0.05]
Step 1100, Loss: 0.015316692893393338, learning rate: [0.05]
Step 1200, Loss: 0.013685970166698098, learning rate: [0.05]
Step 1300, Loss: 0.011162807957734912, learning rate: [0.05]
Step 1400, Loss: 0.007387274271459319, learning rate: [0.05]
Step 1500, Loss: 0.008582251453190111, learning rate: [0.05]
Step 1600, Loss: 0.004864924727589823, learning rate: [0.05]
Step 1700, Loss: 0.005282905707717873, learning rate: [0.05]
Step 1800, Loss: 0.0040356121875811364, learning rate: [0.05]
Step 1900, Loss: 0.0034622349268465767, learning rate: [0.05]
Step 2000, Loss: 0.003191858880163636, learning rate: [0.05]
Step 2100, Loss: 0.0013325188515591435, learning rate: [0.025]
Step 2200, Loss: 0.001706591301044682, learning rate: [0.025]
Step 2300, Loss: 0.0019780720891139935, learning rate: [0.025]
Step 2400, Loss: 0.0014341690034780185, learning rate: [0.025]

100%|██████████| 2500/2500 [57:14<00:00,  1.37s/it]


Finally, use the learned weight matrix as the matrix of word vector representations. For submission, we recommend using the matrix that corresponds to context words (i.e., the decoder).

In [45]:
# Access the embedding tables by name instead of relying on parameter order
embedding_matrix_center = model.center_embeddings.weight.detach()
embedding_matrix_context = model.context_embeddings.weight.detach()

In [46]:
print(f'{embedding_matrix_center[:10]=}')
print(f'{embedding_matrix_context[:10]=}')

embedding_matrix_center[:10]=tensor([[-3.1086e-01, -1.7479e+00, -4.2378e-01, -1.1796e+00,  3.5395e-01,
         -1.1635e+00, -6.5453e-01, -1.7161e+00,  2.5304e+00, -6.7543e-02,
         -9.6816e-02, -2.7887e+00,  9.2194e-03, -1.8614e+00,  3.6164e-01,
         -9.2888e-03,  1.2507e+00, -1.4830e+00,  1.2328e-01, -2.4812e+00,
          7.2921e-02,  2.3331e+00, -1.2697e+00,  2.7235e-01,  1.4324e+00,
          4.3181e-01, -3.0064e+00, -1.6165e+00,  9.2025e-01, -2.7863e-01,
          1.9407e+00,  5.0848e-01],
        [ 1.4224e+00,  3.3928e-01, -1.2822e+00, -1.5142e+00,  8.6194e-02,
         -1.2486e+00,  1.3501e-01,  8.9190e-01,  8.2051e-01, -4.9097e-01,
          1.3755e-01, -2.4599e-03, -1.3573e-01, -4.6175e-01,  3.3414e-01,
          5.7097e-01,  7.1084e-01,  3.1288e-02,  2.3180e-02,  8.2961e-02,
         -6.5555e-01,  1.7680e+00,  2.3823e-01, -1.8984e-01, -1.2187e-01,
          1.2758e-01, -7.7532e-01, -8.8085e-01,  7.5672e-01, -5.8812e-01,
          2.1456e+00, -2.0109e-01],
        [ 1

In [47]:
def get_word_vector(word, embedding_matrix, word_to_index=word_to_index):
    return embedding_matrix[word_to_index[word]]

Simple checks:

In [48]:
similarity_1 = F.cosine_similarity(
    get_word_vector("iphone", embedding_matrix_context)[None, :],
    get_word_vector("apple", embedding_matrix_context)[None, :],
)
similarity_2 = F.cosine_similarity(
    get_word_vector("iphone", embedding_matrix_context)[None, :],
    get_word_vector("dell", embedding_matrix_context)[None, :],
)

print(f'{similarity_1=}, {similarity_2=}')
assert similarity_1 > similarity_2

similarity_1=tensor([0.6156]), similarity_2=tensor([0.5301])


In [50]:
similarity_1 = F.cosine_similarity(
    get_word_vector("windows", embedding_matrix_context)[None, :],
    get_word_vector("laptop", embedding_matrix_context)[None, :],
)
similarity_2 = F.cosine_similarity(
    get_word_vector("windows", embedding_matrix_context)[None, :],
    get_word_vector("macbook", embedding_matrix_context)[None, :],
)
print(f'{similarity_1=}, {similarity_2=}')
assert similarity_1 > similarity_2

similarity_1=tensor([0.6512]), similarity_2=tensor([0.5122])


Finally, let's look at the nearest words by cosine similarity. The function is implemented below.

In [51]:
def find_nearest(word, embedding_matrix, word_to_index=word_to_index, k=10):
    word_vector = get_word_vector(word, embedding_matrix)[None, :]
    dists = F.cosine_similarity(embedding_matrix, word_vector)
    index_sorted = torch.argsort(dists)
    top_k = index_sorted[-k:]
    return [(index_to_word[x], dists[x].item()) for x in top_k.numpy()]

In [52]:
find_nearest("python", embedding_matrix_context, k=10)

[('tails', 0.7906049489974976),
 ('protect', 0.790946364402771),
 ('dan', 0.7925320863723755),
 ('constructing', 0.7944867610931396),
 ('world', 0.7951883673667908),
 ('scans', 0.79781174659729),
 ('job', 0.8161370754241943),
 ('apps', 0.8236046433448792),
 ('ncis', 0.8312482833862305),
 ('python', 0.9999999403953552)]
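
The same nearest-neighbour idea can be sketched without PyTorch. The vectors below are invented toy values and the helper names `cosine` and `nearest` are hypothetical. Unlike `find_nearest` above, this sketch excludes the query word itself and returns the most similar words first.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Hypothetical helper: the k most similar words, best first, query excluded."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec)) for other, vec in vectors.items() if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional embeddings
toy_vectors = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
}

print(nearest("cat", toy_vectors))
```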

You can also visually check how frequently occurring words are represented in the latent space.

In [58]:
top_k = 5000
_top_words = sorted([x for x in word_count_dict.items()], key=lambda x: x[1])[
    -top_k - 100 : -100
]  # ignoring 100 most frequent words
top_words = [x[0] for x in _top_words]
del _top_words

print(f'{top_words[:10]=}')

top_words[:10]=['verified', '2gb', 'illinois', 'jazz', 'checking', 'telescope', 'seasons', 'astrologer', 'contribution', 'homemade']


In [59]:
word_embeddings = torch.cat(
    [embedding_matrix_context[word_to_index[x]][None, :] for x in top_words], dim=0
).numpy()

print(f'{word_embeddings=}')

word_embeddings=array([[ 1.4705968 , -1.1918211 ,  0.59138453, ..., -0.20900188,
         1.7666626 ,  0.32144344],
       [ 2.749911  , -2.481285  , -2.0097427 , ..., -0.08414183,
         0.27209884,  0.6698555 ],
       [ 1.5476143 , -0.10415564,  0.48510742, ..., -0.9274145 ,
        -0.12114553,  0.54225826],
       ...,
       [ 0.67550755, -0.13705188, -0.5143061 , ..., -0.72724795,
         0.8612631 ,  0.27102184],
       [ 1.5657055 , -1.1128488 , -1.1836333 , ..., -0.04446257,
         1.0684863 , -0.287945  ],
       [ 0.95971817, -0.29702878, -0.90587837, ..., -1.1176012 ,
         0.6878501 ,  1.0904022 ]], dtype=float32)


In [55]:
import bokeh.models as bm
import bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()


def draw_vectors(
    x,
    y,
    radius=10,
    alpha=0.25,
    color="blue",
    width=600,
    height=400,
    show=True,
    **kwargs,
):
    """draws an interactive plot for data points with auxiliary info on hover"""
    if isinstance(color, str):
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({"x": x, "y": y, "color": color, **kwargs})

    fig = pl.figure(active_scroll="wheel_zoom", width=width, height=height)
    fig.scatter("x", "y", size=radius, color="color", alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show:
        pl.show(fig)
    return fig

In [56]:
embedding = umap.UMAP(n_neighbors=5).fit_transform(word_embeddings)

In [57]:
draw_vectors(embedding[:, 0], embedding[:, 1], token=top_words)

To submit the assignment, upload the functions `subsample_frequent_words` and `get_negative_sampling_prob`, then generate the submission file below and attach it to the corresponding task. Good luck!

In [60]:
# do not change the code in the block below
# __________start of block__________
import os
import json

assert os.path.exists(
    "words_subset.txt"
), "Please, download `words_subset.txt` and place it in the working directory"

with open("words_subset.txt") as iofile:
    selected_words = iofile.read().split("\n")


def get_matrix_for_selected_words(selected_words, embedding_matrix, word_to_index):
    word_vectors = []
    for word in selected_words:
        index = word_to_index.get(word, None)
        vector = [0.0] * embedding_matrix.shape[1]
        if index is not None:
            vector = embedding_matrix[index].numpy().tolist()
        word_vectors.append(vector)
    return word_vectors


word_vectors = get_matrix_for_selected_words(
    selected_words, embedding_matrix_context, word_to_index
)

with open("submission_dict.json", "w") as iofile:
    json.dump(word_vectors, iofile)
print("File saved to `submission_dict.json`")
# __________end of block__________

File saved to `submission_dict.json`
