# Embeddings

## Word2Vec

The vector models we considered earlier (TF-IDF, BoW) are conventionally called *count-based* models. In one way or another they count words and their neighbors, and build word vectors from those counts.

Another class of models, more common today, is called *predictive* (or *neural*) models. The idea is to use neural networks that "predict" (rather than count) the neighbors of words. One of the best-known models of this kind is Word2Vec. It is based on a neural network that predicts the probability of finding a word in a given context. The tool was developed by a group of Google researchers in 2013; the project was led by Tomas Mikolov (who later worked at Facebook). Here are the two most important papers:

* [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
* [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)


The vectors obtained in this way are called *distributed representations of words*, or **embeddings**.


### How is it trained?
We assign each word a vector using a matrix $W$ and a context vector using a matrix $w$. In fact, Word2Vec is an umbrella name for two architectures: Skip-Gram and Continuous Bag-of-Words (CBOW).

**CBOW** predicts the current word from the context surrounding it.

**Skip-Gram**, on the contrary, uses the current word to predict the words surrounding it.
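The difference between the two architectures is easiest to see in the training examples they consume. The sketch below is only an illustration: the function names `skipgram_pairs` and `cbow_examples` and the symmetric `window` parameter are assumptions made for this example, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window=2):
    """Yield (context_words, center) examples: the neighbors predict the center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center


tokens = "the cat sat on the mat".split()
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_examples(tokens, window=1))
print(sg[:3])  # pairs produced by the first two positions
print(cb[:2])  # contexts for the first two positions
```

Skip-Gram produces one example per (word, neighbor) pair, while CBOW produces one example per position with all neighbors grouped together.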

### How does it work?
Word2Vec takes a large text corpus as input and maps each word to a vector, producing word coordinates as output. First it builds a vocabulary by "learning" from the input text, and then it computes the vector representations of the words. The representation is based on contextual proximity: words that occur in the text next to the same words (and therefore, according to the distributional hypothesis, have similar meanings) get vectors with close coordinates. The closeness of two words is measured by the cosine similarity of their vectors (cosine distance is one minus this value).
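As a minimal sketch of how closeness between word vectors is measured, here is cosine similarity in plain Python. The three-dimensional vectors are made up for the illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for the example
day = [0.9, 0.1, 0.3]
night = [0.8, 0.2, 0.4]
bitcoin = [-0.2, 0.9, -0.5]

print(cosine_similarity(day, night))    # close to 1: similar direction
print(cosine_similarity(day, bitcoin))  # negative: very different direction
```

The value ranges from -1 to 1 and ignores vector length, so only the direction of the vectors matters.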


With distributional vector models you can build semantic proportions (also known as analogies) and solve examples:

* *king : man = queen : woman*
$\Rightarrow$
* *king - man + woman = queen*

![w2v](https://cdn-images-1.medium.com/max/2600/1*sXNXYfAqfLUeiDXPCo130w.png)
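The arithmetic behind such analogies can be sketched with a handful of hand-made two-dimensional vectors. Everything here is an assumption for illustration: the vectors are invented, and `analogy` is a simplified stand-in for what a real library does (it adds and subtracts raw vectors and excludes the query words from the candidates).

```python
import math

# Invented toy vectors: first axis ~ "royalty", second axis ~ "gender"
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(positive=["king", "woman"], negative=["man"], vectors=vectors))
```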

### Problems
It is impossible to determine the type of semantic relation between words: synonyms, antonyms, and so on will be equally close, because they usually occur in similar contexts. Words that are close in the vector space are therefore called *semantic associates*. This means they are semantically related, but exactly how is unclear.


### RusVectōrēs


The [RusVectōrēs](https://rusvectores.org) website collects models pre-trained on various data for Russian. There you can also look up the nearest neighbors of a given word, compute the semantic similarity of several words, and solve analogy examples with the "semantic calculator".


Pre-trained models are available for other languages as well, for example the XX_MarkDown_Link_XX and XX_MarkDown_Link_XX models (more on them below).

### Visualization
And [here](https://projector.tensorflow.org/) is a good visualization for English.

## Gensim

You can use a pre-trained embedding model or train your own with the `gensim` library. Here is [its documentation](https://radimrehurek.com/gensim/models/word2vec.html).

### How to use a pre-trained model

Word2Vec models come in different formats:

* .vec.gz - plain text file
* .bin.gz - binary file

Both are loaded with the same class, `KeyedVectors`; only the `binary` parameter of the `load_word2vec_format` function changes.

If the embeddings were **not** trained with Word2Vec, use the `load` function instead (it reads models saved in Gensim's native format). That is what you need for pre-trained *GloVe, fastText, BPE* and other embeddings.
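To make the difference between the formats more concrete, here is a rough sketch of reading the plain-text variant by hand. It assumes the usual layout of a word2vec text file: a header line with the vocabulary size and the vector dimensionality, followed by one line per word. The function name `read_vec` and the tiny in-memory sample are made up for the illustration; in practice `load_word2vec_format` does this work.

```python
import io


def read_vec(stream):
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 ... v_dim' lines."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} words, found {len(vectors)}")
    return vectors


# A hypothetical two-word file with three-dimensional vectors
sample = io.StringIO("2 3\nдень_NOUN 0.1 0.2 0.3\nночь_NOUN 0.4 0.5 0.6\n")
vectors = read_vec(sample)
print(vectors["ночь_NOUN"])
```

The binary variant carries the same information but stores the numbers as raw bytes, which makes files smaller and faster to load.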

Let's download from RusVectōrēs a model for Russian trained on the 2015 version of the Russian National Corpus (RNC).

In [1]:
!pip install pymorphy2
! git clone https://github.com/facebookresearch/fastText.git
! pip3 install fastText/.




In [2]:
! wget https://rusvectores.org/static/models/rusvectores4/unigrams/ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018.vec.gz
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/train/unlabeledTrainData.tsv
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/train/alice.txt
! wget https://raw.githubusercontent.com/ancatmara/data-science-nlp/master/data/w2v/evaluation/ru_analogy_tagged.txt
! wget -O positive.csv https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0
! wget -O negative.csv https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv?dl=0


In [22]:
!pip install pymorphy3

from pymorphy3 import MorphAnalyzer



In [35]:
import re
import gensim
import logging
import nltk.data
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models import word2vec
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from pymorphy3 import MorphAnalyzer  # pymorphy3 is the package installed above
from gensim.test.utils import datapath
nltk.download('punkt')

from nltk import FreqDist
from tqdm.notebook import tqdm
from sklearn.manifold import TSNE

from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook

from sklearn.decomposition import TruncatedSVD
import fasttext

from functools import lru_cache
from multiprocessing import Pool
import numpy as np
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zhanibek\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
model_path = r'C:\AI\NLP\PC_4\ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018.vec'

model_ru = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

Let's take a few words as examples:

In [9]:
words = ['день_NOUN', 'ночь_NOUN', 'человек_NOUN', 'семантика_NOUN', 'биткоин_NOUN']

The part-of-speech tags are needed because of how the downloaded model was built: it was trained on words annotated with their parts of speech (and lemmatized). **NB!** Model names on `rusvectores` indicate which tagset they use (Mystem, UPoS, etc.)

Let's ask the model for the 10 nearest neighbors of each word and the cosine similarity to each of them:

In [10]:
for word in words:
    # is the word in the model?
    if word in model_ru:
        print(word)
        # look at the word vector (it has 300 dimensions; show the first 10 values)
        print(model_ru[word][:10])
        # print the 10 nearest neighbors of the word
        for neighbor, sim in model_ru.most_similar(positive=[word], topn=10):
            # neighbor + cosine similarity
            print(neighbor, ': ', sim)
        print('\n')
    else:
        # Alas!
        print('Alas, the word "%s" is not in the model!' % word)

день_NOUN
[ 0.117177  0.008562 -0.054731  0.03821   0.006885  0.041716  0.063708
  0.070478  0.032087  0.050791]
неделя_NOUN :  0.7242119312286377
месяц_NOUN :  0.7178639769554138
утро_NOUN :  0.6738513708114624
вечер_NOUN :  0.6443345546722412
воскресенье_NOUN :  0.6362560391426086
час_NOUN :  0.632983386516571
накануне_ADV :  0.6304810047149658
днями_NOUN :  0.6276212930679321
днемя_NOUN :  0.621060848236084
ночь_NOUN :  0.6077756881713867


ночь_NOUN
[ 0.070529 -0.068594  0.029781  0.035559 -0.01488   0.072418 -0.01183
  0.051797 -0.024269  0.034406]
ночь_PROPN :  0.788449227809906
вечер_NOUN :  0.7778281569480896
утро_NOUN :  0.7638111710548401
полночь_NOUN :  0.7437741160392761
рассвет_NOUN :  0.6889956593513489
полдень_NOUN :  0.6811894178390503
утро_PROPN :  0.6788180470466614
сумерки_NOUN :  0.6461666226387024
напроать_NOUN :  0.6451945900917053
напролет_VERB :  0.6393611431121826


человек_NOUN
[ 0.022094 -0.077399  0.038363 -0.051602  0.000347  0.073115 -0.068763
 -0.037081 -

Let's find the cosine similarity of a pair of words:

In [11]:
print(model_ru.similarity('nvidia_PROPN', 'видеокарта_NOUN'))

0.78474295


What happens if you subtract Italy from pizza and add Siberia?

* `positive` - vectors that we add
* `negative` - vectors that we subtract

In [12]:
print(model_ru.most_similar(positive=['татарин_NOUN', 'казахстан_NOUN'], negative=['казах_NOUN'])[0][0])

татарстан_NOUN


In [13]:
model_ru.doesnt_match('бешбармак_NOUN плов_NOUN манты_NOUN'.split())

'манты_NOUN'

**Warm-up exercises**

Find an example of a polysemous word whose top 10 most similar words (method `most_similar`) include words related to different senses:

In [14]:
model_ru.most_similar(positive=['ключ_NOUN'], topn=10)

[('ключ_ADJ', 0.6990529298782349),
 ('ключом_NOUN', 0.6460299491882324),
 ('ключ_PROPN', 0.6185873746871948),
 ('расшифровывание_NOUN', 0.5978683829307556),
 ('ключ_VERB', 0.5960569381713867),
 ('криптосистем_NOUN', 0.5936962962150574),
 ('отмычка_NOUN', 0.5862348675727844),
 ('отмычкий_NOUN', 0.577734649181366),
 ('зашифровывать_VERB', 0.5704914927482605),
 ('диффи-хеллман_PROPN', 0.5641186237335205)]

In [15]:
model_ru.most_similar(positive=['дверь_NOUN'], topn=10)

[('дверца_NOUN', 0.7919579744338989),
 ('дверка_NOUN', 0.783178448677063),
 ('отвориться_VERB', 0.7785481810569763),
 ('калитка_NOUN', 0.7784631252288818),
 ('дверной_ADJ', 0.7726610898971558),
 ('засов_NOUN', 0.756593644618988),
 ('настежь_ADV', 0.7547725439071655),
 ('приотвориться_VERB', 0.7538770437240601),
 ('прихожая_NOUN', 0.750787615776062),
 ('отворить_VERB', 0.7491068243980408)]

By analogy with Italy - pizza, Siberia - cracker, come up with a similar group of words to test:

In [16]:
model_ru.most_similar(positive=['италия_NOUN'], negative=['пицца_NOUN'])

[('польша_NOUN', 0.3561699390411377),
 ('франция_NOUN', 0.3416089713573456),
 ('дплс_PROPN', 0.2917875051498413),
 ('албания_NOUN', 0.2896476089954376),
 ('норвегия_NOUN', 0.27540141344070435),
 ('португалия_NOUN', 0.2679791748523712),
 ('канада_NOUN', 0.26344889402389526),
 ('каталонии_NOUN', 0.2630119025707245),
 ('италию_NOUN', 0.2511085271835327),
 ('греция_NOUN', 0.24830304086208344)]

In [17]:
model_ru.most_similar(positive=['король_NOUN', 'женщина_NOUN'], negative=['мужчина_NOUN'])

[('королева_NOUN', 0.7274051308631897),
 ('королева_ADV', 0.6973327398300171),
 ('королева_ADJ', 0.6478139162063599),
 ('король_PROPN', 0.6415081024169922),
 ('короля_NOUN', 0.6195558905601501),
 ('королева-консорт_NOUN', 0.6013587117195129),
 ('людовик_PROPN', 0.5912467837333679),
 ('-консорт_NOUN', 0.5904234647750854),
 ('королевство_NOUN', 0.5836767554283142),
 ('монарх_NOUN', 0.5797033309936523)]

Give an example of three words w1, w2, w3 such that w1 and w2 are synonyms, w1 and w3 are antonyms, and yet similarity(w1, w2) < similarity(w1, w3).

In [18]:
model_ru.most_similar(positive=['идти_VERB'], topn=10)

[('идти_NOUN', 0.7714467644691467),
 ('пойти_VERB', 0.7684581875801086),
 ('идти_NUM', 0.6786747574806213),
 ('ихать_VERB', 0.658195972442627),
 ('идти_ADJ', 0.6211367249488831),
 ('повехать_VERB', 0.6183276772499084),
 ('брести_VERB', 0.612551748752594),
 ('подвигаться_VERB', 0.6117923855781555),
 ('плетиваться_VERB', 0.602618932723999),
 ('ида_VERB', 0.6008363962173462)]

### Exercise

Write a function that takes a sentence as input and replaces a random noun in it with its associate, i.e. the closest word to it in the Word2Vec model.

NB: you will need a morphological analyzer for this. I recommend pymorphy (we briefly discussed it at the last seminar).

How to use pymorphy:

In [23]:
analyser = MorphAnalyzer()

In [24]:
# parse a word (two analyses are possible here, so we get a list of two elements)
result = analyser.parse('слово')
result

[Parse(word='слово', tag=OpencorporaTag('NOUN,inan,neut sing,nomn'), normal_form='слово', score=0.59813, methods_stack=((DictionaryAnalyzer(), 'слово', 54, 0),)),
 Parse(word='слово', tag=OpencorporaTag('NOUN,inan,neut sing,accs'), normal_form='слово', score=0.401869, methods_stack=((DictionaryAnalyzer(), 'слово', 54, 3),))]

In [25]:
# get the part of speech
result[0].tag.POS

'NOUN'

In [26]:
# inflect into the dative case
result[0].inflect(frozenset(['datv'])).word

'слову'

Your function (for simplicity, you may skip putting the word into the "required" form and stick to the nominative case):

In [27]:
model_ru.most_similar(positive=['купить_VERB'], topn=1)

[('покупать_VERB', 0.8457080125808716)]

In [28]:
sent = 'маленький_ADJ человек_NOUN увидеть_VERB обезьяна_NOUN италия_NOUN'.split()
new_sent = ''
for word in sent:
    # word = word.lower()
    result = analyser.parse(word)[0].tag.POS
    new_word = f'{word}_{result}'
    top1 = model_ru.most_similar(positive=[word], topn=1)[0][0].split('_')[0]
    new_sent += top1 + ' '
new_sent

'крошечный человеколо видеть обезьяна франция '

In [29]:
def change_random_noun(sentence):
    # your code
    pass
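One possible shape for this function is sketched below; it is not the reference solution. To keep the example runnable without the model, the part-of-speech lookup and the nearest-neighbor lookup are passed in as the hypothetical parameters `pos_of` and `nearest`, and the two small dictionaries are invented stand-ins for pymorphy and the Word2Vec model. The sentence is assumed to be already lemmatized.

```python
import random


def change_random_noun(sentence, pos_of, nearest, rng=random):
    """Replace one randomly chosen noun with its nearest neighbor.

    pos_of(word) returns a part-of-speech tag such as 'NOUN';
    nearest(tagged_word) returns the closest tagged word, or None if unknown.
    """
    words = sentence.split()
    noun_positions = [i for i, w in enumerate(words) if pos_of(w) == "NOUN"]
    if not noun_positions:
        return sentence  # nothing to replace
    i = rng.choice(noun_positions)
    replacement = nearest(words[i] + "_NOUN")
    if replacement is not None:
        words[i] = replacement.split("_")[0]  # drop the POS tag
    return " ".join(words)


# Invented stand-ins for the morphological analyzer and the model
toy_pos = {"человек": "NOUN", "видеть": "VERB", "обезьяна": "NOUN"}
toy_neighbors = {"человек_NOUN": "мужчина_NOUN", "обезьяна_NOUN": "шимпанзе_NOUN"}

result = change_random_noun(
    "человек видеть обезьяна",
    pos_of=toy_pos.get,
    nearest=toy_neighbors.get,
    rng=random.Random(0),
)
print(result)
```

With the objects from this notebook, `pos_of` could be `lambda w: analyser.parse(w)[0].tag.POS`, and `nearest` could be `lambda key: model_ru.most_similar(positive=[key], topn=1)[0][0] if key in model_ru else None`.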

## How to train your own model

As training data we take labeled and unlabeled movie reviews (the dataset comes from Kaggle).

In [31]:
data = pd.read_csv(r"C:\AI\NLP\PC_4\unlabeledTrainData.tsv.txt", header=0, delimiter="\t", quoting=3)

len(data)

50000

In [32]:
data.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


We remove links, HTML markup and non-alphabetic characters from the data, then convert everything to lowercase and tokenize. The output is an array of sentences, each of which is an array of words. A tokenizer from the `nltk` library is used to split the text into sentences.

In [68]:
import nltk
nltk.data.path.append(r"C:\AI\NLP\PC_4")
nltk.download('punkt', download_dir=r"C:\AI\NLP\PC_4")

[nltk_data] Downloading package punkt to C:\AI\NLP\PC_4...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
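# The download directory is already on nltk.data.path, so we load the tokenizer
# by resource name; an absolute Windows path is misread as a URL with scheme "c".
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")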

In [51]:
def review_to_wordlist(review, remove_stopwords=False):
    # remove links
    review = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", " ", review)
    # extract the plain text from the HTML
    review_text = BeautifulSoup(review, "lxml").get_text()
    # keep only alphabetic characters
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # convert to lowercase and split into words on whitespace
    words = review_text.lower().split()
    if remove_stopwords:  # drop stop words
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words

def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # split the review into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    # apply the previous function to each sentence
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences

In [52]:
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = []

print("Parsing sentences from training set...")
for review in data["review"]:
    sentences += review_to_sentences(review, tokenizer)

Parsing sentences from training set...



In [40]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [41]:
print(len(sentences))
print(sentences[0])

529416
['watching', 'time', 'chasers', 'it', 'obvious', 'that', 'it', 'was', 'made', 'by', 'a', 'bunch', 'of', 'friends']


In [43]:
# we will need this later

with open('clean_text.txt', 'w') as f:
    for s in sentences[:5000]:
        f.write(' '.join(s))
        f.write('\n')

We train and save the model.


Main parameters:
* the data must be an iterable object,
* `vector_size` - vector size,
* `window` - size of the context window,
* `min_count` - minimum frequency of a word in the corpus,
* `sg` - training algorithm (0 - CBOW, 1 - Skip-Gram),
* `sample` - threshold for downsampling high-frequency words,
* `workers` - number of threads,
* `alpha` - learning rate,
* `epochs` - number of iterations,
* `max_vocab_size` - limits memory use while building the vocabulary (if the limit is exceeded, low-frequency words are dropped). For comparison: 10 million words = 1 GB of RAM.
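To give a feel for what the `sample` threshold does, here is a small sketch of the subsampling rule described in the second Word2Vec paper: a word with relative frequency f is discarded with probability 1 - sqrt(t / f), where t is the threshold. The function name `discard_probability` is made up for the example, and actual implementations may use a slightly different variant of the formula.

```python
import math


def discard_probability(word_freq, threshold=1e-3):
    """Probability of dropping one occurrence of a word with relative frequency word_freq."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


# Relative frequencies chosen for illustration: very frequent, at threshold, rare
for freq in (0.05, 0.001, 0.0001):
    print(freq, round(discard_probability(freq), 4))
```

Words rarer than the threshold are never discarded, while very frequent words such as "the" lose most of their occurrences, which speeds up training and reduces their dominance in the contexts.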

**NB!** Note that training a model does not include preprocessing! Removing punctuation, lowercasing words, lemmatizing them and attaching part-of-speech tags all have to be done before training (if your task needs it, of course). In other words, words appear in the model in exactly the form they have in the source text.

In [44]:
print("Training model...")

%time model_en = word2vec.Word2Vec(sentences, workers=4, vector_size=300, min_count=10, window=10, sample=1e-3)

Training model...


Let's see how many words the model contains.

In [45]:
print(len(model_en.wv.key_to_index))

28308


Let's try to evaluate the model manually by solving examples. A few are given below; try to come up with your own.

In [49]:
print(model_en.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(model_en.wv.most_similar(positive=["dogs", "man"], negative=["dog"], topn=1))

print(model_en.wv.most_similar("usa", topn=3))

print(model_en.wv.doesnt_match("comedy thriller western novel".split()))

[('princess', 0.5563912391662598), ('queen', 0.5294706225395203), ('jane', 0.49821794033050537)]
[('men', 0.6301687955856323)]
[('germany', 0.7327558994293213), ('north', 0.7213141322135925), ('canada', 0.7151143550872803)]
novel


### How to fine-tune an existing model

During training, the model weights are initialized randomly, but you can instead initialize the weight vectors from a pre-trained model, thereby fine-tuning it.

First, let's look at the similarity of some pair of words in the existing model, so that we can later compare it with the fine-tuned one.

In [55]:
model_en.wv.similarity('italy', 'north')

0.6706852

As additional training data, we take the English text of "Through the Looking-Glass".

In [56]:
with open("alice.txt", 'r', encoding='utf-8') as f:
    text = f.read()

# remove line breaks and split the text into sentences
text = re.sub('\n', ' ', text)
sents = sent_tokenize(text)

# strip punctuation from the edges of each token; tokens are split on whitespace
punct = r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~„“«»†*—/\-‘’'
clean_sents = []
for sent in sents:
    s = [w.lower().strip(punct) for w in sent.split()]
    clean_sents.append(s)

print(clean_sents[:2])

[['through', 'the', 'looking-glass', 'by', 'lewis', 'carroll', 'chapter', 'i', 'looking-glass', 'house', 'one', 'thing', 'was', 'certain', 'that', 'the', 'white', 'kitten', 'had', 'had', 'nothing', 'to', 'do', 'with', 'it', '', 'it', 'was', 'the', 'black', 'kitten’s', 'fault', 'entirely'], ['for', 'the', 'white', 'kitten', 'had', 'been', 'having', 'its', 'face', 'washed', 'by', 'the', 'old', 'cat', 'for', 'the', 'last', 'quarter', 'of', 'an', 'hour', 'and', 'bearing', 'it', 'pretty', 'well', 'considering', 'so', 'you', 'see', 'that', 'it', 'couldn’t', 'have', 'had', 'any', 'hand', 'in', 'the', 'mischief']]




To fine-tune the model, you first need to save it and then load it. All training parameters (vector size, minimum word frequency, etc.) are taken from the loaded model, i.e. you cannot set them again.

**NB!** Only a full model can be fine-tuned; `KeyedVectors` cannot. The model must therefore be saved in the appropriate format. Read more about the difference XX_MarkDown_Link_XX.

In [57]:
model_path = "movie_reviews.model"

print("Saving model...")
model_en.save(model_path)

Saving model...


In [58]:
model = word2vec.Word2Vec.load(model_path)

model.build_vocab(clean_sents, update=True)
model.train(clean_sents, total_examples=model.corpus_count, epochs=5)

(97172, 150225)

The lion and the rabbit have become closer to each other!

In [60]:
model.wv.similarity('lion', 'cat')

0.36905712

You can normalize the vectors, and the model will then take up less RAM. After that, however, it can no longer be fine-tuned. L2 normalization is used here: each vector is scaled so that the sum of the squares of its elements equals 1.
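As a quick illustration of what L2 normalization means for a single vector, here is a plain-Python sketch; the helper name `l2_normalize` is made up for this example.

```python
import math


def l2_normalize(vector):
    """Scale the vector so that the sum of squares of its elements is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # a zero vector cannot be normalized
    return [x / norm for x in vector]


v = [3.0, 4.0]
unit = l2_normalize(v)
print(unit)
print(sum(x * x for x in unit))  # approximately 1, up to floating-point error
```

After normalization the dot product of two vectors equals their cosine similarity, which is why normalized vectors are convenient for nearest-neighbor search.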

In addition, we will save not the full model but only the `KeyedVectors`.

In [61]:
# model.init_sims(replace=True)
model_path = "movies_alice.bin"

print("Saving model...")
model.wv.save_word2vec_format(model_path, binary=True)

Saving model...


## Evaluation

That is all well and good, but how do we tell which model is better? Or, say, I have built my own model: how do I know how good it is?

There are special datasets for evaluating distributional models. The two main kinds are: one measures accuracy on analogy problems (the ones about Russia and dumplings), and the other is used to evaluate semantic similarity scores.

### Word Similarity

This method evaluates how well the model's notion of the semantic similarity of words matches that of people.

| Word 1 | Word 2 | Similarity |
|------------|------------|----------|
| cat | dog | 0.7 |
| cup | mug | 0.9 |

For each pair of words in a pre-built dataset we can compute the cosine similarity and obtain a list of such values. We also have a list of values assigned by people. We can compare the two lists and see how similar they are (for example, by computing their correlation). This measure of agreement tells us how well the model captures distances between words.
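The comparison of the two lists can be sketched with a rank correlation. Below is a plain-Python Spearman correlation; the choice of Spearman rather than another correlation measure is an assumption made for this example. The first two human scores repeat the table above, and all remaining numbers are invented.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


human = [0.7, 0.9, 0.1, 0.4]      # human judgments (last two are invented)
model = [0.62, 0.71, 0.05, 0.55]  # invented model cosine similarities
print(spearman(human, model))
```

A rank correlation only asks whether the model orders the pairs the same way people do, so it does not matter that the two scales differ.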

### Analogies

Another popular task for "intrinsic" evaluation is analogy search. As discussed above, simple arithmetic operations let us modify the meaning of a word. If we collect in advance a set of modifier words, together with the words we expect to obtain after the modification, we can evaluate how well the model works by counting how often it "hits" the expected word.

Semantic analogies can serve as modifiers. Say we have a "country-capital" relation; then we can evaluate the model with pairs such as "Russia-Moscow", "Norway-Oslo", etc. The dataset looks like this:

| Word 1 | Word 2 | Relation |
|------------|------------|---------------|
| Russia | Moscow | country-capital |
| Norway | Oslo | country-capital |

Taking two random pairs from this set, we form a triple (Russia, Moscow, Norway) and want to obtain the word "Oslo", i.e. find the word that stands in the same relation to "Norway" as "Moscow" does to "Russia".
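The counting of "hits" can be sketched as follows. The vectors are hand-made so that the country-capital offset is the same for both pairs, and the helper names `solve` and `analogy_accuracy` are assumptions for this illustration; a real evaluation would use the model's vectors and a full analogy dataset.

```python
import math

# Invented toy vectors: the second axis plays the role of "is a capital"
vectors = {
    "russia": [1.0, 0.0, 0.2],
    "moscow": [1.0, 1.0, 0.2],
    "norway": [0.0, 0.0, 1.0],
    "oslo": [0.0, 1.0, 1.0],
    "pizza": [-1.0, -0.5, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve(a, b, c, vectors):
    """Return the word d with a : b = c : d, excluding the three query words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def analogy_accuracy(questions, vectors):
    """Share of questions (a, b, c, expected_d) answered correctly."""
    hits = sum(solve(a, b, c, vectors) == d for a, b, c, d in questions)
    return hits / len(questions)


questions = [
    ("russia", "moscow", "norway", "oslo"),
    ("norway", "oslo", "russia", "moscow"),
]
print(analogy_accuracy(questions, vectors))
```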

Datasets for Russian can be downloaded from the models page on RusVectōrēs. Let's compute the quality of our RNC model on the analogy dataset:

In [65]:
# the file ships with Gensim's test data, so we resolve its location with datapath
file_txt = open(datapath('questions-words.txt'), 'r')
file_txt.readline()

In [62]:
res = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

In [63]:
res[0]

0.2351340455553316

## Visualization

You can inspect the resulting model by visualizing it, for example on a plane.

### t-SNE

**t-SNE** (*t-distributed stochastic neighbor embedding*) is a technique for nonlinear dimensionality reduction and visualization of multidimensional variables. It was designed specifically for high-dimensional data by L. van der Maaten and G. Hinton, XX_MarkDown_Link_XX. t-SNE is an iterative algorithm based on computing pairwise distances between all objects, which is one reason it is rather slow.


Let's plot the 1000 most frequent words from the movie review collection on a plane:

In [48]:
top_words = []


fd = FreqDist()
for s in tqdm(sentences):
    fd.update(s)

for w in fd.most_common(1000):
    top_words.append(w[0])

print(top_words[:50:])
top_words_vec = model.wv[top_words]


['the', 'and', 'a', 'of', 'to', 'is', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'with', 'for', 'movie', 'but', 'film', 'you', 't', 'on', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'they', 'at', 'by', 'who', 'an', 'from', 'so', 'like', 'there', 'or', 'her', 'just', 'about', 'out', 'has', 'if', 'what', 'some', 'good', 'can']


In [49]:
top_words_vec = model.wv[top_words]

In [53]:
!pip install --upgrade numpy scikit-learn threadpoolctl



In [58]:
!pip install MulticoreTSNE


In [59]:
%%time
tsne = TSNE(n_components=2, random_state=0)


CPU times: user 19 μs, sys: 179 μs, total: 198 μs
Wall time: 280 μs


In [61]:
!conda install -c conda-forge scikit-learn





In [60]:
top_words_tsne = tsne.fit_transform(top_words_vec)


In [51]:
output_notebook()

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE (eng model, top1000 words)")

source = ColumnDataSource(data=dict(x1=top_words_tsne[:,0],
                                    x2=top_words_tsne[:,1],
                                    names=top_words))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)


To compute the t-SNE transformation faster (and sometimes more effectively), you can first reduce the dimensionality of the source data with, for example, SVD, and then apply t-SNE:

In [None]:
svd_50 = TruncatedSVD(n_components=50)
top_words_vec_50 = svd_50.fit_transform(top_words_vec)
top_words_tsne2 = TSNE(n_components=2, random_state=0).fit_transform(top_words_vec_50)

In [None]:
output_notebook()

p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="word2vec T-SNE (eng model, top1000 words, +SVD)")

source = ColumnDataSource(data=dict(x1=top_words_tsne2[:,0],
                                    x2=top_words_tsne2[:,1],
                                    names=top_words))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)

show(p)

## fastText

fastText uses not only word embeddings but also character n-grams. Each word is automatically represented as a set of character n-grams. For example, with n = 3 the vector for the word "where" is the sum of the vectors of the following trigrams: "<wh", "whe", "her", "ere", "re>" (where "<" and ">" mark the beginning and the end of the word). Thanks to this, we can also obtain vectors for words that are absent from the vocabulary, and work effectively with texts containing errors and typos.
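The n-gram decomposition described above can be sketched in a few lines. This is a minimal illustration, not fastText's actual implementation; the function name `char_ngrams` is hypothetical, and its `minn`/`maxn` arguments only mirror the names of the fastText training parameters.

```python
def char_ngrams(word, minn=3, maxn=3):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

With `minn=3, maxn=4` the same function would also return the 4-grams, which corresponds to the n-gram range used for training later in this section.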

* [Paper](https://aclweb.org/anthology/Q17-1010)
* [Website](https://fasttext.cc/)
* [Tutorial](https://fasttext.cc/docs/en/support.html)
* [Vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)
* [Vectors trained on Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html) (separately for 294 different languages)
* [Repository](https://github.com/facebookresearch/fasttext)

There is a `fasttext` library for Python (pretrained models can also be used through `gensim`).

In [66]:
# this is how you can train your own model
ft_model = fasttext.train_unsupervised('clean_text.txt', minn=3, maxn=4, dim=300)

Read 0M words
Number of words:  2436
Number of labels: 0
Progress: 100.0% words/sec/thread:  102369 lr:  0.000000 avg.loss:  2.701468 ETA:   0h 0m 0s


In [67]:
ft_model.get_word_vector("movie")[:20]

array([-0.06825212,  0.0815747 , -0.09535978, -0.13768125,  0.133812  ,
        0.06558251, -0.19313507,  0.09495368, -0.05298023,  0.07751375,
       -0.03180939, -0.02382968, -0.07812291,  0.10220659,  0.02760745,
        0.22049384, -0.18999249, -0.12947278,  0.05135653,  0.04323684],
      dtype=float32)

In [69]:
ft_model.get_nearest_neighbors('queen')

[(0.9998032450675964, 'green'),
 (0.9996938705444336, 'between'),
 (0.9996770024299622, 'teen'),
 (0.9995895028114319, 'halloween'),
 (0.9995279312133789, 'ten'),
 (0.9994232058525085, 'aren'),
 (0.999408483505249, 'golden'),
 (0.9993984699249268, 'alien'),
 (0.9993875622749329, 'given'),
 (0.9993857741355896, 'screen')]

In [70]:
ft_model.get_analogies("woman", "man", "actor")

[(0.9992685317993164, 'act'),
 (0.9990484714508057, 'directors'),
 (0.9989953637123108, 'director'),
 (0.9989573955535889, 'connection'),
 (0.9989447593688965, 'position'),
 (0.9989436864852905, 'direct'),
 (0.9989328980445862, 'actions'),
 (0.9989065527915955, 'solution'),
 (0.9989024996757507, 'actors'),
 (0.9988884329795837, 'addition')]

In [66]:
# the typo problem is solved

ft_model.get_nearest_neighbors('actr')

[(0.9995944499969482, 'act'),
 (0.9991607666015625, 'actions'),
 (0.9990671873092651, 'actors'),
 (0.9990625977516174, 'actress'),
 (0.9989418387413025, 'incredible'),
 (0.9989348649978638, 'superb'),
 (0.9989345073699951, 'actor'),
 (0.9988469481468201, 'direction'),
 (0.9988455772399902, 'attractive'),
 (0.9987985491752625, 'impression')]

In [67]:
# and so is the out-of-vocabulary problem

ft_model.get_nearest_neighbors('moviegeek')

[(0.9995163083076477, 'move'),
 (0.9994441866874695, 'just'),
 (0.9993472695350647, 'wait'),
 (0.99933922290802, 'reviews'),
 (0.9992921352386475, 'view'),
 (0.9992758631706238, 'wax'),
 (0.9991967678070068, 'watchable'),
 (0.9991733431816101, 'remake'),
 (0.999147891998291, 'review'),
 (0.9991344809532166, 'did')]
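The behaviour on typos and unseen words follows from composing a word vector out of its character n-grams. The sketch below is a toy model under stated assumptions: the n-gram vectors are deterministic pseudo-random stand-ins rather than learned embeddings, the dimensionality `DIM` is tiny, and all function names are hypothetical. As far as we know, real fastText also hashes n-grams into a fixed number of buckets, which is omitted here.

```python
import math
import random

DIM = 8  # toy dimensionality, far smaller than the dim=300 used above


def char_ngrams(word, minn=3, maxn=4):
    """Character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def toy_ngram_vector(ngram):
    """Deterministic pseudo-random vector standing in for a learned n-gram embedding."""
    rng = random.Random(ngram)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def toy_word_vector(word):
    """Average the vectors of all character n-grams of the word."""
    vectors = [toy_ngram_vector(g) for g in char_ngrams(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def shared_ngrams(a, b):
    """N-grams that two words have in common."""
    return sorted(set(char_ngrams(a)) & set(char_ngrams(b)))


print(shared_ngrams("actr", "actor"))
# ['<ac', '<act', 'act']
print(cosine(toy_word_vector("actr"), toy_word_vector("actor")))
```

Because "actr" shares several n-grams with "actor", part of its vector is built from the same components. The same mechanism is a likely reason why, on a corpus this small, the neighbours of "queen" are dominated by words with similar spelling ("green", "between", "teen") rather than similar meaning.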

In [69]:
! wget -O positive.csv "https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0"
! wget -O negative.csv "https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv?dl=0"


In [70]:
positive = pd.read_csv('positive.csv', sep=';', usecols=[3], names=['text'])
positive['label'] = ['positive'] * len(positive)
negative = pd.read_csv('negative.csv', sep=';', usecols=[3], names=['text'])
negative['label'] = ['negative'] * len(negative)
df = pd.concat((positive, negative))
df.head()

Unnamed: 0,text,label
0,"@first_timee хоть я и школота, но поверь, у на...",positive
1,"Да, все-таки он немного похож на него. Но мой ...",positive
2,RT @KatiaCheh: Ну ты идиотка) я испугалась за ...,positive
3,"RT @digger2912: ""Кто то в углу сидит и погибае...",positive
4,@irina_dyshkant Вот что значит страшилка :D\nН...,positive


In [71]:
len(df)

226834

We will apply standard preprocessing:

In [82]:
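m = MorphAnalyzer()

# Raw string avoids invalid-escape warnings; A-Za-z replaces A-z,
# which also matched the characters [ \ ] ^ and ` that sit between Z and a.
regex = re.compile(r"[А-Яа-я:=!\)\(A-Za-z_%/|]+")

def words_only(text, regex=regex):
    """Return the tokens matching the pattern, or [] for non-string input (e.g. NaN)."""
    try:
        return regex.findall(text)
    except TypeError:
        return []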
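To see what this tokenizer keeps and drops, here is a small self-contained check. It uses a raw-string variant of the pattern with an explicit `A-Za-z` range (that rewrite is an assumption of this sketch), and the name `token_re` is hypothetical. The last two lines illustrate why the range `A-z` is usually avoided.

```python
import re

# Variant of the pattern above: raw string, explicit Latin range.
token_re = re.compile(r"[А-Яа-я:=!\)\(A-Za-z_%/|]+")

# Letters and emoticon characters are kept; commas and spaces separate tokens.
print(token_re.findall("Привет, мир :)"))
# ['Привет', 'мир', ':)']

# Digits and '+' are not in the character class, '=' is.
print(token_re.findall("2 + 2 = 4"))
# ['=']

# The range A-z also covers the punctuation between 'Z' and 'a', such as '^'.
print(re.findall(r"[A-z]+", "a^b"))
# ['a^b']
print(re.findall(r"[A-Za-z]+", "a^b"))
# ['a', 'b']
```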


In [83]:
#@lru_cache(maxsize=128)
# if you are not working in Colab, you can replace pymorphy with mystem and uncomment the first line about lru_cache
def lemmatize(text, pymorphy=m):
    # `text` is a list of tokens; the result is a space-joined string of lemmas
    try:
        return " ".join([pymorphy.parse(w)[0].normal_form for w in text])
    except Exception:
        return " "

In [84]:
def clean_text(text):
    return lemmatize(words_only(text))

In [None]:
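# Note: with the "spawn" start method (the default on macOS and Windows), worker
# processes may fail to unpickle functions defined in a notebook cell. In that case,
# move clean_text into an importable module or run it sequentially:
# lemmas = [clean_text(t) for t in tqdm(df['text'])]
with Pool(8) as p:
    lemmas = list(tqdm(p.imap(clean_text, df['text']), total=len(df)))

df['lemmas'] = lemmas
df.head()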

We split the data and write it out in the format fastText expects for training a classifier:

In [87]:
X = df.lemmas.tolist()
y = df.label.tolist()

X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(f"total train examples {len(y_train)}")
print(f"total test examples {len(y_test)}")

In [51]:
# fastText expects one example per line: "__label__<label> <text>"
with open('data.train.txt', 'w', encoding='utf-8') as outfile:
    for label, text in zip(y_train, X_train):
        outfile.write(f'__label__{label} {text}\n')


with open('test.txt', 'w', encoding='utf-8') as outfile:
    for label, text in zip(y_test, X_test):
        outfile.write(f'__label__{label} {text}\n')
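The line format written above can be isolated in a small helper. This is a sketch only: the function `to_fasttext_line` is hypothetical, and collapsing whitespace is an extra precaution added here (a newline inside a text would otherwise split one example into two lines). `__label__` is the default label prefix in fastText.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{prefix}{label} {single_line}"


print(to_fasttext_line("positive", "хороший  фильм\n:)"))
# __label__positive хороший фильм :)
```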

In [52]:
classifier = fasttext.train_supervised('data.train.txt')
# test() returns a tuple: (number of examples, precision@1, recall@1)
result = classifier.test('test.txt')

print('P@1:', result[1])
print('R@1:', result[2])
print('Number of examples:', result[0])

P@1: 0.8978839371593459
R@1: 0.8978839371593459
Number of examples: 74856
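P@1 and R@1 coincide here because every example has exactly one gold label and the classifier returns exactly one prediction, so both reduce to plain accuracy. A minimal sketch of the two metrics, assuming the usual definitions (correct predictions divided by the number of predictions, and by the number of gold labels); the function name and the toy data are hypothetical:

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """true_labels: list of sets of gold labels; predicted_labels: list of top-1 predictions."""
    hits = sum(pred in gold for gold, pred in zip(true_labels, predicted_labels))
    n_predictions = len(predicted_labels)            # one prediction per example
    n_gold = sum(len(gold) for gold in true_labels)  # total number of gold labels
    return hits / n_predictions, hits / n_gold


gold = [{"positive"}, {"negative"}, {"positive"}, {"negative"}]
pred = ["positive", "positive", "positive", "negative"]
print(precision_recall_at_1(gold, pred))
# (0.75, 0.75)
```

The two values would diverge only in a multi-label setting, where an example can carry more gold labels than the single prediction made for it.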
