## Домашнее задание 1
## Harry Potter and the Action Prediction Challenge from Natural Language


В этом домашнем задании вы будете работать с корпусом Harry Potter and the Action Prediction Challenge. Корпус собран из фанфиков о Гарри Поттере и состоит из двух частей: 1) сырые тексты, 2) фрагменты текстов, описывающих ситуацию, в которой произнесено заклинание.

Корпус описан в статье: https://arxiv.org/pdf/1905.11037.pdf

David Vilares and Carlos Gómez-Rodríguez. Harry Potter and the Action Prediction Challenge from Natural Language. 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics. To appear.

Код для сбора корпуса находится в репозитории: https://github.com/aghie/hpac . Корпус можно скачать по инструкции из этого репозитория, но для экономии времени авторы задания уже скачали и подготовили данные к работе.

Ссылки на собранный корпус:
* Сырые тексты:  https://www.dropbox.com/s/12toaj67fjrguhd/hpac_raw.zip?dl=0
* Токенизированные тексты в нижнем регистре: https://www.dropbox.com/s/1ndto6dce5wg7j2/hpac_lower_tokenized.zip?dl=0
* train-test-dev: https://www.dropbox.com/s/ftinwwjfyydevth/hpac_splits.zip?dl=0

Части 1, 2 задания должны быть выполнены на полных текстах (сырых или предобработанных -- на ваше усмотрение), Часть 3 – на разбиение на тестовое, отладочное и обучающее множества. Тестовое множество должно быть использовано исключительно для тестирования моделей, обучающее и отладочное – для выбора модели и параметров.

В статье и репозитории вы найдете идеи, которые помогут вам выполнить домашнее задание. Их стоит воспринимать как руководство к действию, и не стоит их копировать и переиспользовать. Обученные модели использовать не нужно, код для их обучения можно использовать как подсказку.

## ПРАВИЛА
1. Домашнее задание выполняется индивидуально.
2. Домашнее задание сдается в системе Anytask, куда можно попасть через инвайт.
3. Домашнее задание оформляется в отчета в ipython-тетрадке.
4. Отчет должен содержать: нумерацию заданий и пунктов, которые вы выполнили, код решения, и понятное пошаговое описание того, что вы сделали. Отчет должен быть написан в академическом стиле, без излишнего использования сленга и с соблюдением норм русского языка.
5. Не стоит копировать фрагменты лекций, статей и Википедии в ваш отчет.
6. Плагиат и любое недобросоветсное цитирование приводит к обнуление оценки.



### Данные


train, test, dev файлы

In [None]:
import pandas as pd

In [None]:
df_train = pd.read_csv('data/hpac_splits/hpac_training_128.tsv', sep = '\t', header = None)
df_val = pd.read_csv('data/hpac_splits/hpac_dev_128.tsv', sep = '\t', header = None)

df_test = pd.read_csv('data/hpac_splits/hpac_test_128.tsv', sep = '\t', header = None)

### Как использовать WordNet из nltk?

In [None]:
# скачиваем WordNet
import nltk
nltk.download('wordnet')

In [None]:
# слово -> множество синсетов (синонимов разных смыслов исходного слова)
from nltk.corpus import wordnet as wn
wn.synsets('magic')

In [None]:
# посмотрим, что внутри одного синсета
wn.synsets('magic')[1].lemmas()[0]

In [None]:
# возьмем лемму одного из слов из синсета
wn.synsets('magic')[1].lemmas()[-1].name()


## Часть 1. [1 балл] Эксплоративный анализ


In [75]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk import FreqDist
from itertools import combinations

import os
from collections import Counter
import matplotlib.pyplot as plt
from tqdm import (
    tqdm,
    trange,
)
import pickle
import pandas as pd

import gc
gc.enable()

import requests
import contractions
from tqdm import tqdm
tqdm.pandas()

from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
from nltk.tokenize import word_tokenize
import string
import re 

nltk.download('stopwords')
nltk.download('punkt')

stopwords_ = set(stopwords.words('english'))
noise = (
    '…', '—', '_', 
    'ⅰ', 'ⅱ', 'ⅲ', 'ⅳ', 'ⅴ', 'ⅵ', 'ⅶ', 'ⅷ', 'ⅸ', 'ⅹ', 
    '⁰', '¹', '²', '³', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹', '',
    '⅞', '¾',  '⅙', '⅔', '½', '¼', '⅛', '⅖',
    '　', '', '๑', 
    '', '', '', 
    )


INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mikha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mikha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
def remove_URL(text):
    """
        Remove URLs from a sample string
    """
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_html(text):
    """
        Remove the html in sample text
    """
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)

def remove_non_ascii(text):
    """
        Remove non-ASCII characters 
    """
    return re.sub(r'[^\x00-\x7f]',r'', text)

def remove_special_characters(text):
    """
        Remove special special characters, including symbols, emojis, and other graphic characters
    """
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_punct(text):
    """
        Remove the punctuation
    """
#     return re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
    return text.translate(str.maketrans('', '', string.punctuation))

In [3]:
folder_path = "data/fanfiction_texts/"

In [4]:
dataframe = pd.DataFrame(columns=['base_text'])

for file_name in tqdm(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, file_name)
    
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()

    dataframe.loc[len(dataframe), :] = text

  0%|          | 0/36225 [00:00<?, ?it/s]

100%|██████████| 36225/36225 [00:52<00:00, 685.56it/s] 


In [5]:
dataframe.drop_duplicates(inplace=True)
dataframe.reset_index(drop=True, inplace=True)

In [6]:
dataframe["text"] = dataframe['base_text']

In [7]:
print(len(dataframe))
dataframe.head()

36222


Unnamed: 0,base_text,text
0,"First, Harry heard a faint cursing, but this w...","First, Harry heard a faint cursing, but this w..."
1,Title: Questions Author: persephoneapple Pai...,Title: Questions Author: persephoneapple Pai...
2,"Harry Potter and all characters, etc. belong t...","Harry Potter and all characters, etc. belong t..."
3,Disclaimer: Harry Potter belongs to Rowling. ...,Disclaimer: Harry Potter belongs to Rowling. ...
4,Authors note: I hope you all enjoy this story....,Authors note: I hope you all enjoy this story....


### preprocessing

In [8]:
dataframe["text"] = dataframe.text.parallel_apply(contractions.fix)
dataframe["text"] = dataframe.text.apply(remove_URL)
dataframe["text"] = dataframe.text.apply(remove_html)
dataframe["text"] = dataframe.text.apply(remove_non_ascii)
dataframe["text"] = dataframe.text.apply(remove_special_characters)
gc.collect()

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=4528), Label(value='0 / 4528'))), …

0

In [9]:
dataframe["tokenize_text"] = dataframe["text"].parallel_apply(word_tokenize)
dataframe["clean_text"] = dataframe["tokenize_text"].apply(lambda x: ' '.join([word for word in x if word.lower() not in stopwords_]))
gc.collect()

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=4528), Label(value='0 / 4528'))), …

0

In [12]:
dataframe[['clean_text']].to_csv('dataframe.csv')

In [14]:
pd.read_parquet('dataframe.parquet')

Unnamed: 0,clean_text
0,"First , Harry heard faint cursing , nothing un..."
1,Title : Questions Author : persephoneapple Pai...
2,"Harry Potter characters , etc . belong J.K. Ro..."
3,Disclaimer : Harry Potter belongs Rowling . pl...
4,Authors note : hope enjoy story . looked anyon...
...,...
36217,Disclaimer : Harry Potter series . belong J.K....
36218,A/N : little thing wrote amusement . Probably ...
36219,always thought amusing Killing curse green gre...
36220,Warning : Rated Mature upsetting nonconsensual...


## FREQ TASK

### 1. Найдите топ-1000 слов по частоте без учета стоп-слов.

In [12]:
df = pd.read_parquet('dataframe.parquet')
df.head()

Unnamed: 0,clean_text
0,"First , Harry heard faint cursing , nothing un..."
1,Title : Questions Author : persephoneapple Pai...
2,"Harry Potter characters , etc . belong J.K. Ro..."
3,Disclaimer : Harry Potter belongs Rowling . pl...
4,Authors note : hope enjoy story . looked anyon...


In [13]:
df['lower_text_no_punct'] = df.clean_text.str.lower().apply(remove_punct)
df['lwower_tokenize'] = df['lower_text_no_punct'].parallel_apply(word_tokenize)
df['word_counter'] = df['lwower_tokenize'].apply(Counter)

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=4528), Label(value='0 / 4528'))), …

In [14]:
all_word_counter = df['word_counter'].sum()

with open('all_word_counter.pkl', 'wb') as f:
    pickle.dump(all_word_counter, f)

In [66]:
top_1000 = all_word_counter.most_common(1000)
pd.DataFrame(top_1000, columns=['word', 'freq'])

Unnamed: 0,word,freq
0,harry,3975920
1,s,2769627
2,would,2754058
3,said,2261000
4,hermione,1820846
...,...,...
995,huge,41231
996,joined,41216
997,crossed,41174
998,comes,41095


### 2. Найдите топ-10 по частоте: имен, пар имя + фамилия, пар вида ''профессор'' + имя / фамилия.

In [16]:
df = pd.read_parquet('dataframe.parquet')
df.head()

Unnamed: 0,clean_text
0,"First , Harry heard faint cursing , nothing un..."
1,Title : Questions Author : persephoneapple Pai...
2,"Harry Potter characters , etc . belong J.K. Ro..."
3,Disclaimer : Harry Potter belongs Rowling . pl...
4,Authors note : hope enjoy story . looked anyon...


In [17]:
import re

global extract_names
def extract_names(text, pattern):
    return re.findall(pattern, text)

re_patterns = {
    'names': r"\b([A-Z][a-zA-Z]*)\b",
    'full_names': r"\b([A-Z][a-zA-Z]*)\s([A-Z][a-zA-Z]*)\b",#r"\b(?![Pp]rofessor)\b([A-Z][a-zA-Z]*)\s([A-Z][a-zA-Z]*)\b",
    'professors': r"\b(?:[Pp]rofessor) [A-Z][a-zA-Z]+\b", #r"\b(?:[Pp]rofessor)\s([A-Z][a-zA-Z]*)\b",
}

In [18]:
pattern_freq_dict = dict()
for pattern in tqdm(re_patterns.keys()):
    df_pattern = df[["clean_text"]].copy()
    df_pattern[f"{pattern}"] = df_pattern["clean_text"].apply(lambda x: extract_names(x, re_patterns[pattern]))
    df_pattern[f"counter_{pattern}"] = df_pattern[f"{pattern}"].apply(Counter)
    
    df_pattern.to_csv(f"{pattern}_pattern.csv")
    df_pattern = df_pattern[[f"counter_{pattern}"]]
    
    pattern_freq_dict[pattern] = df_pattern[f"counter_{pattern}"].sum()

    del df_pattern
    gc.collect()

100%|██████████| 3/3 [2:39:56<00:00, 3198.91s/it]  


In [19]:
with open('pattern_freq_dict.pkl', 'wb') as f:
    pickle.dump(pattern_freq_dict, f)

### names

In [20]:
pd.DataFrame(pattern_freq_dict['names'].most_common(10), columns=['name', 'freq'])

Unnamed: 0,name,freq
0,Harry,3978917
1,Hermione,1824702
2,Draco,1384301
3,Ron,901258
4,Severus,653342
5,Ginny,636865
6,Sirius,627168
7,Potter,599628
8,Snape,597333
9,Malfoy,482966


### full names

In [22]:
pattern_freq_dict["full_names"].most_common(10)

[(('Dark', 'Lord'), 125836),
 (('Harry', 'Potter'), 122907),
 (('Death', 'Eaters'), 103013),
 (('Death', 'Eater'), 65974),
 (('Great', 'Hall'), 57613),
 (('Draco', 'Malfoy'), 47011),
 (('Professor', 'Snape'), 42019),
 (('Professor', 'McGonagall'), 40079),
 (('Ron', 'Hermione'), 39006),
 (('Miss', 'Granger'), 35868)]

In [29]:
top_full_names = {}
for pair, count in pattern_freq_dict["full_names"].most_common():
    if (pair[0] not in ('Mrs', 'Mr', 'Miss', 'Professor', 'Madam')) and pair not in (('Great', 'Hall'), ('Death', 'Eaters'), ('Dark', 'Lord'), ('Death', 'Eater'), ('Harry', 'Ron'), ('Lord', 'Voldemort'), ('Harry', 'Hermione'), ('Fred', 'George')):
        top_full_names[pair] = count
        if len(top_full_names) == 10:
            break
pd.DataFrame(Counter(top_full_names).most_common(10), columns=['name', 'freq'])

Unnamed: 0,name,freq
0,"(Harry, Potter)",122907
1,"(Draco, Malfoy)",47011
2,"(Ron, Hermione)",39006
3,"(Severus, Snape)",29636
4,"(Hermione, Granger)",29635
5,"(Sirius, Black)",26018
6,"(Diagon, Alley)",24209
7,"(Lucius, Malfoy)",22989
8,"(Albus, Dumbledore)",20756
9,"(James, Potter)",20309


NICE !


### professors

In [None]:
with open('pattern_freq_dict.pkl', 'rb') as f:
    pattern_freq_dict = pickle.load(f)

In [23]:
def change_first_letter_to_lower(s):
    if len(s) > 0:
        return s[0].lower() + s[1:]
    return s

In [24]:
professors_counter = Counter()
for key, value in pattern_freq_dict['professors'].items():
    new_key = change_first_letter_to_lower(key)

    professors_counter += Counter({new_key: value})
    
pd.DataFrame(professors_counter.most_common(10), columns=['professor', 'freq'])

Unnamed: 0,professor,freq
0,professor Snape,45093
1,professor McGonagall,42378
2,professor Dumbledore,23951
3,professor Lupin,9988
4,professor Flitwick,9489
5,professor Slughorn,5843
6,professor Sprout,5747
7,professor Trelawney,3158
8,professor Umbridge,2862
9,professor Longbottom,2367


#


[бонус] Постройте тематическую модель по корпусу HPAC.

[бонус] Найдите еще что-то интересное в корпусе (что-то специфичное для фанфиков или фентези-тематики)



## Часть 2. [2 балла] Модели представления слов
Обучите модель представления слов (word2vec, GloVe, fastText или любую другую) на материале корпуса HPAC.


In [13]:
import pandas as pd
import multiprocessing

from gensim.models import Word2Vec
from sklearn.manifold import TSNE

cores = multiprocessing.cpu_count()

In [14]:
df = pd.read_csv('dataframe.csv')
df['lower_text_no_punct'] = df.clean_text.str.lower().apply(remove_punct)
df['lwower_tokenize'] = df['lower_text_no_punct'].parallel_apply(word_tokenize)

corpus = df['lwower_tokenize'].to_list()

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=4528), Label(value='0 / 4528'))), …

In [15]:
# Обучение модели word2vec
model = Word2Vec(corpus, vector_size=100, window=5, min_count=5, workers=cores-1)

In [16]:
# Получение векторных представлений слов
word_embeddings = model.wv

In [17]:
word_embeddings.save_word2vec_format('word_embeddings.bin', binary=True)

In [18]:
# Получение списка всех слов
words = list(word_embeddings.key_to_index.keys())

# Получение векторов слов
word_vectors = word_embeddings[words]

In [19]:
# Снижение размерности векторов с помощью t-SNE
tsne = TSNE(n_components=2, random_state=42)
word_vectors_tsne = tsne.fit_transform(word_vectors)

In [20]:
# Создание DataFrame с информацией о словах и их векторах после t-SNE
word_df = pd.DataFrame(word_vectors_tsne, index=words, columns=['x', 'y'])

In [21]:
# Визуализация топ-1000 слов
top_words = word_df.sort_values(by='x').iloc[:1000]

### 1. Продемонстрируйте, как работает поиск синонимов, ассоциаций, лишних слов в обученной модели.

#### синонимы

In [22]:
def show_similar(w, word_embeddings):
    print('Слова, близкие к слову {}:'.format(w))
    similar_words = word_embeddings.most_similar(positive=[w])
    for w in similar_words:
        print(w)
    print('\n')

In [23]:
for w in ['spell', 'dating', 'gay']:
    show_similar(w, word_embeddings)

Слова, близкие к слову spell:
('counterspell', 0.8453128933906555)
('countercharm', 0.830565333366394)
('charm', 0.8287970423698425)
('spells', 0.8174357414245605)
('countercurse', 0.7937067747116089)
('curse', 0.7585912346839905)
('counterjinx', 0.7533854842185974)
('enchantment', 0.732124388217926)
('nonverbally', 0.7292430400848389)
('silencer', 0.7125468850135803)


Слова, близкие к слову dating:
('dated', 0.8043968677520752)
('fancying', 0.7645979523658752)
('fancied', 0.707612931728363)
('gay', 0.6927344799041748)
('marrying', 0.6773286461830139)
('shagging', 0.6731544733047485)
('talking', 0.658692479133606)
('crush', 0.6546252369880676)
('snogging', 0.651677131652832)
('bisexual', 0.6463444232940674)


Слова, близкие к слову gay:
('bisexual', 0.862683117389679)
('bi', 0.8157513737678528)
('lesbian', 0.7858288884162903)
('dating', 0.6927344799041748)
('fancied', 0.6824777126312256)
('sex', 0.6702308654785156)
('homosexual', 0.6663782000541687)
('heterosexual', 0.6616793870925903

Как видим "dating" = "gay", модель работает хорошо!

#### ассоциации

In [34]:
word_embeddings.most_similar(positive=["slytherin", "harry"], negative=["griffindor"], topn=3)

[('draco', 0.7733852863311768),
 ('malfoy', 0.6323437094688416),
 ('blaise', 0.610161542892456)]

В целом довольно логично

#### лишние слова

In [26]:
word_embeddings.doesnt_match(['batman', 'gosling', 'me', 'rich', 'gym', 'datascience'])

'me'

really sad moment

In [27]:
word_embeddings.doesnt_match(['money', 'girl', 'study', 'work', 'gym'])

'girl'

Как видим модель помогла расставить нам приоритеты и найти от чего нужно сейчас отказаться

### 2. Визуализируйте топ-1000 слов по частоте без учета стоп-слов (п. 1.1) с помощью TSNE или UMAP (https://umap-learn.readthedocs.io).

In [68]:
from sklearn.manifold import TSNE
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

In [77]:
output_notebook()
words_top_vec = word_embeddings[[i[0] for i in top_1000]]
tsne = TSNE(n_components=2, random_state=0)
words_top_tsne = tsne.fit_transform(words_top_vec)
p = figure(tools="pan,wheel_zoom,reset,save",
           toolbar_location="above",
           title="Word2Vec t-SNE for most common words")

source = ColumnDataSource(data=dict(x1=words_top_tsne[:,0],
                                    x2=words_top_tsne[:,1],
                                    names=list([i[0] for i in top_1000])))

p.scatter(x="x1", y="x2", size=8, source=source)

labels = LabelSet(x="x1", y="x2", text="names", y_offset=6,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
p.add_layout(labels)
show(p)

## Часть 3. [6.5 баллов] Классификация текстов
Задача классификации формулируется так: данный фрагмент фанфика описывают какую-то ситуацию, которая предшествует произнесению заклинания. Требуется по тексту предсказать, какое именно заклинание будет произнесено. Таким образом, заклинание - это фактически метка класса. Основная мера качества – macro $F_1$.
Обучите несколько классификаторов и сравните их между собой. Оцените качество классификаторов на частых и редких классах. Какие классы чаще всего оказываются перепутаны? Связаны ли ошибки со смыслом заклинаний?

Используйте фрагменты из множества train для обучения, из множества dev для отладки, из множества test – для тестирования и получения итоговых результатов.



In [1]:
import pandas as pd
from sklearn.metrics import f1_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')
drive_path = '/content/drive/MyDrive/nlp/hw1/'

Mounted at /content/drive


In [None]:
drive_path = '/content/drive/MyDrive/nlp/hw1/'

In [None]:
df_train = pd.read_csv(f'{drive_path}hpac_splits/hpac_training_128.tsv', sep = '\t', header = None)
df_dev = pd.read_csv(f'{drive_path}hpac_splits/hpac_dev_128.tsv', sep = '\t', header = None)
df_test = pd.read_csv(f'{drive_path}hpac_splits/hpac_test_128.tsv', sep = '\t', header = None)

In [None]:
df_train.head(3)

Unnamed: 0,0,1,2
0,7642954.0.676,RIDDIKULUS,were staring at her . she was up next to face ...
1,10443333.0.5753,RIDDIKULUS,"that whole time . her first reaction , for whi..."
2,4703706.0.8690,STUPEFY,we watched his inglorious withdrawal together ...


In [None]:
col_names = ['id', 'spell', 'text']
df_train.columns = col_names
df_dev.columns = col_names
df_test.columns = col_names

In [None]:
df_train.head(3)

Unnamed: 0,id,spell,text
0,7642954.0.676,RIDDIKULUS,were staring at her . she was up next to face ...
1,10443333.0.5753,RIDDIKULUS,"that whole time . her first reaction , for whi..."
2,4703706.0.8690,STUPEFY,we watched his inglorious withdrawal together ...


In [None]:
df_train.drop(columns=['id'], inplace=True)
df_dev.drop(columns=['id'], inplace=True)
df_test.drop(columns=['id'], inplace=True)

In [None]:
df_train.head(3)

Unnamed: 0,spell,text
0,RIDDIKULUS,were staring at her . she was up next to face ...
1,RIDDIKULUS,"that whole time . her first reaction , for whi..."
2,STUPEFY,we watched his inglorious withdrawal together ...


In [None]:
df_train = df_train[['text', 'spell']]
df_dev = df_dev[['text', 'spell']]
df_test = df_test[['text', 'spell']]

In [None]:
df_train.head(3)

Unnamed: 0,text,spell
0,were staring at her . she was up next to face ...,RIDDIKULUS
1,"that whole time . her first reaction , for whi...",RIDDIKULUS
2,we watched his inglorious withdrawal together ...,STUPEFY


In [None]:
df_train['spell'] = df_train['spell'].astype("category")
df_dev['spell'] = df_dev['spell'].astype("category")
df_dev['spell'] = df_dev['spell'].cat.set_categories(df_train['spell'].cat.categories)
df_test['spell'] = df_test['spell'].astype("category")
df_test['spell'] = df_test['spell'].cat.set_categories(df_train['spell'].cat.categories)

df_train['true'] = df_train['spell'].cat.codes.values.astype(str)
df_dev['true'] = df_dev['spell'].cat.codes.values.astype(str)
df_test['true'] = df_test['spell'].cat.codes.values.astype(str)

In [None]:
df_train.head(3)

Unnamed: 0,text,spell,true
0,were staring at her . she was up next to face ...,RIDDIKULUS,72
1,"that whole time . her first reaction , for whi...",RIDDIKULUS,72
2,we watched his inglorious withdrawal together ...,STUPEFY,80


In [None]:
with open(f'{drive_path}train.txt', 'w+') as outfile:
    for i in range(len(df_train)):
        outfile.write('__label__' + df_train.loc[i, 'true'] + ' ' + df_train.loc[i, 'text'] + '\n')

with open(f'{drive_path}dev.txt', 'w+') as outfile:
    for i in range(len(df_test)):
        outfile.write('__label__' + df_dev.loc[i, 'true'] + ' ' + df_dev.loc[i, 'text'] + '\n')

with open(f'{drive_path}test.txt', 'w+') as outfile:
    for i in range(len(df_test)):
        outfile.write('__label__' + df_test.loc[i, 'true'] + ' ' + df_test.loc[i, 'text'] + '\n')

### 1. [1 балл] Используйте fastText в качестве baseline-классификатора.


In [None]:
import fasttext

from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score, classification_report, confusion_matrix
)

In [None]:
# Обучение модели
model = fasttext.train_supervised(input=f'{drive_path}train.txt', lr=0.1, epoch=100, wordNgrams=2)

# Сохранение модели
model.save_model(f'{drive_path}fasttext_model.bin')

# Загрузка модели
model = fasttext.load_model(f'{drive_path}fasttext_model.bin')



In [None]:
def predict_spell(text):
    predictions = model.predict(text, k=1)
    return predictions[0][0].replace('__label__', '')

In [None]:
df_test['pred'] = df_test.text.apply(predict_spell)

In [None]:
print(f"Precision: {precision_score(df_test.true, df_test.pred, average='macro'):.04f}")
print(f"Recall: {recall_score(df_test.true, df_test.pred, average='macro'):.04f}")
print(f"F1-measure: {f1_score(df_test.true, df_test.pred, average='macro'):.04f}")
print(f"Accuracy: {accuracy_score(df_test.true, df_test.pred):.04f}")
print(classification_report(df_test.true, df_test.pred))

  _warn_prf(average, modifier, msg_start, len(result))


Precision: 0.1560
Recall: 0.1206
F1-measure: 0.1260
Accuracy: 0.3115


  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.30      0.33      0.31       516
           1       0.20      0.24      0.22        79
          10       0.00      0.00      0.00        17
          11       0.41      0.48      0.45       909
          12       0.00      0.00      0.00         6
          13       1.00      0.17      0.29         6
          14       0.00      0.00      0.00         8
          15       0.00      0.00      0.00         2
          16       0.00      0.00      0.00         1
          17       0.19      0.14      0.16        81
          18       0.00      0.00      0.00         4
          19       0.06      0.11      0.07        37
           2       0.43      0.40      0.41       164
          20       0.19      0.17      0.18        53
          21       0.00      0.00      0.00         5
          22       0.12      0.13      0.13        53
          23       0.53      0.50      0.52       221
          24       0.21    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


как видим на частых классах качество хорошое => будет полезно расширить выборку для редких классов

### 2. [2 балла] Используйте сверточные сети в качестве более продвинутого классификатора. Поэкспериментируйте с количеством и размерностью фильтров, используйте разные размеры окон, попробуйте использовать $k$-max pooling.


In [None]:
from tokenizers import Tokenizer, models, trainers
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torcheval.metrics.functional import multiclass_f1_score

from tqdm import tqdm

if torch.cuda.is_available():
  device = torch.device("cuda")
  print("cuda")
else:
  device = torch.device("cpu")
  print("cpu")
batch_size = 64

cuda


In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, output_dim, embedding_dim=300, n_filters=100, filter_sizes=[2, 4, 6], dropout=0.6):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv_0 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[0], embedding_dim))
        self.conv_1 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[1], embedding_dim))
        self.conv_2 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[2], embedding_dim))
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        #x = [batch size, sent len]
        embedded = self.embedding(x)

        #embedded = [batch size, sent len, emb dim]
        embedded = embedded.unsqueeze(1)

        #embedded = [batch size, 1, sent len, emb dim]
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))

        #conv_n = [batch size, n_filters, sent len - filter_sizes[n]]
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)

        #pooled_n = [batch size, n_filters]
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        #cat = [batch size, n_filters * len(filter_sizes)]
        return self.fc(cat)

In [None]:
X_train, y_train = df_train['text'], df_train['true'].astype(int)
X_test, y_test = df_test['text'], df_test['true'].astype(int)

In [None]:
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train_from_iterator(X_train, trainer=trainer)

In [None]:
X_train = tokenizer.encode_batch(X_train)
X_test = tokenizer.encode_batch(X_test)

X_train = list(map(lambda x: x.ids, X_train))
X_test = list(map(lambda x: x.ids, X_test))

max_len = max(max(len(seq) for seq in X_train), max(len(seq) for seq in X_test))
X_train = [text + [tokenizer.token_to_id("[PAD]")] * (max_len - len(text)) for text in X_train]
X_test = [text + [tokenizer.token_to_id("[PAD]")] * (max_len - len(text)) for text in X_test]

X_train = torch.tensor(X_train).to(device)
X_test = torch.tensor(X_test).to(device)

y_train = torch.tensor(y_train, dtype=torch.long).to(device)
y_test = torch.tensor(y_test, dtype=torch.long).to(device)

In [None]:
len(X_train), len(y_train)

(60980, 60980)

In [None]:
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
vocab_size = len(tokenizer.get_vocab())
output_dim = df_train.true.nunique()
cnn_model = CNN(vocab_size, output_dim)

optimizer = torch.optim.Adam(cnn_model.parameters(), lr=1e-3)
loss_ = torch.nn.CrossEntropyLoss().to(device)

In [None]:
def log_to_file(text, file_name='log.txt'):
    with open(file_name, 'a') as file:
      print(text, file=file)

In [None]:
def train_(model, dataloader, optimizer, loss_func):
    epoch_loss = 0
    epoch_acc = 0
    epoch_f1 = 0

    model.train()

    for batch in tqdm(dataloader):
        inputs, labels = batch
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = loss_func(outputs, labels)
        preds = torch.argmax(outputs, dim=1)

        acc = torch.sum(preds == labels) / len(labels)
        f1_macro = multiclass_f1_score(preds, labels, num_classes=output_dim, average='macro')


        loss.backward()
        optimizer.step()

        epoch_loss += loss
        epoch_acc += acc
        epoch_f1 += f1_macro

    return epoch_loss / len(dataloader), epoch_acc / len(dataloader), epoch_f1 / len(dataloader)

In [None]:
def eval_(model, dataloader, loss_func):
    epoch_loss = 0
    epoch_acc = 0
    epoch_f1 = 0

    model.eval()

    with torch.no_grad():
        for batch in dataloader:
            inputs, labels = batch
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = loss_func(outputs, labels)
            preds = torch.argmax(outputs, dim=1)

            acc = torch.sum(preds == labels) / len(labels)
            f1_macro = multiclass_f1_score(preds, labels, num_classes=output_dim, average='macro')

            epoch_loss += loss
            epoch_acc += acc

    return epoch_loss / len(dataloader), epoch_acc / len(dataloader), epoch_f1 / len(dataloader)

In [None]:
from IPython.display import clear_output

In [None]:
N_EPOCHS = 5
cnn_model.to(device)

for epoch in range(N_EPOCHS):
    train_loss, train_acc, train_f1 = train_(cnn_model, train_loader, optimizer, loss_)
    test_loss, test_acc, test_f1 = eval_(cnn_model, test_loader, loss_)

    log_to_file(f'Epoch: {epoch+1}' +
          f'\n    Train Loss: {train_loss:.3f}, Val Loss: {test_loss:.4f}' +
          f'\n    Train Acc: {train_acc:.4f}, Val Acc: {test_acc:.4f}' +
          f'\n    Train F1: {train_f1:.4f}, Val F1: {test_f1:.4f}')

    clear_output(wait=True)

print("Finish!!!")

Finish!!!


Данные в логе:

Epoch: 1

    Train Loss: 3.485, Val Loss: 3.1246

    Train Acc: 0.1583, Val Acc: 0.2321

    Train F1: 0.0488, Val F1: 0.0000

Epoch: 2

    Train Loss: 3.140, Val Loss: 2.9580

    Train Acc: 0.2266, Val Acc: 0.2834

    Train F1: 0.0925, Val F1: 0.0000

Epoch: 3

    Train Loss: 2.993, Val Loss: 2.8832

    Train Acc: 0.2589, Val Acc: 0.2938

    Train F1: 0.1182, Val F1: 0.0000

Epoch: 4

    Train Loss: 2.859, Val Loss: 2.8536

    Train Acc: 0.2853, Val Acc: 0.3025

    Train F1: 0.1380, Val F1: 0.0000

Epoch: 5

    Train Loss: 2.739, Val Loss: 2.8424

    Train Acc: 0.3075, Val Acc: 0.3045

    Train F1: 0.1559, Val F1: 0.0000

как видим из-за того что в тесте есть не все классы он не может нормально посчитать f1, во в целом смотря на остальные метрики можем сказать, что f1 такой же как на train - почти 0.156 хороший результат уже лучше, чем у fasttext (0.126)


### 3. [2 балла] Используйте рекуррентные сети в качестве альтернативного продвинутого классификатора. Поэкспериментируйте с количеством и размерностью слоев и другими гиперпараметрами.


In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, output_dim, hidden_dim=64, embedding_dim=300, n_layers=1, dropout=0):
        super(RNN, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)
        output, hidden = self.rnn(x)
        x = self.fc(output[:, -1, :])
        return x

In [None]:
rnn_model = RNN(vocab_size, output_dim)
N_EPOCHS = 10
rnn_model.to(device)

for epoch in range(N_EPOCHS):
    train_loss, train_acc, train_f1 = train_(rnn_model, train_loader, optimizer, loss_)
    valid_loss, valid_acc, test_f1 = eval_(rnn_model, test_loader, loss_)

    log_to_file(f'Epoch: {epoch+1}' +
          f'\n    Train Loss: {train_loss:.3f}, Val Loss: {test_loss:.4f}' +
          f'\n    Train Acc: {train_acc:.4f}, Val Acc: {test_acc:.4f}' +
          f'\n    Train F1: {train_f1:.4f}, Val F1: {test_f1:.4f}')

    clear_output(wait=True)

print("Finish!!!")

Finish!!!


Данные в логе:

Epoch: 1

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 2

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 3

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 4

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 5

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 6

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 7

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 8

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 9

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

Epoch: 10

    Train Loss: 4.567, Val Loss: 2.8424

    Train Acc: 0.0020, Val Acc: 0.3045

    Train F1: 0.0001, Val F1: 0.0000

что-то слишком слабо, попробуем поиграться с параметрами

In [None]:
rnn_model_2 = RNN(vocab_size, output_dim, embedding_dim=100, n_layers=1, dropout=0.5)
N_EPOCHS = 10
rnn_model_2.to(device)

for epoch in range(N_EPOCHS):
    train_loss, train_acc, train_f1 = train_(rnn_model_2, train_loader, optimizer, loss_)
    valid_loss, valid_acc, test_f1 = eval_(rnn_model_2, test_loader, loss_)

    log_to_file(f'Epoch: {epoch+1}' +
          f'\n    Train Loss: {train_loss:.3f}, Val Loss: {test_loss:.4f}' +
          f'\n    Train Acc: {train_acc:.4f}, Val Acc: {test_acc:.4f}' +
          f'\n    Train F1: {train_f1:.4f}, Val F1: {test_f1:.4f}')

    clear_output(wait=True)

print("Finish!!!")

Finish!!!


Данные в логе:

Epoch: 1

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 2

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0020, Val F1: 0.0000

Epoch: 3

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 4

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 5

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 6

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 7

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 8

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 9

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000

Epoch: 10

    Train Loss: 4.532, Val Loss: 2.8424

    Train Acc: 0.0272, Val Acc: 0.3045

    Train F1: 0.0019, Val F1: 0.0000
    

Как видим все еще очень слабо (

### 4. [1.5 балла] Попробуйте расширить обучающее множество за счет аугментации данных. Если вам понадобится словарь синонимов, можно использовать WordNet (ниже вы найдете примеры).


In [11]:
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import random

def generate_synonyms(word):
    synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            synonyms.append(lemma.name())
    return synonyms

def augment_data(text):
    tokens = word_tokenize(text)
    augmented_tokens = []
    for token in tokens:
        synonyms = generate_synonyms(token)
        if len(synonyms) > 0:
            augmented_tokens.append(random.choice(synonyms))
        else:
            augmented_tokens.append(token)
    augmented_text = ' '.join(augmented_tokens)
    return augmented_text.lower()

Частью задания не является использовать расширенное множество, поэтому вот просто пример расширения выборки для самых редких классов

In [None]:
df_train.spell.value_counts(ascending=True)[0:5]

PESKIPIKSI_PESTERNOMI     4
METEOLOJINX_RECANTO      10
PROTEGO_HORRIBILIS       15
CAVE_INIMICUM            16
DESCENDO                 17
Name: spell, dtype: int64

In [12]:
print(augment_data(df_train[df_train.spell == 'PESKIPIKSI_PESTERNOMI'].reset_index().loc[0,'text']))

. the print ( stab in every steering like rocket . several scoot straight done the window , shower the back row with demote glass . the roost go_on to wreck the classroom more_than efficaciously than deoxyadenosine_monophosphate rampage rhino . they seize ink bottle and spray the category with them , shred book and paper , shoot_down mental_picture from the wall , up-ended the squander hoop , grab travelling_bag and book and cast them out of the blotto window . hermione trilled her heart and merely left the classroom . `` come on now - round them upward , one_shot them upwardly , they 're lone pixie , `` lockhart yelled eastern_samoa she bequeath . helium roll up his sleeve , brandish his wand , and bellow , ``


In [13]:
print(augment_data(df_train[df_train.spell == 'METEOLOJINX_RECANTO'].reset_index().loc[0,'text']))

, neville say . `` oh , we already know that . we proficient indiana hogwarts express . `` , say put_on . `` okey . future practice tomorrow . so roast , estimable get some rest . `` , neville say . everyone travel back to their dorm . `` schoolmaster , fire you edward_thatch maine this spell ? `` , ask wear before go_away . helium lend out his turn record_book . `` meteolojinx_recanto ? `` , need neville . `` yes . possibly i 'll usance it sometime in the future . `` , don respond . `` this embody for fifth years . merely okeh . `` , answer neville . `` meteolojinx_recanto . `` , pronounce don . it be abortive . ``


In [14]:
print(augment_data(df_train[df_train.spell == 'PROTEGO_HORRIBILIS'].reset_index().loc[0,'text']))

and the dark overlord leave birth thrower ! then wholly of the half-bloods , and mudbloods and your tolerant , `` helium hesitate , vitamin_a sneer loop his lip . `` the filthy , wicked half-breed ! they 'll all constitute extend ! `` `` remus ! `` an nervous vocalise holler from behind lupin . atomic_number_2 turn and check nymphadora unravel up to him , a half-second earlier he earn his mistake . and that single moment follow his ruination . turn back he run_into in_that_location be two stripe of royal flame tear across the pitch : unmatched headed for himself , and the other point straight for nymphadora . without amp second 's indisposition , lupine take his verge improving erstwhile sir_thomas_more and bawl , ``


In [15]:
print(augment_data(df_train[df_train.spell == 'CAVE_INIMICUM'].reset_index().loc[0,'text']))

mightily go_up a big hayfield . the whiz above them winkle into the dark arsenic the full moon settle itself lazily into the night toss . the beam represent silent leave_off for the laboured pass_off of the terzetto son . `` buckeye_state , thank goodness ... `` colin breathe fall overly his articulatio_genus . justin let turn of lever who feature stock-still hunched o'er , rank taciturnly , and receive up , gimpiness all_over to angstrom_unit spot adenine few foot away and start to paseo about in axerophthol heavy dress_circle around the arena , flourish his wand over his point and murmur entirely the protection spell hermione farmer make list low for him via da -lrb- dumbledore 's army -rrb- galleon . `` . . . protego_totalum ...



[бонус] Используйте результат max pooling'а как эмбеддинг входного текста. Визуализируйте эмбеддинги 500-1000 предложений из обучающего множества и изучите свойства получившегося пространства.



[бонус] Используйте ваш любимый классификатор и любые (честные) способы повышения качества классификации и получите macro $F_1$ больше 0.5.



## Часть 4. [0.5 балла] Итоги
Напишите краткое резюме проделанной работы. Читали ли вы сами Гарри Поттера или фанфики о нем и помогло ли вам знание предметной области в выполнении домашнего задания?


В целом довольно интересная домашка, хотелось бы побольше заданий на реализацию различных подходов в предобработке данных. Гарри Поттера и фанфики о нем не читал => знаний в данной предметной области не имею. Было интересно повозиться с нейронками. А также получил невероятную приятность, когда на конце второго задания понял, что в df есть дубли! Nice texts!


Вывод по моделям:

Fasttext - nice!

CNN - nice!

RNN - not nice!


## Бонусная часть. [2 балла] Skip-Gram Negative Sampling
Самостоятельно реализовать и обучить модель Skip-Gram Negative Sampling. Продемонстрировать качество полученных представлений на конкретный примерах.

In [80]:
from nltk.tokenize import word_tokenize

df = pd.read_csv('dataframe.csv')
df['lower_text_no_punct'] = df.clean_text.str.lower().apply(remove_punct)
df['lower_tokenize'] = df['lower_text_no_punct'].parallel_apply(word_tokenize)

corpus = df['lower_tokenize'].to_list()

VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=4528), Label(value='0 / 4528'))), …

In [148]:
import scipy
from sklearn.metrics.pairwise import cosine_similarity
import math

def sigmoid(x):
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

# def sigmoid(x):
#   return 1 / (1 + np.exp(-x))

In [149]:
class SkipGramNegativeSampling:
    def __init__(self, window_size, embedding_dim, num_negative_samples):
        self.window_size = window_size  # размер окна
        self.embedding_dim = embedding_dim  # размерность векторов эмбеддингов
        self.num_negative_samples = num_negative_samples  # количество отрицательных сэмплов
        
        self.word2id = {}   # словарь соответствия слов к их id
        self.id2word = {}   # словарь соответствия id к словам
        self.word_freq = {}  # словарь частоты слов
        
        self.word_vectors = None  # матрица векторов эмбеддингов
    
    def fit(self, corpus):
        # Обработка корпуса
        self.build_vocabulary(corpus)
        
        # Инициализация матрицы векторов эмбеддингов
        self.word_vectors = np.random.randn(len(self.word2id), self.embedding_dim)
        
        # Обучение модели
        for sentence in tqdm(corpus):
            for i, word in enumerate(sentence):
                context_words = self.get_context_words(sentence, i)
               
                if len(context_words) == 0:
                    continue
                    
                target_word = self.word2id[word]
                
                for context_word in context_words:
                    context_word_id = self.word2id[context_word]
                    
                    # Обновление векторов
                    self.update_vectors(target_word, context_word_id, 1)
                    
                    # Negative Sampling
                    negative_samples = self.get_negative_samples(target_word)
                    
                    for negative_sample in negative_samples:
                        negative_word_id = self.word2id[negative_sample]
                        
                        # Обновление векторов для отрицательных сэмплов
                        self.update_vectors(target_word, negative_word_id, 0)
    
    def build_vocabulary(self, corpus):
        idx = 0
        for sentence in corpus:
            for word in sentence:
                if word not in self.word2id:
                    self.word2id[word] = idx
                    self.id2word[idx] = word
                    self.word_freq[word] = 1
                    idx += 1
                else:
                    self.word_freq[word] += 1
    
    def get_context_words(self, sentence, target_word_idx):
        context_words = []
        start = max(0, target_word_idx - self.window_size)
        end = min(len(sentence) - 1, target_word_idx + self.window_size)
        
        for i in range(start, end+1):
            if i != target_word_idx:
                context_words.append(sentence[i])
        
        return context_words
    
    def update_vectors(self, target_word_id, context_word_id, label):
        target_vector = self.word_vectors[target_word_id]
        context_vector = self.word_vectors[context_word_id]
        
        error = label - sigmoid(np.dot(target_vector, context_vector.T))
        
        target_grad = error * context_vector
        context_grad = error * target_vector
        
        self.word_vectors[target_word_id] += target_grad
        self.word_vectors[context_word_id] += context_grad
    
    def get_negative_samples(self, target_word_id):
        negative_samples = []
        
        while len(negative_samples) < self.num_negative_samples:
            sample = random.choice(list(self.word_freq.keys()))
            if sample != self.id2word[target_word_id]:
                negative_samples.append(sample)
        
        return negative_samples
    
    def get_word_vector(self, word):
        return self.word_vectors[self.word2id[word]]

In [159]:
sgns = SkipGramNegativeSampling(
    window_size=5,
    embedding_dim=100, 
    num_negative_samples=100
    )

sgns.fit(corpus[:3]) # взял малую часть корпуса для примера

  np.exp(x) / (1 + np.exp(x)))
  np.exp(x) / (1 + np.exp(x)))
  1 / (1 + np.exp(-x)),
100%|██████████| 3/3 [00:21<00:00,  7.08s/it]


In [175]:
word1, word2 = ('man', 'boy')

In [173]:
vec1 = sgns.get_word_vector(word1)
vec2 = sgns.get_word_vector(word2)
                
similarity_score = cosine_similarity([vec1], [vec2])[0][0]
print(f"{similarity_score:.04f}")

0.2633


Видимо подобрали не лучшие параметры 

In [165]:
sgns_ = SkipGramNegativeSampling(
    window_size=4,
    embedding_dim=300, 
    num_negative_samples=50
    )

sgns_.fit(corpus[:100])

  np.exp(x) / (1 + np.exp(x)))
  np.exp(x) / (1 + np.exp(x)))
  1 / (1 + np.exp(-x)),
100%|██████████| 100/100 [14:59:16<00:00, 539.57s/it]   


In [181]:
vec1_ = sgns_.get_word_vector(word1)
vec2_ = sgns_.get_word_vector(word2)
                
similarity_score_ = cosine_similarity([vec1], [vec2])[0][0]
print(f"{similarity_score_:.04f}")

0.8726
