#### Homework 2
## Named Entity Recognition and Event Extraction from Literary Fiction

## GROUP: Гончаров Михаил, Павлова Арина, Разумовский Дмитрий

In this homework you will work with the LitBank corpus. The corpus is built from popular English-language fiction and is annotated with named entities and events. It contains 100 texts of roughly 2,000 words each.

The corpus is described in the following papers:
* David Bamman, Sejal Popat, Sheng Shen, An Annotated Dataset of Literary Entities, http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/naacl2019_literary_entities.pdf
* Matthew Sims, Jong Ho Park, David Bamman, Literary Event Detection, http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/acl2019_literary_events.pdf

The corpus is available in the project repository: https://github.com/dbamman/litbank

Paper and code used for named entity extraction:
* Meizhi Ju, Makoto Miwa and Sophia Ananiadou, A Neural Layered Model for Nested Named Entity Recognition, https://github.com/meizhiju/layered-bilstm-crf

The corpus is organized as follows.
First level:
* entities -- entity annotations
* events -- event annotations


The corpus uses 6 types of named entities: PER, LOC, ORG, FAC, GPE, VEH (people, locations, organizations, facilities, geo-political entities, vehicles); nested entities are allowed.

An event is expressed by a single word, the *trigger*, which can be a verb, an adjective, or a noun. The corpus annotates only events that actually happen and are not hypothetical.
Example: she *walked* rapidly and resolutely; here *walked* is the event trigger. Event types are not specified.



Second level:
* brat -- working files of the brat annotation tool; .ann files contain the annotations, .txt files contain the raw texts
* tsv -- .tsv files contain the annotations in IOB format
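
As a rough illustration of the IOB format mentioned above, the sketch below groups a tag sequence into entity spans. The function name `iob_to_spans` and the toy sentence are assumptions made for this example; they are not part of LitBank.

```python
from typing import List, Tuple

def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB tags into (start, end_exclusive, type) spans."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        inside = tag.startswith('I-') and ent_type == tag[2:]
        if not inside:
            # Any tag other than a matching I- closes the open span.
            if ent_type is not None:
                spans.append((start, i, ent_type))
            if tag == 'O':
                start, ent_type = None, None
            else:
                # B-X opens a span; a stray I-X is treated like B-X.
                start, ent_type = i, tag[2:]
    if ent_type is not None:
        spans.append((start, len(tags), ent_type))
    return spans

# Toy sentence, invented for illustration only.
tokens = ['Anthony', 'Patch', 'lived', 'in', 'New', 'York']
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-GPE', 'I-GPE']
spans = iob_to_spans(tags)
print([(' '.join(tokens[s:e]), t) for s, e, t in spans])
```

Working with spans rather than individual tags makes it easier to count entities or to compare predictions with the gold annotation.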


The papers and the repository contain ideas that will help you complete the homework. Treat them as guidance; do not copy or reuse them. Do not use the pretrained models; the code for training them may be used as a hint.

## RULES

1. The homework may be completed in groups of up to 3 people.
2. The homework is submitted via anytask.
3. The homework is written up as a report, either a .pdf file or an IPython notebook.
4. The report must contain: the numbers of the tasks and items you completed, the solution code, and a clear step-by-step description of what you did. The report must be written in an academic style, without excessive slang and in accordance with the norms of the Russian language.
5. Do not copy fragments of lectures, papers, or Wikipedia into your report.
6. Reports that consist solely of code will not be reviewed and will automatically receive a grade of zero.
7. Plagiarism and any improper citation result in a grade of zero.

## Part 1. [2 points] Exploratory analysis

In [None]:
!git clone https://github.com/dbamman/litbank.git

Cloning into 'litbank'...
remote: Enumerating objects: 1187, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 1187 (delta 26), reused 22 (delta 22), pack-reused 1149
Receiving objects: 100% (1187/1187), 40.65 MiB | 17.43 MiB/s, done.
Resolving deltas: 100% (152/152), done.
Updating files: 100% (1423/1423), done.


In [None]:
import os

In [None]:
root = '/content/litbank/entities/brat'
files = []

for dirpath, dirnames, filenames in os.walk(root):
  for filename in filenames:
    if filename.endswith('ann'):
      files.append(filename)

files = sorted(files)

In [None]:
from tqdm import tqdm
import pandas as pd
from collections import defaultdict, Counter

1. Find the top 10 most frequent named entities for each of the 6 types.

In [None]:
ent = defaultdict(Counter)
for file in tqdm(files):
    data = pd.read_csv(os.path.join(root, file), sep='\t', header = None, quoting=3)
    data[1] = data[1].apply(lambda x: x.split(' ')[0])
    for i, row in data.iterrows():
        ent[row[1]][row[2]] += 1
data

100%|██████████| 100/100 [00:01<00:00, 90.89it/s]


Unnamed: 0,0,1,2
0,T1,PER,Anthony Patch
1,T3,PER,intelligent men
2,T7,PER,a man
3,T8,PER,the crusaders
4,T9,PER,Anthony
...,...,...,...
166,T162,LOC,the world
167,T168,PER,the world
168,T169,PER,any one who interrupted him at play
169,T170,PER,every one who came into his bedroom


In [None]:
for ent_type, counter in ent.items():
    print(ent_type)
    print(*counter.most_common(10), sep='\n')
    print('\n')

FAC
('home', 65)
('the house', 52)
('here', 39)
('there', 39)
('the room', 34)
('the garden', 23)
('the street', 14)
('the hall', 13)
('the road', 13)
('the place', 12)


LOC
('the world', 72)
('the sea', 27)
('the river', 22)
('the country', 20)
('there', 18)
('the earth', 16)
('sea', 16)
('the valley', 13)
('this world', 12)
('the woods', 9)


GPE
('London', 40)
('England', 32)
('there', 21)
('the town', 21)
('New York', 16)
('town', 14)
('France', 14)
('Europe', 12)
('the country', 10)
('Rome', 10)


VEH
('the ship', 11)
('the car', 9)
('the train', 6)
('the boat', 4)
('boats', 4)
('the carriage', 3)
('ships', 3)
('a carriage', 3)
('the waggon', 3)
('the coach', 3)


PER
('Mr.', 148)
('Miss', 133)
('Mrs.', 132)
('sir', 50)
('Sir', 45)
('men', 40)
('my mother', 40)
('Cameron', 38)
('his wife', 37)
('Mr', 37)


ORG
('the army', 7)
('the Church', 4)
('the Committee of Public Safety', 4)
('the Colonial Office', 4)
('college', 3)
('Harvard', 3)
('Carston , Waite and Co.', 2)
('the hospit

2. Find the top 10 most frequent event triggers.

In [None]:
root = '/content/litbank/events/brat'
files = []

for dirpath, dirnames, filenames in os.walk(root):
  for filename in filenames:
    if filename.endswith('ann'):
      files.append(filename)
files = sorted(files)
events = Counter()

In [None]:
for file in tqdm(files):
    try:
        data = pd.read_csv(os.path.join(root, file), sep='\t', header = None, quoting=3)
    except pd.errors.EmptyDataError:
        continue
    data[1] = data[1].apply(lambda x: x.split(' ')[0])
    for i, row in data.iterrows():
        events[row[2]] += 1

100%|██████████| 100/100 [00:00<00:00, 149.43it/s]


In [None]:
print(*events.most_common(10), sep='\n')

('said', 464)
('came', 95)
('looked', 92)
('went', 92)
('asked', 69)
('heard', 63)
('saw', 59)
('cried', 59)
('took', 56)
('turned', 55)
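
The counts above treat surface forms that differ only in capitalisation (for example `married` and `Married`) as separate triggers. The sketch below shows one possible way to merge such forms; `merge_case` and the toy counts are assumptions made for this example, not part of the original solution.

```python
from collections import Counter

def merge_case(counts: Counter) -> Counter:
    """Sum the counts of keys that differ only in letter case."""
    merged = Counter()
    for word, n in counts.items():
        merged[word.lower()] += n
    return merged

# Toy counts, invented for illustration only.
toy = Counter({'said': 5, 'Said': 2, 'came': 3})
print(merge_case(toy).most_common(2))
```

In the notebook this could be applied as `merge_case(events)` before calling `most_common`.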


3. Cluster all unique event triggers using word embeddings and any clustering algorithm (for example, agglomerative hierarchical clustering) and try to interpret the clusters: are there any obvious event types?

In [None]:
root = '/content/litbank/events/tsv'
files = sorted(list(os.walk(root))[0][-1])

In [None]:
from string import punctuation
import csv
import random

def train_dev_test_split(data, train_percent = 80, dev_percent = 10, test_percent = 10):
  random.shuffle(data)
  train_size = int(len(data) * train_percent / 100)
  train_data = data[:train_size]
  dev_size = int(len(data) * dev_percent / 100)
  dev_data = data[train_size:train_size+dev_size]
  test_size = int(len(data) * test_percent / 100)
  test_data = data[train_size+dev_size:]
  return train_data, dev_data, test_data

def prepare_sents(root, files):
    sents, labels = [], []
    for f in tqdm(files):
        path = os.path.join(root, f)
        df = pd.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE, header=None)
        words = list(df[0])
        # Per-file tags get their own name so that they do not overwrite
        # the `labels` accumulator returned at the end of the function.
        tags = list(df[1])

        sentences = [[]]
        labels_sent = [[]]

        for i in range(len(words)):
            if words[i] not in ['.', '!', '?', '...', '', ',']:
                if words[i] in punctuation:
                    continue
                sentences[-1].append(words[i].lower())
                labels_sent[-1].append(tags[i])
            elif sentences[-1] != []:
                sentences.append([])
                labels_sent.append([])

        if sentences[-1] == []:
            sentences = sentences[:-1]
            labels_sent = labels_sent[:-1]
        sents += sentences
        labels += labels_sent

    return sents, labels

In [None]:
sentences, labels_sentences = prepare_sents(root, files)

100%|██████████| 100/100 [00:00<00:00, 102.98it/s]


In [None]:
train_data, dev_data, test_data = train_dev_test_split(files, 80, 10, 10)

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199769 sha256=895710abf47d2db207918edc3c98351d2c24894077e68022c5fb38ae4a2791b8
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [None]:
with open('train_text.txt', 'w') as f:
    for s in sentences:
        f.write(' '.join(s))
        f.write('\n')

In [None]:
import fasttext
ft = fasttext.train_unsupervised('train_text.txt')

In [None]:
import numpy as np

trigger_embeddings = []
triggers = []

for trigger in events.keys():
    triggers.append(trigger)
    trigger_embeddings.append(ft.get_word_vector(trigger.lower()))

trigger_embeddings = np.array(trigger_embeddings)

In [None]:
from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(n_clusters = 3)
clustering.fit(trigger_embeddings)


In [None]:
clusters = {}
for word, cluster in zip(triggers, clustering.labels_):
    if cluster not in clusters:
        clusters[cluster] = [word]
    else:
        clusters[cluster].append(word)

for cluster, words in clusters.items():
    print(f"Cluster {cluster}: {words}")

Cluster 1: ['Smoke', 'pinching', 'addressed', 'contemplation', 'engaged', 'mentioned', 'come', 'keeps', 'observed', 'correcting', 'acquired', 'infection', 'see', 'pretence', 'temper', 'born', 'married', 'died', 'issue', 'Married', 'lost', 'settled', 'death', 'introduction', 'excursions', 'seen', 'expected', 'retired', 'brought', 'resolved', 'precautions', 'provided', 'entertained', 'ball', 'held', 'illumined', 'appearances', 'projected', 'revel', 'glare', 'glitter', 'fancies', 'writhed', 'causing', 'voice', 'echoes', 'chime', 'die', 'laughter', 'depart', 'swells', 'appals', 'peal', 'indulged', 'reaches', 'beat', 'ceased', 'music', 'quieted', 'cessation', 'rumour', 'presence', 'murmur', 'disapprobation', 'surprise', 'terror', 'horror', 'disgust', 'made', 'disconcert', 'meditation', 'told', 'pause', 'performance', 'harken', 'constrained', 'aware', 'nod', 'inclined', 'came', 'met', 'said', 'reply', 'hesitation', 'discovery', 'give', 'concluded', 'explained', 'asked', 'doubts', 'gone', 'ap

[bonus] Visualize the resulting clusters with t-SNE or UMAP

In [None]:
from sklearn.manifold import TSNE

In [None]:
tsne = TSNE(n_components=2, random_state=42)
embed_tsne = tsne.fit_transform(trigger_embeddings)

In [None]:
import plotly.express as px

In [None]:
fig = px.scatter(x=embed_tsne[:,0], y=embed_tsne[:,1], text=triggers, color=clustering.labels_)
fig.update_traces(textfont_size=2)
fig.show()

Apparently, the red cluster contains emotions, the yellow one physical actions, and the blue one states.
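
One way to support an interpretation like this is to list, for each cluster, the words closest to the cluster centroid. The sketch below does this with cosine similarity in plain Python; `central_words`, `cosine`, and the toy 2-D vectors are assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def central_words(words, vectors, labels, k: int = 5) -> Dict[int, List[str]]:
    """For each cluster, return the k words closest to the cluster centroid."""
    members = defaultdict(list)
    for word, vec, label in zip(words, vectors, labels):
        members[int(label)].append((word, list(vec)))
    result = {}
    for label, items in members.items():
        dim = len(items[0][1])
        # Centroid = coordinate-wise mean of the cluster's vectors.
        centroid = [sum(vec[d] for _, vec in items) / len(items) for d in range(dim)]
        ranked = sorted(items, key=lambda item: cosine(item[1], centroid), reverse=True)
        result[label] = [word for word, _ in ranked[:k]]
    return result

# Toy 2-D vectors, invented for illustration only.
words = ['walked', 'ran', 'came', 'wept', 'laughed', 'smiled']
vectors = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [0.0, 1.0], [0.2, 1.0], [0.4, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
print(central_words(words, vectors, labels, k=1))
```

In the notebook the same function could be called as `central_words(triggers, trigger_embeddings, clustering.labels_)`.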

## Part 2. [3 points] Named entity extraction
1. Train a standard named entity extraction model, CNN-BiLSTM-CRF, to extract *low-level named entities*, i.e. the shortest of the nested entities.
The model consists of: a character-level convolutional network + word embeddings + a bidirectional LSTM (sequence model) + a CRF (global normalization).
2. Replace the character- and word-level part of the model (CNN + word embeddings) with ELMo and/or BERT. The result should be an ELMo / BERT + BiLSTM + CRF model.
3. Replace the sequence model (BiLSTM) with another layer, for example a Transformer. The result should be a CNN + Transformer + CRF model.
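
Item 1 asks for the shortest of the nested entities. As a minimal sketch of what that selection means, the function below keeps only spans that do not contain a strictly shorter span. The name `innermost` and the toy spans are assumptions made for this example; how nesting levels are actually laid out in the LitBank files should be checked against the repository.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, type)

def innermost(spans: List[Span]) -> List[Span]:
    """Keep only spans that do not contain a strictly shorter span."""
    kept = []
    for s in spans:
        contains_shorter = any(
            s[0] <= o[0] and o[1] <= s[1] and (o[1] - o[0]) < (s[1] - s[0])
            for o in spans
        )
        if not contains_shorter:
            kept.append(s)
    return kept

# "the mayor of London": a PER span that contains a GPE span (toy example).
spans = [(0, 4, 'PER'), (3, 4, 'GPE'), (6, 7, 'LOC')]
print(innermost(spans))
```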

[bonus] Fine-tune BERT for named entity extraction.

[bonus] Use the model for nested named entity extraction [Ju et al., 2018]

[bonus] Modify the model for nested named entity extraction [Ju et al., 2018]: use ELMo and/or BERT instead of word embeddings.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
from typing import List
from typing import Union, Tuple, Dict
import numpy as np
import torch
from sklearn.metrics import f1_score, classification_report

In [None]:
!git clone https://github.com/dbamman/litbank.git

fatal: destination path 'litbank' already exists and is not an empty directory.


In [None]:
import os

In [None]:
root = '/content/litbank/entities/tsv'
files = sorted(list(os.walk(root))[0][-1])

In [None]:
import random
def train_dev_test_split(data, train_percent = 80, dev_percent = 10, test_percent = 10):
  random.shuffle(data)
  train_size = int(len(data) * train_percent / 100)
  train_data = data[:train_size]
  dev_size = int(len(data) * dev_percent / 100)
  dev_data = data[train_size:train_size+dev_size]
  test_size = int(len(data) * test_percent / 100)
  test_data = data[train_size+dev_size:]
  return train_data, dev_data, test_data
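
The function above shuffles its argument in place and without a fixed seed, so the split changes from run to run. The sketch below is one possible reproducible variant; the name `split_files` and the `seed` parameter are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple

def split_files(files: Sequence[str], train_percent: int = 80, dev_percent: int = 10,
                seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle a copy of `files` with a fixed seed and cut it into three parts."""
    shuffled = list(files)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state is not affected
    train_size = len(shuffled) * train_percent // 100
    dev_size = len(shuffled) * dev_percent // 100
    train = shuffled[:train_size]
    dev = shuffled[train_size:train_size + dev_size]
    test = shuffled[train_size + dev_size:]  # everything that remains
    return train, dev, test

# Hypothetical file names, used only to show the resulting sizes.
names = [f'book_{i}.tsv' for i in range(100)]
train, dev, test = split_files(names)
print(len(train), len(dev), len(test))
```

Splitting by file rather than by sentence keeps all sentences of one book in the same part, which is what the original function also does.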

In [None]:
import os
import pandas as pd
from tqdm import tqdm
from string import punctuation

def prepare_sents(root, files, verbose=True):
    sents, labels = [], []
    for f in tqdm(files, disable=not verbose):
        path = os.path.join(root, f)
        with open(path) as file:
            cont = file.readlines()
        sent_temp = []
        labels_temp = []
        for s in cont:
            if s == '\n':
                if sent_temp:
                    sents.append(sent_temp)
                    labels.append(labels_temp)
                sent_temp = []
                labels_temp = []
            else:
                s = s.split('\t')
                sent_temp.append(s[0])
                labels_temp.append(s[1])
        if sent_temp:
            sents.append(sent_temp)
            labels.append(labels_temp)
    return sents, labels

In [None]:
root = '/content/litbank/entities/tsv'
files = sorted(list(os.walk(root))[0][-1])
train_data, dev_data, test_data = train_dev_test_split(files, 80, 10, 10)
train_books = [os.path.join(root, file) for file in train_data]
dev_books = [os.path.join(root, file) for file in dev_data]
test_books = [os.path.join(root, file) for file in test_data]

train_sents, train_labels = prepare_sents(root, train_books)
dev_sents, dev_labels = prepare_sents(root, dev_books)
test_sents, test_labels = prepare_sents(root, test_books)

df_train = pd.DataFrame(zip(train_sents, train_labels), columns=['sentence', 'labels'])
df_dev = pd.DataFrame(zip(dev_sents, dev_labels), columns=['sentence', 'labels'])
df_test = pd.DataFrame(zip(test_sents, test_labels), columns=['sentence', 'labels'])

100%|██████████| 80/80 [00:00<00:00, 179.11it/s]
100%|██████████| 10/10 [00:00<00:00, 521.41it/s]
100%|██████████| 10/10 [00:00<00:00, 454.03it/s]


In [None]:
df_train['chars'] = df_train['sentence'].apply(lambda x: [list(y) for y in x])
df_dev['chars'] = df_dev['sentence'].apply(lambda x: [list(y) for y in x])
df_test['chars'] = df_test['sentence'].apply(lambda x: [list(y) for y in x])

In [None]:
def build_token_dict(tokens: List, special_tokens: List) -> Tuple[Dict]:
    """
    Build a dictionary for tokens.

    Args:
    - tokens: A list of lists of tokens.
    - special_tokens: A list of special tokens.

    Returns:
    - token2idx: A dictionary mapping tokens to indices.
    - idx2token: A list of tokens sorted by indices.
    """
    token2idx = defaultdict(lambda: 0)
    idx2token = []

    for index, stoken in enumerate(special_tokens):
        token2idx[stoken] = index

    unique_tokens = set()
    for item in tokens:
        for token in item:
            if token not in special_tokens:
                unique_tokens.add(token)

    for index, utoken in enumerate(unique_tokens, len(special_tokens)):
        token2idx[utoken] = index

    sorted_dict = sorted(token2idx.items(), key=lambda x: x[1])

    for elem in sorted_dict:
        idx2token.append(elem[0])

    return token2idx, idx2token


def create_embedding_matrix(word_vecs, token2idx: Dict, emb_size: int = 300, special_ids: Tuple[int] = (0, 1)) -> np.ndarray:
    """
    Create an embedding matrix from word vectors.

    Args:
    - word_vecs: Word vectors.
    - token2idx: Token to index mapping.
    - emb_size: Size of the embeddings.
    - special_ids: Special token ids.

    Returns:
    - emb_matrix: Embeddings matrix.
    """
    emb_matrix = np.zeros((len(token2idx), emb_size), dtype="float32")

    emb_matrix[special_ids[0]] = np.zeros(emb_size, dtype='float32')
    emb_matrix[special_ids[1]] = np.random.uniform(-0.25, 0.25, emb_size)

    for token, id in token2idx.items():
        if id not in special_ids:
            if token in word_vecs:
                emb_matrix[id] = word_vecs[token]
            else:
                emb_matrix[id] = np.random.uniform(-0.25, 0.25, emb_size)
    return emb_matrix
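
Because `token2idx` above is a `defaultdict(lambda: 0)`, looking up an unseen token with `token2idx[token]` returns 0 (the `<PAD>` index) and silently adds the token to the dictionary. The sketch below shows an alternative with a plain `dict`, a deterministic token order, and an explicit `<UNK>` fallback; `build_vocab` and `encode` are names assumed for this example.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

def build_vocab(sequences: Iterable[Sequence[str]],
                special_tokens: Sequence[str] = ('<PAD>', '<UNK>')
                ) -> Tuple[Dict[str, int], List[str]]:
    """Give ids to the special tokens first, then to the other tokens in sorted order."""
    unique = {tok for seq in sequences for tok in seq} - set(special_tokens)
    idx2token = list(special_tokens) + sorted(unique)
    token2idx = {tok: i for i, tok in enumerate(idx2token)}
    return token2idx, idx2token

def encode(tokens: Sequence[str], token2idx: Dict[str, int], unk: str = '<UNK>') -> List[int]:
    """Map tokens to ids; unseen tokens get the <UNK> id and are not added to the vocabulary."""
    return [token2idx.get(tok, token2idx[unk]) for tok in tokens]

# Toy sentences, invented for illustration only.
token2idx, idx2token = build_vocab([['the', 'sea'], ['the', 'ship']])
print(idx2token)
print(encode(['the', 'river'], token2idx))
```

Sorting the tokens makes the index assignment identical across runs, which helps when a saved model has to be reloaded with the same vocabulary.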

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [None]:
def label_prediction(model, iterator, TAG_PAD_IDX=0, exclude_pad=True, device='cpu'):

    model.eval()

    tokens, labels, pred_labels = [], [], []
    with torch.no_grad():
        for batch in iterator:
            predictions, _ = model(batch['tokens'].to(device), batch['chars'].to(device), batch['mask'].to(device), batch['labels'].to(device))

            for sent_texts, sent_tags_true, preds in zip(batch['tokens'], batch['labels'], predictions):
                sent_tags_true = sent_tags_true.cpu().numpy()
                sent_texts = sent_texts.cpu().numpy()
                preds = preds.cpu().numpy()
                if exclude_pad:
                    args = np.where(sent_texts != TAG_PAD_IDX)[0]

                    tokens.append(list(sent_texts[args]))
                    labels.append(list(sent_tags_true[args]))
                    pred_labels.append(list(preds[args]))

    return tokens, labels, pred_labels

In [None]:
def trainer(model, train_dataloader, valid_dataloader, optimizer, idx2tag, epochs=5, pad_idx=0, clip=1):

    best_acc = 0
    best_loss = 1000

    for epoch_num in range(epochs):

        classes = []
        predicted_classes = []
        epoch_acc_train = 0
        epoch_loss_train = 0

        model.train()

        for batch in tqdm(train_dataloader):
            optimizer.zero_grad()
            predictions, loss = model(batch['tokens'].to(device), batch['chars'].to(device), batch['mask'].to(device), batch['labels'].to(device))

            train_label = batch['labels'].to(device).view(-1).cpu()
            predictions = predictions.view(-1).cpu()
            texts = batch['tokens'].to(device).view(-1).cpu()

            true_pred = ((train_label == predictions) & (texts != pad_idx)).sum()
            num_pred = (texts != pad_idx).sum()
            acc = true_pred / num_pred
            epoch_acc_train += acc

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
            optimizer.step()
            epoch_loss_train += loss.item()

        train_accuracy = epoch_acc_train / len(train_dataloader)
        train_loss = epoch_loss_train / len(train_dataloader)

        model.eval()

        epoch_acc_valid = 0
        epoch_loss_valid = 0

        with torch.no_grad():  # no gradients are needed during validation
            for batch in valid_dataloader:
                predictions, loss = model(batch['tokens'].to(device), batch['chars'].to(device), batch['mask'].to(device), batch['labels'].to(device))

                epoch_loss_valid += loss.item()

                valid_label = batch['labels'].to(device).view(-1).cpu()
                predictions = predictions.view(-1).cpu()
                texts = batch['tokens'].to(device).view(-1).cpu()

                # Accuracy over real tokens only; padded positions are excluded.
                true_pred = ((valid_label == predictions) & (texts != pad_idx)).sum()
                num_pred = (texts != pad_idx).sum()
                acc = true_pred / num_pred
                epoch_acc_valid += acc

                predicted_classes.extend(predictions.cpu().numpy())
                classes.extend(valid_label.cpu().numpy())

        valid_accuracy = epoch_acc_valid / len(valid_dataloader)
        valid_loss = epoch_loss_valid / len(valid_dataloader)

        print(
            f'Epochs: {epoch_num + 1} | Accuracy: {valid_accuracy: .3f}')

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class CustomDataset(Dataset):

  def __init__(self,
               df: pd.DataFrame,
               token2id: Dict,
               label2id: Dict,
               char2id: Dict,
               text_col: str='context_processed',
               char_col: str='characters',
               label_col: str ='spell',
               pad_to: int = 124,
               pad_to_char: int = 30,
               pad_value: int = 0,
               unk_value: int = 1,
               pad_left: bool = False,
               pad_left_char: bool = False,
               is_char=True
               ) -> None:

    self.tokens = df[text_col].tolist()
    self.chars = df[char_col].tolist()
    self.labels = df[label_col].tolist()

    self.label2id = label2id
    self.token2id = token2id
    self.char2id = char2id

    self.pad_to = pad_to
    self.pad_to_char = pad_to_char
    self.pad_left_char = pad_left_char
    self.pad_value = pad_value
    self.pad_left = pad_left
    self.unk_value = unk_value
    self.is_char = is_char

  def __getitem__(self, ix: int) -> Dict:

    """
    :param ix: index of object
    :return object prepared for training
    """
    if self.is_char:
      tokens, chars, tags = self.tokens[ix], self.chars[ix], self.labels[ix]
      tokens, chars, tags = self.transform(data=tokens, kind='token'), self.transform(data=chars, kind='char', is_char=True), self.transform(data=tags, kind='tag')
      mask = [0 if token == self.pad_value else 1 for token in tokens]
      mask = torch.tensor(mask, dtype=torch.bool)
      res = {'tokens': tokens, 'chars': chars, 'labels': tags, 'mask': mask}
    else:

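      tokens, tags = self.tokens[ix], self.labels[ix]
      # Tags must be mapped with the label dictionary, not the token dictionary.
      tokens, tags = self.transform(data=tokens, kind='token'), self.transform(data=tags, kind='tag')
      res = {'tokens': tokens, 'labels': tags}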
    return res

  def __len__(self):
    return len(self.tokens)


  def transform(self,  data: List, kind: str = 'token', is_char: bool = False):

    """
    add padding to data
    :param data: data to process
    :param is_char: whether add chars
    :return tensor prepared for training
    """
    if kind == 'token':
      mapping = self.token2id
    elif kind == 'char':
      mapping = self.char2id
    else:
      mapping = self.label2id

    if not is_char:
      # Map every item to its id; unknown items fall back to unk_value.
      # This also covers sequences whose length already equals pad_to.
      ids = [mapping[i] if i in mapping else self.unk_value for i in data]
      if self.pad_to is not None:
        ids = ids[:self.pad_to]
        n_pads = self.pad_to - len(ids)
        # Tokens are padded with pad_value, tags with the id of 'O'.
        pad_id = self.pad_value if kind == 'token' else mapping['O']
        ids = [pad_id]*n_pads*self.pad_left + ids + [pad_id]*n_pads*(not self.pad_left)

      return torch.tensor(ids, dtype=torch.int)

    else:
      chars_embeddings = []
      for word in data:
        if self.pad_to_char is not None and len(word) != self.pad_to_char:
          if len(word) > self.pad_to_char:
            word = [mapping[i] if i in mapping else self.unk_value for i in word[:self.pad_to_char]]
          else:
            n_pads = self.pad_to_char - len(word)
            word = [self.pad_value]*n_pads*self.pad_left_char + [mapping[i] if i in mapping else self.unk_value for i in word] + [self.pad_value]*n_pads*(not self.pad_left_char)
        else:
          word = [mapping[i] if i in mapping else self.unk_value for i in word]
        chars_embeddings.append(word)
      if self.pad_to is not None and len(chars_embeddings) != self.pad_to:
        if len(chars_embeddings) > self.pad_to:
          chars_embeddings = chars_embeddings[:self.pad_to]
        else:
          n_pads = self.pad_to - len(chars_embeddings)
          chars_embeddings = [[self.pad_value]*self.pad_to_char]*n_pads*self.pad_left + chars_embeddings + [[self.pad_value]*self.pad_to_char]*n_pads*(not self.pad_left)

      try:
        return torch.tensor(chars_embeddings, dtype=torch.int)
      except:
        print(chars_embeddings)

In [None]:
from collections import Counter, defaultdict

In [None]:
token2idx, idx2token = build_token_dict(df_train['sentence'].tolist() + df_dev['sentence'].tolist(), [ '<PAD>', '<UNK>'])
tag2idx, idx2tag = build_token_dict(df_train["labels"], [])
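# The character vocabulary is built from the characters of each sentence, not from whole words.
char2idx, idx2char = build_token_dict([[ch for word in sent for ch in word] for sent in df_train['sentence'].tolist() + df_dev['sentence'].tolist()], [ '<PAD>', '<UNK>'])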

In [None]:
dataset_train = CustomDataset(df=df_train, token2id=token2idx, label2id=tag2idx, char2id=char2idx, text_col='sentence', label_col='labels', char_col='chars')
dataset_dev = CustomDataset(df=df_dev, token2id=token2idx, label2id=tag2idx, char2id=char2idx, text_col='sentence', label_col='labels', char_col='chars')
dataset_test = CustomDataset(df=df_test, token2id=token2idx, label2id=tag2idx, char2id=char2idx, text_col='sentence', label_col='labels', char_col='chars')

In [None]:
BATCH_SIZE = 32
train_datloader = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
dev_dataloader = DataLoader(dataset_dev, batch_size=BATCH_SIZE, shuffle=False)
test_dataloader = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False)

In [None]:
! wget https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz?dl=1
! mv 'GoogleNews-vectors-negative300.bin.gz?dl=1' GoogleNews-vectors-negative300.bin.gz
! gunzip GoogleNews-vectors-negative300.bin.gz

--2023-11-19 19:48:10--  https://www.dropbox.com/s/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz [following]
--2023-11-19 19:48:10--  https://www.dropbox.com/s/dl/699kgut7hdb5tg9/GoogleNews-vectors-negative300.bin.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucf34ee5355752cda12c4539d8d8.dl.dropboxusercontent.com/cd/0/get/CH2XSWywfn1RTKOje6S9rvI2RqIM56YR-cJTtnqqXxa5UIJ8fCSwMEok5-uM2l5yeoGFgp-gGmaG7WLeHpo6DPJS2zcROBsXrmiJ3MpZntZQ-oJTInszStonoHh8KVTlq6w/file?dl=1# [following]
--2023-11-19 19:48:10--  https://ucf34ee5355752cda12c4539d8d8.dl.dropboxusercontent.com/cd/0/get/CH2XSWywfn1RTKOje6S9rvI2RqIM56YR-cJTtnqqXxa5UI

In [None]:
import gensim

In [None]:
w2v = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
w2v_embeddings = create_embedding_matrix(w2v, token2idx, emb_size = 300)

In [None]:
! pip install pytorch-crf



In [None]:
from torchcrf import CRF
from torch import nn

In [None]:
class CNNBiLSTMCRF(nn.Module):
    def __init__(self, word2id, char2id, idx2tag, num_classes, device,
                 word_embedding_dim=300, char_embedding_dim=20, num_filters=100,
                 hidden_dim=200, num_layers=2, filter_size=3, drop_out=0.5,
                 pad_idx=0, pretrained_embedding=None, freeze_embedding=False):

        super(CNNBiLSTMCRF, self).__init__()

        self.device = device
        self.pad_idx = pad_idx
        self.idx2tag = idx2tag

        self.word_embedding = nn.Embedding(len(word2id), word_embedding_dim, padding_idx=pad_idx)

        if isinstance(pretrained_embedding, np.ndarray):
            self.word_embedding.weight.data.copy_(torch.from_numpy(pretrained_embedding))
            # Keep the pretrained vectors fixed only when freeze_embedding is set.
            self.word_embedding.weight.requires_grad = not freeze_embedding

        self.embedding_dim = char_embedding_dim
        self.char_embedding = nn.Embedding(len(char2id), char_embedding_dim)
        self.char_embedding.weight.requires_grad = True
        self.cnn = nn.Conv3d(in_channels=1, out_channels=num_filters, kernel_size=(1, filter_size, char_embedding_dim))

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.rnn_input_dim = word_embedding_dim + num_filters
        self.lstm = nn.LSTM(self.rnn_input_dim, hidden_dim // 2, num_layers,
                            bidirectional=True, batch_first=True, dropout=drop_out)
        self.fc = nn.Linear(hidden_dim, num_classes)

        self.crf = CRF(num_classes, batch_first=True)

    def forward(self, words_to_id, chars_to_id, mask, label=None):
        word_embedding = self.word_embedding(words_to_id.to(self.device))

        max_len, max_len_char = chars_to_id.size(1), chars_to_id.size(2)
        inputs = chars_to_id.view(-1, max_len * max_len_char)
        input_embed = self.char_embedding(inputs)
        input_embed = input_embed.view(-1, 1, max_len, max_len_char, self.embedding_dim)
        conv_output = self.cnn(input_embed)
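        # Max-pool over character positions; squeeze only the last (size-1) dimension
        # so that a batch of size 1 keeps its batch dimension.
        pool_output = torch.max(conv_output, -2)[0].squeeze(-1)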
        char_embedding = pool_output.transpose(-2, -1).contiguous()

        embedding = torch.cat([word_embedding, char_embedding], 2)

        self.lstm.flatten_parameters()
        out, (h, c) = self.lstm(embedding)
        out = self.fc(out)

        if label is not None:
            loss = -self.crf(out, label, mask, reduction='mean')
        else:
            loss = None

        # Viterbi decoding returns one variable-length list per sentence;
        # pad each list with 0 so the batch can be stacked into a tensor.
        predicted_id = self.crf.decode(out, mask)
        seq_len = words_to_id.shape[-1]
        for i in range(len(predicted_id)):
            predicted_id[i] += [0] * (seq_len - int(mask[i, :].sum()))
        predicted_id = torch.tensor(predicted_id, dtype=torch.int)

        return predicted_id, loss

    def compute_all(self, batch):
        words_to_id = batch['tokens']
        chars_to_id = batch['chars']
        label = batch['labels']
        mask = batch['mask']
        # forward() already returns the padded predictions and the CRF loss.
        predicted_id, loss = self.forward(words_to_id, chars_to_id, mask, label)
        predicted_id = predicted_id.to(label.device)

        # Token-level accuracy over real (unmasked) positions only.
        true_pred = ((label == predicted_id) & mask).sum()
        num_pred = mask.sum()
        acc = (true_pred / num_pred).cpu()
        metrics = dict(acc=acc)

        return loss, metrics

    def decode(self, words_to_id, chars_to_id, mask):
        predicted_id, _ = self.forward(words_to_id, chars_to_id, mask)
        # Trim the padding again so each sentence keeps only its real tokens.
        return [row[:int(m.sum())].tolist() for row, m in zip(predicted_id, mask)]

    def _eval(self, valid_dataloader):
        self.eval()

        true_labels = []
        predict_labels = []

        # No gradients are needed for evaluation
        with torch.no_grad():
            for batch in valid_dataloader:
                # Move the whole batch to the model device once
                batch = {k: v.to(self.device) for k, v in batch.items()}
                out = self.decode(batch['tokens'], batch['chars'], batch['mask'])

                # decode returns unpadded paths, so zip stops at the true sentence length
                for out_sentence, label_sentence in zip(out, batch['labels'].tolist()):
                    for predict_label, true_label in zip(out_sentence, label_sentence):
                        true_labels.append(self.idx2tag[true_label])
                        predict_labels.append(self.idx2tag[predict_label])
        report = classification_report(true_labels, predict_labels, output_dict=True)
        return report

In [None]:
N_CLASSES = len(tag2idx)
model = CNNBiLSTMCRF(token2idx,
                                  char2idx,
                                  idx2tag,
                                  N_CLASSES,
                                  device=device,
                                  hidden_dim=512,
                                  num_layers=2,
                                  drop_out=0.2,
                                  pretrained_embedding=w2v_embeddings
                                  )
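def init_weights(m):
    # Re-initialise everything except the word embeddings, so the pretrained vectors are not overwritten
    for name, param in m.named_parameters():
        if name.startswith('word_embedding'):
            continue
        nn.init.normal_(param.data, mean=0, std=0.1)

# named_parameters() already walks all submodules, so a single direct call is enough
init_weights(model)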
model.to(device)

CNNBiLSTMCRF(
  (word_embedding): Embedding(16634, 300, padding_idx=0)
  (char_embedding): Embedding(16634, 20)
  (cnn): Conv3d(1, 100, kernel_size=(1, 3, 20), stride=(1, 1, 1))
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=512, out_features=13, bias=True)
  (crf): CRF(num_tags=13)
)
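
The printed layers show how the word representation is assembled: the character CNN contributes 100 features per word, and together with the 300-dimensional word embedding they give the 400-dimensional LSTM input. As a rough, dependency-free illustration of what one character filter computes, the sketch below slides filters of width 3 over the character vectors of a single word and keeps the largest response per filter. The max-over-time pooling, the name `char_cnn_features`, and the toy numbers are assumptions made for illustration; bias terms and non-linearities are left out.

```python
def char_cnn_features(char_vectors, filters):
    """Return one feature per filter for a single word.

    char_vectors: list of per-character embedding vectors.
    filters: list of filters, each a list of `width` weight vectors
             with the same dimensionality as the character vectors.
    """
    features = []
    for filt in filters:
        width = len(filt)
        if len(char_vectors) < width:
            # Short words would need padding first; kept explicit in this sketch.
            raise ValueError("word is shorter than the filter width")
        responses = []
        for start in range(len(char_vectors) - width + 1):
            window = char_vectors[start:start + width]
            # Dot product between the filter and the current character window.
            responses.append(
                sum(w * x for wv, xv in zip(filt, window) for w, x in zip(wv, xv))
            )
        # Max-over-time pooling: keep the strongest response of this filter.
        features.append(max(responses))
    return features


# Toy word of four characters with 1-dimensional character embeddings.
word = [[1.0], [2.0], [3.0], [4.0]]
filters = [
    [[1.0], [1.0], [1.0]],    # sums three neighbouring characters
    [[1.0], [0.0], [-1.0]],   # difference between first and third character
]
print(char_cnn_features(word, filters))
```

Each filter yields exactly one number per word regardless of word length, which is why the character features can be concatenated with a fixed-size word embedding.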

In [None]:
import torch.optim as optim

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
trainer(model,  train_datloader, dev_dataloader, optimizer, idx2tag, epochs=10, pad_idx=0)

100%|██████████| 218/218 [12:12<00:00,  3.36s/it]


Epochs: 1 | Accuracy:  0.899


In [None]:
tokens, labels, pred_labels = label_prediction(model, test_dataloader, TAG_PAD_IDX=0, exclude_pad=True)
flat_labels = [el for subset in labels for el in subset]
flat_preds = [el for subset in pred_labels for el in subset]

In [None]:
true_labels = [idx2tag[l] for l in flat_labels]
predict_labels = [idx2tag[l] for l in flat_preds]
labels = [i for i in idx2tag if i not in ('<PAD>', '<UNK>')]
print(classification_report(true_labels, predict_labels, labels=labels))

              precision    recall  f1-score   support

           O       0.91      0.97      0.94     17577
       B-VEH       0.00      0.00      0.00        21
       I-FAC       0.60      0.25      0.36       362
       B-FAC       0.48      0.34      0.40       221
       I-VEH       0.00      0.00      0.00        25
       B-ORG       0.00      0.00      0.00        15
       I-PER       0.60      0.39      0.47      1189
       I-ORG       0.00      0.00      0.00        28
       I-GPE       0.75      0.06      0.11        53
       B-LOC       0.92      0.13      0.22        94
       B-PER       0.62      0.57      0.59       866
       B-GPE       0.80      0.05      0.10        79
       I-LOC       0.82      0.12      0.20       120

    accuracy                           0.89     20650
   macro avg       0.50      0.22      0.26     20650
weighted avg       0.87      0.89      0.87     20650
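
The report above scores individual IOB tags, so a partially recognised entity still earns credit for the tags it gets right. Entity-level scoring needs the tags grouped into spans first. The sketch below shows one way to do that for a single tag column and to compute an exact-match span F1; `iob_to_spans` and `span_f1` are names introduced here, and treating a stray `I-` tag as the start of a span is a lenient convention chosen for the example.

```python
def iob_to_spans(tags):
    """Convert a list of IOB tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" acts as a sentinel that closes the last open span.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match F1 over entity spans of one sentence."""
    gold = set(iob_to_spans(gold_tags))
    pred = set(iob_to_spans(pred_tags))
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]
print(iob_to_spans(gold))
print(span_f1(gold, pred))
```

In this toy case three of four tags are correct, yet only one of the two gold entities is matched exactly.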



**BERT + BiLSTM-CRF**

In [None]:
class CustomBertDataset(Dataset):
    def __init__(self, df, tokenizer, label2id: Dict, text_col: str='context_processed', label_col: str='spell', max_seq_len=124):
        self.tokenizer = tokenizer
        self.label2id = label2id
        self.max_seq_len = max_seq_len
        self.processed_data = self._process_data(df, text_col, label_col)

    def _process_data(self, df, text_col, label_col):
        processed_data = []
        for text, label in zip(df[text_col].tolist(), df[label_col].tolist()):
            processed_text, processed_label = self._process_sample(text, label)
            processed_data.append((processed_text, processed_label))
        return processed_data

    def _process_sample(self, text, labels):
        # `text` is a list of whole words; words missing from the WordPiece vocabulary map to [UNK]
        tmp_input_ids = self.tokenizer.convert_tokens_to_ids(["[CLS]"] + text + ["[SEP]"])[:self.max_seq_len]
        pad_len = self.max_seq_len - len(tmp_input_ids)
        attention_mask = [1] * len(tmp_input_ids) + [0] * pad_len
        input_ids = tmp_input_ids + [0] * pad_len
        # [CLS], [SEP] and padding positions all receive label 0 (<PAD>)
        labels = [self.label2id[label] for label in labels]
        labels = ([0] + labels + [0] + [0] * pad_len)[:self.max_seq_len]

        processed_text = {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
        }
        processed_label = torch.tensor(labels)
        return processed_text, processed_label

    def __len__(self):
        return len(self.processed_data)

    def __getitem__(self, index):
        return self.processed_data[index]
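
The dataset above looks up whole words in the WordPiece vocabulary. An alternative, which is not what this notebook does, is to split each word into sub-tokens and keep the word label only on the first piece. The sketch below illustrates that alignment with a toy splitter; `align_labels`, `toy_wordpiece`, and the use of `<PAD>` as the label for continuation pieces are assumptions made for the example, and such positions would have to be excluded from the loss.

```python
def toy_wordpiece(word, piece_len=4):
    """Hypothetical stand-in for a real sub-word tokenizer: fixed-size pieces."""
    pieces = [word[:piece_len]]
    pieces += ["##" + word[i:i + piece_len] for i in range(piece_len, len(word), piece_len)]
    return pieces


def align_labels(words, labels, tokenize, ignore_label="<PAD>"):
    """Expand word-level labels to sub-token level.

    The first piece of each word keeps the word label; the remaining
    pieces get `ignore_label` so they can be masked out during training.
    """
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word) or ["[UNK]"]
        tokens.extend(pieces)
        aligned.extend([label] + [ignore_label] * (len(pieces) - 1))
    return tokens, aligned


tokens, aligned = align_labels(["elizabeth", "walked"], ["B-PER", "O"], toy_wordpiece)
print(tokens)
print(aligned)
```

The two output lists always have the same length, so they can be padded and batched together.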

In [None]:
# Sort the tag set so that the tag-to-index mapping is the same on every run
tag2idx = {tag: idx for idx, tag in enumerate(sorted({tag for sublist in df_train.labels.tolist() for tag in sublist}), 1)}
tag2idx['<PAD>'] = 0
idx2tag = [tag for _, tag in sorted(tag2idx.items(), key=lambda x: x[1])]

In [None]:
from transformers import DistilBertTokenizer
from transformers import DistilBertModel

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert = DistilBertModel.from_pretrained('distilbert-base-uncased')

In [None]:
dataset_train = CustomBertDataset(df=df_train, tokenizer=tokenizer,  label2id=tag2idx, text_col='sentence', label_col='labels')
dataset_dev = CustomBertDataset(df=df_dev, tokenizer=tokenizer,  label2id=tag2idx, text_col='sentence', label_col='labels')
dataset_test = CustomBertDataset(df=df_test, tokenizer=tokenizer,  label2id=tag2idx, text_col='sentence', label_col='labels')

In [None]:
BATCH_SIZE = 32
train_datloader = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
dev_dataloader = DataLoader(dataset_dev, batch_size=BATCH_SIZE, shuffle=False)
test_dataloader = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False)

In [None]:
import torch
import torch.nn as nn

from torchcrf import CRF

class BertBiLSTMCRF(nn.Module):
    def __init__(self, bert, num_labels, max_seq_len=124):
        super().__init__()
        self.bert = bert
        hidden_size = self.bert.config.hidden_size
        self.lstm_hidden = 128
        self.max_seq_len = max_seq_len
        # Dropout inside nn.LSTM only acts between stacked layers, so it is omitted for a single layer
        self.bilstm = nn.LSTM(hidden_size, self.lstm_hidden, 1, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(self.lstm_hidden * 2, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        mask = attention_mask.bool()
        seq_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)[0]
        seq_out, _ = self.bilstm(seq_out)
        emissions = self.linear(seq_out)  # (batch, seq_len, num_labels)

        # Viterbi paths cover only the unmasked positions; pad them back to the full width
        predicted_id = self.crf.decode(emissions, mask=mask)
        seq_len = input_ids.shape[-1]  # does not depend on labels, so inference without labels works
        for i in range(len(predicted_id)):
            predicted_id[i] += [0] * (seq_len - len(predicted_id[i]))
        predicted_id = torch.tensor(predicted_id, dtype=torch.int)

        loss = None
        if labels is not None:
            loss = -self.crf(emissions, labels, mask=mask, reduction='mean')
        return predicted_id, loss

In [None]:
model = BertBiLSTMCRF(bert, len(tag2idx), 124)
model.to(device);

In [None]:
weight_decay_finetune = 1e-5
learning_rate = 5e-5

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay_finetune)

In [None]:
import numpy as np

def label_prediction(model, iterator, TAG_PAD_IDX=0, exclude_pad=True):
    model.eval()

    tokens, labels, pred_labels = [], [], []
    with torch.no_grad():
        for batch, tags in iterator:
            texts = batch['input_ids'].to(device)
            tags = tags.to(device)
            mask = batch['attention_mask'].to(device)
            predictions, _ = model(texts, mask, tags)

            for row in range(predictions.shape[0]):
                sent_tags_true = tags[row].cpu().numpy()
                sent_texts = texts[row].cpu().numpy()
                preds = predictions[row].cpu().numpy()
                # TAG_PAD_IDX is compared against token ids: [PAD] has id 0 in the BERT vocabulary
                if exclude_pad:
                    keep = np.where(sent_texts != TAG_PAD_IDX)[0]
                else:
                    keep = np.arange(len(sent_texts))

                tokens.append(list(sent_texts[keep]))
                labels.append(list(sent_tags_true[keep]))
                pred_labels.append(list(preds[keep]))

    return tokens, labels, pred_labels

In [None]:
def trainer(model, train_dataloader, valid_dataloader, optimizer, idx2tag, epochs=5, pad_idx=0, clip=1):

    best_acc = 0
    best_loss = 1000

    for epoch_num in range(epochs):

        classes = []
        predicted_classes = []
        epoch_acc_train = 0
        epoch_loss_train = 0

        model.train()



        for batch, train_label in tqdm(train_dataloader):
            optimizer.zero_grad()
            predictions, loss = model(batch['input_ids'].to(device), batch['attention_mask'].to(device), train_label.to(device))

            train_label = train_label.view(-1).cpu()
            predictions = predictions.view(-1).cpu()
            texts = batch['input_ids'].to(device).view(-1).cpu()

            true_pred = ((train_label == predictions) & (texts != pad_idx)).sum()
            num_pred = (texts != pad_idx).sum()
            acc = true_pred / num_pred
            epoch_acc_train += acc

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
            optimizer.step()
            epoch_loss_train += loss.item()

        train_accuracy = epoch_acc_train / len(train_dataloader)
        train_loss = epoch_loss_train / len(train_dataloader)

        model.eval()

        epoch_acc_valid = 0
        epoch_loss_valid = 0

        # Validation needs no gradients and no optimizer calls
        with torch.no_grad():
            for batch, valid_label in valid_dataloader:
                predictions, loss = model(batch['input_ids'].to(device), batch['attention_mask'].to(device), valid_label.to(device))

                epoch_loss_valid += loss.item()

                valid_label = valid_label.view(-1).cpu()
                predictions = predictions.view(-1).cpu()
                texts = batch['input_ids'].view(-1).cpu()

                true_pred = ((valid_label == predictions) & (texts != pad_idx)).sum()
                num_pred = (texts != pad_idx).sum()
                acc = true_pred / num_pred
                epoch_acc_valid += acc

                predicted_classes.extend(predictions.numpy())
                classes.extend(valid_label.numpy())

        valid_accuracy = epoch_acc_valid / len(valid_dataloader)
        valid_loss = epoch_loss_valid / len(valid_dataloader)

        print(
            f'Epochs: {epoch_num + 1} | Accuracy: {valid_accuracy: .3f}')

In [None]:
trainer(model,  train_datloader, dev_dataloader, optimizer, idx2tag, epochs=10, pad_idx=0)

 16%|█▌        | 34/218 [17:00<1:32:01, 30.01s/it]


KeyboardInterrupt: ignored

In [None]:
tokens, labels, pred_labels = label_prediction(model, test_dataloader, TAG_PAD_IDX=0, exclude_pad=True)
flat_labels = [el for subset in labels for el in subset]
flat_preds = [el for subset in pred_labels for el in subset]

In [None]:
true_labels = [idx2tag[l] for l in flat_labels]
predict_labels = [idx2tag[l]  for l in flat_preds]
labels = [i for i in idx2tag if i not in ('<PAD>', '<UNK>')]
print(classification_report(true_labels, predict_labels, labels=labels))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1570
           1       0.85      1.00      0.92     17573
           2       0.00      0.00      0.00        21
           3       0.00      0.00      0.00       362
           4       0.00      0.00      0.00       221
           5       0.00      0.00      0.00        25
           6       0.00      0.00      0.00        15
           7       0.00      0.00      0.00      1189
           8       0.00      0.00      0.00        28
           9       0.00      0.00      0.00        53
          10       0.00      0.00      0.00        94
          11       0.00      0.00      0.00       866
          12       0.00      0.00      0.00        79
          13       0.00      0.00      0.00       120

    accuracy                           0.86     22216
   macro avg       0.13      0.14      0.14     22216
weighted avg       0.74      0.86      0.80     22216
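
The report above has the shape of a tagger that labels every non-padding token with class 1 (presumably `O`): accuracy stays at 0.86 while the macro averages collapse. The sketch below reproduces the summary rows from the support column alone. It assumes that class 0 is predicted perfectly and that everything else is assigned to class 1, which is what the zero rows suggest; `majority_baseline` is a name introduced here.

```python
# Supports copied from the report above (classes 0..13).
support = [1570, 17573, 21, 362, 221, 25, 15, 1189, 28, 53, 94, 866, 79, 120]


def majority_baseline(support, pad_class=0, majority_class=1):
    """Accuracy and macro (precision, recall, F1) of an all-majority tagger."""
    total = sum(support)
    # Every token outside the padding class is predicted as the majority class.
    predicted_as_majority = total - support[pad_class]
    per_class = []
    for cls, n in enumerate(support):
        if cls == pad_class:
            precision, recall = 1.0, 1.0
        elif cls == majority_class:
            precision, recall = n / predicted_as_majority, 1.0
        else:
            precision, recall = 0.0, 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    accuracy = (support[pad_class] + support[majority_class]) / total
    macro = tuple(sum(m[i] for m in per_class) / len(support) for i in range(3))
    return accuracy, macro


accuracy, (macro_p, macro_r, macro_f1) = majority_baseline(support)
print(round(accuracy, 2), round(macro_p, 2), round(macro_r, 2), round(macro_f1, 2))
```

The rounded values match the `accuracy` and `macro avg` rows, which is why accuracy alone is a poor indicator on data this imbalanced.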



Item 3 (CNN + Transformer + CRF)

In [None]:
class CNN_TRANSFORMER_CRF(nn.Module):
    def __init__(self,
                word2id,
                char2id,
                idx2tag,
                num_classes,
                device,
                word_embedding_dim=300,
                char_embedding_dim=20,
                num_filters=100,
                filter_size=3,
                drop_out=0.5,
                pad_idx=0,
                pretrained_embedding=None,
                ):
        super(CNN_TRANSFORMER_CRF, self).__init__()

        self.device = device
        self.pad_idx = pad_idx
        self.idx2tag = idx2tag

        # embeddings
        self.word_embedding = nn.Embedding(len(word2id), word_embedding_dim, padding_idx=pad_idx)
        # init embedding, if pretrained provided - add weights
        if type(pretrained_embedding) == np.ndarray:
            print('add pretrained embeddings')
            self.word_embedding.weight.data.copy_(torch.from_numpy(pretrained_embedding))
            self.word_embedding.weight.requires_grad = True

        # cnn
        self.embedding_dim = char_embedding_dim
        self.char_embedding = nn.Embedding(len(char2id), char_embedding_dim)
        self.char_embedding.weight.requires_grad = True
        self.cnn = nn.Conv3d(in_channels=1, out_channels=num_filters, kernel_size=(1, filter_size, char_embedding_dim))

        # transformer
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        bert_dim = self.bert.config.hidden_size
        # Project the concatenated word + char features to the width DistilBERT expects
        self.proj = nn.Linear(word_embedding_dim + num_filters, bert_dim)
        self.dropout = nn.Dropout(drop_out)
        # Per-token tag scores that serve as CRF emissions
        self.fc = nn.Linear(bert_dim, num_classes)

        # crf
        self.crf = CRF(num_classes, batch_first=True)

    def _emissions(self, words_to_id, chars_to_id, mask):
        word_embedding = self.word_embedding(words_to_id)                # (B, T, word_dim)
        char_embedding = self.char_embedding(chars_to_id).unsqueeze(1)   # (B, 1, T, L, char_dim)
        char_features = self.cnn(char_embedding).squeeze(-1)             # (B, F, T, L - filter_size + 1)
        # Max over character positions, then move the filter axis last: (B, T, F)
        char_features = char_features.max(dim=-1).values.permute(0, 2, 1)

        embedding = self.proj(torch.cat([word_embedding, char_features], dim=2))
        # Custom embeddings must be passed as inputs_embeds, not as token ids
        bert_out = self.bert(inputs_embeds=embedding, attention_mask=mask.long())[0]
        return self.fc(self.dropout(bert_out))

    def forward(self, words_to_id, chars_to_id, mask, label=None):
        mask = mask.bool()
        out = self._emissions(words_to_id, chars_to_id, mask)

        # crf
        loss = None
        if label is not None:
            loss = -self.crf(out, label, mask, reduction='mean')

        predicted_id = self.crf.decode(out, mask)
        seq_len = mask.shape[-1]
        for i in range(len(predicted_id)):
            predicted_id[i] += [0] * (seq_len - len(predicted_id[i]))
        predicted_id = torch.tensor(predicted_id, dtype=torch.int)
        return predicted_id, loss

    def compute_all(self, batch):
        label = batch['labels']
        predicted_id, loss = self.forward(batch['tokens'], batch['chars'], batch['mask'], label)
        predicted_id = predicted_id.to(self.device)

        true_pred = ((label == predicted_id) & (label != self.pad_idx)).sum()
        num_pred = (label != self.pad_idx).sum()
        acc = (true_pred / num_pred).cpu()
        metrics = dict(acc=acc)

        return loss, metrics

    def decode(self, words_to_id, chars_to_id, mask):
        mask = mask.bool()
        out = self._emissions(words_to_id, chars_to_id, mask)
        # Unpadded Viterbi paths, one list per sentence
        return self.crf.decode(out, mask)

    def _eval(self, valid_dataloader):
        self.eval()

        true_labels = []
        predict_labels = []

        # No gradients are needed for evaluation
        with torch.no_grad():
            for batch in valid_dataloader:
                # Use the device stored on the model rather than a global variable
                batch = {k: v.to(self.device) for k, v in batch.items()}
                out = self.decode(batch['tokens'], batch['chars'], batch['mask'])

                for out_sentence, label_sentence in zip(out, batch['labels'].tolist()):
                    for predict_label, true_label in zip(out_sentence, label_sentence):
                        true_labels.append(self.idx2tag[true_label])
                        predict_labels.append(self.idx2tag[predict_label])
        report = classification_report(true_labels, predict_labels, output_dict=True)
        return report


The architecture mirrors the classic setup, but because of Colab usage limits we did not manage to train the model properly (training is identical to CNN-BiLSTM-CRF, except that the model is instantiated from this class).

## Part 3. [2 points] Event extraction

1. Use a BiLSTM over word embeddings to extract event triggers.

2. Replace the word-level part of the model with ELMo and/or BERT. The result should be an ELMo / BERT + BiLSTM model.

[bonus] Pre-train the BiLSTM as a language model. Fine-tune it for trigger extraction.

[bonus] Fine-tune BERT for event trigger extraction.

In [None]:
!git clone https://github.com/dbamman/litbank.git

fatal: destination path 'litbank' already exists and is not an empty directory.


In [None]:
!pip install transformers
import json
import os
import shutil
from collections import Counter, defaultdict
from copy import deepcopy

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.manifold import TSNE
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter
from tqdm.auto import tqdm
from yellowbrick.cluster import KElbowVisualizer



In [None]:
root = '/content/litbank/events/tsv'
files = sorted(list(os.walk(root))[0][-1])

In [None]:
import random

def train_dev_test_split(data, train_percent=80, dev_percent=10, test_percent=10, seed=42):
    # Shuffle a copy with a fixed seed: the split is reproducible and the caller's list is left intact
    data = list(data)
    random.Random(seed).shuffle(data)
    train_size = int(len(data) * train_percent / 100)
    dev_size = int(len(data) * dev_percent / 100)
    train_data = data[:train_size]
    dev_data = data[train_size:train_size + dev_size]
    # Everything left over goes to test, so rounding never drops a file
    test_data = data[train_size + dev_size:]
    return train_data, dev_data, test_data

In [None]:
train_data, dev_data, test_data = train_dev_test_split(files, 80, 10, 10)

In [None]:
from string import punctuation

def prepare_sents(root, files, verbose=True):
    sents, labels = [], []
    for f in tqdm(files, disable=not verbose):
        path = os.path.join(root, f)
        df = pd.read_csv(path, sep='\t', quoting=3, header=None)
        words = list(df[0])
        labels_list = list(df[1])
        sentences = []
        labels_sent = []
        for i in range(len(words)):
            if words[i] not in ['.', '!', '?', '...', '']:
                if words[i] in punctuation:
                    continue
                sentences.append(words[i].lower())
                labels_sent.append(labels_list[i])
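            elif sentences != []:
                sents.append(sentences)
                labels.append(labels_sent)
                sentences = []
                labels_sent = []
        # Keep the trailing sentence when a file does not end with terminal punctuation
        if sentences:
            sents.append(sentences)
            labels.append(labels_sent)
    return sents, labels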

sents_train, labels_train = prepare_sents(root, train_data)
sents_dev, labels_dev = prepare_sents(root, dev_data)
sents_test, labels_test = prepare_sents(root, test_data)

  0%|          | 0/80 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/10 [00:00<?, ?it/s]

In [None]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

In [None]:
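def vectorizing(sent, embed_dim=768):
    # Pass the words pre-split so that word_ids() indexes into `sent`, even for tokens such as "n't"
    encoded = bert_tokenizer(sent, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = bert_model(**encoded)
        states = output.hidden_states
        # Sum of the last four hidden layers: (num_subtokens, embed_dim)
        summed = torch.stack(states[-4:]).sum(0).squeeze(0)
    word_ids = np.array(encoded.word_ids(), dtype=object)
    embeds = np.zeros((len(sent), embed_dim))
    for idx in range(len(sent)):
        token_ids_word = np.where(word_ids == idx)[0]
        if len(token_ids_word) == 0:
            continue  # the word produced no sub-tokens or was truncated; its row stays zero
        # Average the sub-token vectors that belong to this word
        embeds[idx] = summed[torch.as_tensor(token_ids_word)].mean(dim=0).numpy()
    return embeds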
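
The `vectorizing` function above turns sub-token vectors into one vector per word by averaging all sub-tokens that share a word index. The same pooling step can be sketched without any model, using plain lists; `pool_subwords` and the toy vectors are illustrative names and values, and `None` marks special tokens in the same way `word_ids()` does.

```python
def pool_subwords(vectors, word_ids, num_words):
    """Average sub-token vectors per word.

    vectors: one vector (list of floats) per sub-token.
    word_ids: for each sub-token, the index of its word, or None for special tokens.
    num_words: number of words in the sentence.
    """
    dim = len(vectors[0])
    sums = [[0.0] * dim for _ in range(num_words)]
    counts = [0] * num_words
    for vec, wid in zip(vectors, word_ids):
        if wid is None:
            continue  # skip special tokens such as [CLS] and [SEP]
        counts[wid] += 1
        for k, value in enumerate(vec):
            sums[wid][k] += value
    # Words without any sub-token keep a zero vector.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [4.0, 4.0], [9.0, 9.0]]
word_ids = [None, 0, 0, 1, None]
print(pool_subwords(vectors, word_ids, num_words=2))
```

The first word is split into two pieces and receives their mean, while the special tokens at both ends do not contribute to any word.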

In [None]:
embeds_train = []
for sent in tqdm(sents_train):
    embed = vectorizing(sent)
    embeds_train.append(embed)

  0%|          | 0/7086 [00:00<?, ?it/s]

In [None]:
embeds_dev = []
for sent in tqdm(sents_dev):
    embed = vectorizing(sent)
    embeds_dev.append(embed)

  0%|          | 0/902 [00:00<?, ?it/s]

In [None]:
class CustomDataset(Dataset):
    def __init__(self, sents, labels):
        self.sents = sents
        self.labels = labels

    def __len__(self):
        return len(self.sents)

    def __getitem__(self, idx):
        sent = self.sents[idx]
        label = self.labels[idx]
        sent_len = len(sent)

        label = (np.array(label) == 'EVENT').astype(np.int32)

        return {
            'sample': sent,
            'label': label,
            'length': sent_len
        }

In [None]:
train_ds = CustomDataset(embeds_train, labels_train)
val_ds = CustomDataset(embeds_dev, labels_dev)

In [None]:
from sklearn.metrics import f1_score

class BiLSTM(nn.Module):
    def __init__(self, embedding_dim=768, hidden_dim=256, num_layers=1, lr_scheduler=None, lr_scheduler_type=None):
        super(BiLSTM, self).__init__()
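        # Dropout in nn.LSTM only acts between stacked layers, so it is disabled for a single layer
        self.rnn = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=num_layers, bidirectional=True, dropout=0.3 if num_layers > 1 else 0.0)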
        self.lr_scheduler = lr_scheduler
        self.lr_scheduler_type = lr_scheduler_type
        self.criterion = nn.CrossEntropyLoss(weight=torch.tensor([1., 24.]))
        self.embedder = nn.Linear(embedding_dim, embedding_dim // 2)
        self.hidden2tag = nn.Linear(hidden_dim, 2)
        self.hidden_dim = hidden_dim

    def forward(self, x, length):
        packed = nn.utils.rnn.pack_padded_sequence(x, length, batch_first=True, enforce_sorted=False)
        output, (h_n, c_n) = self.rnn(packed)
        return self.hidden2tag(output.data)

    def compute_all(self, batch):
        sample = batch['sample'].float()
        length = batch['length'].cpu()
        labels = batch['label']

        logits = self.forward(sample, length)

        labels = nn.utils.rnn.pack_padded_sequence(labels.float(), length, batch_first=True, enforce_sorted=False).data

        loss = self.criterion(logits, labels.long())
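        # Note: the value stored under 'acc' is the F1 score of the EVENT class, not accuracy
        acc = f1_score(labels.long().cpu().numpy(), torch.argmax(logits, dim=1).detach().cpu().numpy(), zero_division=0)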

        metrics = dict(acc=acc, loss=loss.item())
        return loss, metrics

    def post_train_batch(self):
        if self.lr_scheduler is not None and self.lr_scheduler_type == 'per_batch':
            self.lr_scheduler.step()

    def post_val_stage(self, val_loss):
        if self.lr_scheduler is not None and self.lr_scheduler_type == 'per_epoch':
            self.lr_scheduler.step(val_loss)
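
The loss in the `BiLSTM` class above weights the positive class 24 times more heavily than the negative one, which compensates for triggers being rare. One way to derive such a weight, assumed here rather than taken from the notebook, is the ratio of non-trigger to trigger tokens in the training labels. `inverse_frequency_weights` is a name introduced for this sketch.

```python
from collections import Counter


def inverse_frequency_weights(label_sequences, positive="EVENT"):
    """Return [negative_weight, positive_weight] for a binary tagging task.

    The negative class keeps weight 1.0; the positive class is weighted by
    how many negative tokens there are per positive token.
    """
    counts = Counter(label == positive for seq in label_sequences for label in seq)
    negatives, positives = counts[False], counts[True]
    if positives == 0:
        raise ValueError("no positive labels found")
    return [1.0, negatives / positives]


toy_labels = [["O", "O", "EVENT"], ["O", "O", "O"]]
print(inverse_frequency_weights(toy_labels))
```

The resulting list can be passed as the `weight` tensor of a cross-entropy loss in place of a hand-picked constant.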

In [None]:
def collate_fn(batch):
    return {'sample': torch.nn.utils.rnn.pad_sequence([torch.tensor(d['sample']) for d in batch], batch_first=True),
            'label': torch.nn.utils.rnn.pad_sequence([torch.tensor(d['label']) for d in batch], batch_first=True),
            'length': torch.tensor([d['length'] for d in batch])}

In [None]:
class model_training:
    def __init__(self, model, optimizer, train_loader, val_loader, output_folder: str = '/content/drive/MyDrive/rbg_unet/', batch_size: int = 4):
        self.output_folder = output_folder
        self.tboard_log_dir = './tboard_logs/'
        self.model = model
        self.optimizer = optimizer
        self.train_dataset = train_loader
        self.val_dataset = val_loader
        self.batch_size = batch_size

        shutil.rmtree(self.tboard_log_dir, ignore_errors=True)
        if os.path.exists(self.output_folder):
            self.load_checkpoint()
        else:
            os.makedirs(self.output_folder)
            self.global_step = 0
            self.prev_epoch = 0
            self.best_loss = float('inf')
        self.train_writer = SummaryWriter(log_dir=self.tboard_log_dir + "train/")
        self.val_writer = SummaryWriter(log_dir=self.tboard_log_dir + "val/")
        self.cache = self.cache_states()

    def load_checkpoint(self):
        checkpoint = torch.load(os.path.join(self.output_folder, 'last_checkpoint.pth'))
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.prev_epoch = checkpoint['epoch_num']
        self.best_loss = checkpoint['loss']
        self.global_step = checkpoint['global_step']
        shutil.copytree(os.path.join(self.output_folder, 'tboard_logs'), self.tboard_log_dir)
        return

    def save_checkpoint(self, path, model_state_dict, optimizer_state_dict, loss, epoch_num):
        if os.path.exists(os.path.join(self.output_folder, 'tboard_logs')):
            shutil.rmtree(os.path.join(self.output_folder, 'tboard_logs'))
        shutil.copytree(self.tboard_log_dir, os.path.join(self.output_folder, 'tboard_logs'))
        torch.save({
            'model_state_dict': model_state_dict,
            'optimizer_state_dict': optimizer_state_dict,
            'loss': loss,
            'epoch_num': epoch_num,
            'global_step': self.global_step,
        }, path)

    def train(self, num_epochs: int):
        model = self.model
        optimizer = self.optimizer

        for epoch in range(self.prev_epoch + 1, self.prev_epoch + num_epochs + 1):
            model.train()
            # Iterate over the loader passed to the constructor, not over a global variable
            for batch in tqdm(self.train_dataset, desc='Epoch {}'.format(epoch)):
                batch = {k: v.float() for k, v in batch.items()}
                loss, details = model.compute_all(batch)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                model.post_train_batch()
                for k, v in details.items():
                    self.train_writer.add_scalar(k, v, global_step=self.global_step)
                self.train_writer.flush()
                self.global_step += 1

            model.eval()
            val_accs = []
            val_losses = []
            val_logs = defaultdict(list)
            # Validation needs no gradients; use the loader passed to the constructor
            with torch.no_grad():
                for batch in tqdm(self.val_dataset):
                    batch = {k: v.float() for k, v in batch.items()}
                    loss, details = model.compute_all(batch)
                    val_losses.append(loss.item())
                    val_accs.append(details['acc'])
                    for k, v in details.items():
                        val_logs[k].append(v)
            val_logs = {k: np.mean(v) for k, v in val_logs.items()}

            for k, v in val_logs.items():
                self.val_writer.add_scalar(k, v, global_step=self.global_step)
            self.val_writer.flush()

            val_acc = np.mean(val_accs)
            val_loss = np.mean(val_losses)
            print('Epoch #{} Loss: {} Accuracy: {}'.format(epoch, val_loss, val_acc))

            model.post_val_stage(val_loss)
            self.save_checkpoint(os.path.join(self.output_folder, 'last_checkpoint.pth'), model.state_dict(), optimizer.state_dict(), val_loss, epoch)
            if val_loss < self.best_loss:
                self.save_checkpoint(os.path.join(self.output_folder, 'best_checkpoint.pth'), model.state_dict(), optimizer.state_dict(), val_loss, epoch)
                self.best_loss = val_loss

    def cache_states(self):
        return {'model_state': deepcopy(self.model.state_dict()), 'optimizer_state': deepcopy(self.optimizer.state_dict())}

In [None]:
batch_size=16

model = BiLSTM(hidden_dim=64, num_layers=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

train_loader = DataLoader(train_ds, shuffle=True, pin_memory=True, batch_size=batch_size, collate_fn=collate_fn)
val_loader = DataLoader(val_ds, shuffle=False, pin_memory=True, batch_size=batch_size, collate_fn=collate_fn)

training = model_training(model, opt, train_loader, val_loader, batch_size=16, output_folder = '/content/drive/MyDrive/ner-2/testing33')

In [None]:
training.train(100)

Epoch 15:   0%|          | 0/443 [00:00<?, ?it/s]

  0%|          | 0/57 [00:00<?, ?it/s]

Epoch #15 Loss: 0.1814485943892546 Accuracy: 0.41423061310291637


Epoch 16:   0%|          | 0/443 [00:00<?, ?it/s]

  0%|          | 0/57 [00:00<?, ?it/s]

Epoch #16 Loss: 0.1905989033872621 Accuracy: 0.44180158822343085


Epoch 17:   0%|          | 0/443 [00:00<?, ?it/s]

  0%|          | 0/57 [00:00<?, ?it/s]

Epoch #17 Loss: 0.18950096609299644 Accuracy: 0.4185751916106374


Epoch 18:   0%|          | 0/443 [00:00<?, ?it/s]

## Part 4. [2 points] Joint extraction of named entities and events
1. Train a model that extracts named entities and event triggers jointly. The model must have a shared encoder (for example, CNN + BiLSTM, ELMo + BiLSTM, or BERT + BiLSTM) and two decoders: one extracts named entities, the other extracts event triggers.

[bonus] Add an attention mechanism to the model in whatever way seems reasonable to you.

[bonus] Visualize the attention maps.
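
With a shared encoder and two decoders, each decoder produces its own loss, and the two have to be combined into a single training objective. The assignment does not prescribe how; a weighted sum is one simple option and is purely an assumption of this sketch, as are the names `cross_entropy`, `joint_loss`, and the mixing weight `alpha`.

```python
import math


def cross_entropy(probs, gold):
    """Mean negative log-likelihood of the gold class at each token."""
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)


def joint_loss(ner_probs, ner_gold, event_probs, event_gold, alpha=0.5):
    """Weighted sum of the entity loss and the event-trigger loss.

    alpha = 1.0 trains only the entity decoder, alpha = 0.0 only the event decoder.
    """
    ner = cross_entropy(ner_probs, ner_gold)
    event = cross_entropy(event_probs, event_gold)
    return alpha * ner + (1 - alpha) * event


# One token: two entity classes and two event classes.
ner_probs, ner_gold = [[0.5, 0.5]], [0]
event_probs, event_gold = [[0.25, 0.75]], [1]
print(joint_loss(ner_probs, ner_gold, event_probs, event_gold, alpha=0.5))
```

Because both terms back-propagate through the same encoder, `alpha` controls how strongly each task shapes the shared representation.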

The corpus is organized as follows.
First level:
* entities -- entity annotation
* events -- event annotation

The corpus uses 6 types of named entities: PER, LOC, ORG, FAC, GPE, VEH (persons, locations, organizations, facilities, geopolitical entities, vehicles); nested entities are allowed.

An event is expressed by a single word, the *trigger*, which can be a verb, an adjective, or a noun. The corpus annotates only events that actually happen and are not hypothetical.
Example: she *walked* rapidly and resolutely; here *walked* is the event trigger. Event types are not specified.

Second level:
* brat -- working files of the brat annotation tool; ann files contain the annotation, txt files contain the raw texts
* tsv -- tsv files contain the annotation in IOB format

Train a model that extracts named entities and event triggers jointly. The model must have a shared BERT + BiLSTM encoder and two decoders: one extracts named entities, the other extracts event triggers.
Additionally:
Add an attention mechanism to the model in whatever way seems reasonable to you.
Visualize the attention maps.
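
An attention map is simply the matrix of attention weights: one row per query token, one column per key token, with each row summing to 1. The task leaves the choice of mechanism open, so the sketch below uses scaled dot-product attention as an assumed example and computes the weights with the standard library only. `attention_weights` and the toy vectors are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights.

    Returns a matrix with one row per query and one column per key;
    this matrix is what an attention heat map displays.
    """
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention_weights(tokens, tokens):
    print([round(w, 3) for w in row])
```

The resulting rows can be handed to any heat-map plotting routine, with the sentence tokens as axis labels.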

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
!git clone https://github.com/dbamman/litbank.git

fatal: destination path 'litbank' already exists and is not an empty directory.


In [None]:
import os

root = 'litbank/entities/tsv'
files = sorted([
    f"{root}/{file}"
    for file in list(os.walk(root))[0][-1]])

assert len(files) == 100

In [None]:
from string import punctuation
from tqdm import tqdm
import pandas as pd

def get_sents(path):
    df_entities = pd.read_csv(path, sep='\t', quoting=3, header=None)
    df_events = pd.read_csv(path.replace('/entities/', '/events/'), sep='\t', quoting=3, header=None)
    df = pd.concat([
            df_entities[[0, 1]],
            df_events,
        ], ignore_index=True, axis=1).dropna(subset=[0])

    if len(df[df[0] != df[2]]):
        raise ValueError(f"Something went wrong, file = {path}")

    words = list(df[0])
    entities_labels = list(df[1])
    events_labels = list(df[3])

    assert len(words) == len(entities_labels) and len(words) == len(events_labels), 'Length mismatch between words and labels'
    sentences = [[]]
    entities_labels_sent = [[]]
    events_labels_sent = [[]]
    for i in range(len(words)):
        if words[i] not in ['.', '!', '?', '...', '']:
            if words[i] in punctuation:
                continue
            sentences[-1].append(words[i].lower())
            entities_labels_sent[-1].append(entities_labels[i])
            events_labels_sent[-1].append(events_labels[i])
        elif sentences[-1] != []:
            sentences.append([])
            entities_labels_sent.append([])
            events_labels_sent.append([])
    return sentences, entities_labels_sent, events_labels_sent

def prepare_sents(files, verbose=True):
    sents, ent_labels, event_labels = [], [], []
    for f in tqdm(files, disable=not verbose):
        sent, ent, event = get_sents(f)
        if sent[-1] == []:
            sent = sent[:-1]
            ent = ent[:-1]
            event = event[:-1]
        sents += sent
        ent_labels += ent
        event_labels += event
    return sents, ent_labels, event_labels

In [None]:
sents, ent_labels, event_labels = prepare_sents(files)

100%|██████████| 100/100 [00:00<00:00, 112.91it/s]


In [None]:
from sklearn.model_selection import train_test_split

def split_dataset(sents, ent_labels, event_labels, train_size=0.8, random_state=42):
    """
    Split dataset into train, dev, and test sets.

    Parameters:
    sents (list): List of sentences.
    ent_labels (list): List of entity labels.
    event_labels (list): List of event labels.
    train_size (float): The proportion of the dataset to include in the train split.
    random_state (int): Controls the shuffling applied to the data before applying the split.

    Returns:
    train_data, dev_data, test_data (tuple): Train, dev, and test splits.
    """

    # Create data indices for train, dev, and test splits
    data_size = len(sents)
    train_indices, remaining_indices = train_test_split(
        range(data_size), train_size=train_size, random_state=random_state
    )
    dev_indices, test_indices = train_test_split(
        remaining_indices, train_size=0.5, random_state=random_state
    )

    # Split data into train, dev, and test splits
    train_data = [sents[i] for i in train_indices], [ent_labels[i] for i in train_indices], [event_labels[i] for i in train_indices]
    dev_data = [sents[i] for i in dev_indices], [ent_labels[i] for i in dev_indices], [event_labels[i] for i in dev_indices]
    test_data = [sents[i] for i in test_indices], [ent_labels[i] for i in test_indices], [event_labels[i] for i in test_indices]

    return train_data, dev_data, test_data

In [None]:
train_data, dev_data, test_data = split_dataset(sents, ent_labels, event_labels, 0.8, random_state=42)

train_data[0][0], train_data[1][0], train_data[2][0]

(['lead', 'him', 'not', 'into', 'temptation'],
 ['O', 'O', 'O', 'O', 'O'],
 ['O', 'O', 'O', 'O', 'O'])

In [None]:
import numpy as np

In [None]:
def flatten(lst):
    result = []
    for i in lst:
        if isinstance(i, list):
            result.extend(flatten(i))
        else:
            result.append(i)
    return result

In [None]:
sorted(set(flatten(ent_labels))), sorted(set(flatten(event_labels)))

(['B-FAC',
  'B-GPE',
  'B-LOC',
  'B-ORG',
  'B-PER',
  'B-VEH',
  'I-FAC',
  'I-GPE',
  'I-LOC',
  'I-ORG',
  'I-PER',
  'I-VEH',
  'O'],
 ['EVENT', 'O'])
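
The entity labels listed above follow the B-/I-/O scheme, so entity spans can be recovered from a tag sequence. The helper below is a minimal sketch of that decoding; the name `bio_to_spans` and the choice to treat a stray `I-` tag as a span start are our assumptions, not part of LitBank or of this notebook.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (start, end, type) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the type matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Otherwise any open span ends here.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated as a span start too.
        if tag != "O":
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

tags = ["B-PER", "I-PER", "O", "B-FAC", "I-FAC", "I-FAC", "O", "B-PER"]
print(bio_to_spans(tags))
# [(0, 2, 'PER'), (3, 6, 'FAC'), (7, 8, 'PER')]
```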

In [None]:
entities_dict = {
    'B-FAC': 12,
    'B-GPE': 11,
    'B-LOC': 10,
    'B-ORG': 9,
    'B-PER': 8,
    'B-VEH': 7,
    'I-FAC': 6,
    'I-GPE': 5,
    'I-LOC': 4,
    'I-ORG': 3,
    'I-PER': 2,
    'I-VEH': 1,
    'O': 0,
}
triggers_dict = {
    'EVENT': 1,
    'O': 0,
}

In [None]:
from torch.utils.data import Dataset
from transformers import DistilBertTokenizer
import torch

class EntityEventDataset(Dataset):
    def __init__(self, sentences, entities, triggers):
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        self.sentences = sentences
        self.entities = entities
        self.triggers = triggers

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        entities = self.entities[idx]
        triggers = self.triggers[idx]
        encoding = self.tokenizer.encode_plus(sentence, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()
        entity_ids = torch.tensor([entities_dict[key] for key in entities])
        trigger_ids = torch.tensor([triggers_dict[key] for key in triggers])
        return {'input_ids': input_ids,
                'attention_mask': attention_mask,
                'entity_ids': entity_ids,
                'trigger_ids': trigger_ids}
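
The labels in this notebook are per word, while BERT-style tokenizers generally work with subword pieces. One common way to reconcile the two is to keep the label on the first piece of each word and mask the remaining pieces. The sketch below shows that alignment in plain Python; `toy_subword_split` is a hypothetical stand-in for a real tokenizer, and the label ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def toy_subword_split(word, max_piece=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    pieces = [word[i:i + max_piece] for i in range(0, len(word), max_piece)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def align_labels(words, label_ids, split=toy_subword_split):
    """Expand word-level label ids to subword level.

    The first piece of each word keeps the word's label; the remaining
    pieces get IGNORE_INDEX so that a loss function can skip them.
    """
    pieces, aligned = [], []
    for word, label in zip(words, label_ids):
        word_pieces = split(word)
        pieces.extend(word_pieces)
        aligned.extend([label] + [IGNORE_INDEX] * (len(word_pieces) - 1))
    return pieces, aligned

words = ["lead", "him", "not", "into", "temptation"]
labels = [0, 0, 0, 0, 1]  # made-up trigger ids, for illustration only
pieces, aligned = align_labels(words, labels)
print(pieces)   # ['lead', 'him', 'not', 'into', 'temp', '##tati', '##on']
print(aligned)  # [0, 0, 0, 0, 1, -100, -100]
```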

In [None]:
train_dataset = EntityEventDataset(*train_data)
dev_dataset = EntityEventDataset(*dev_data)
test_dataset = EntityEventDataset(*test_data)

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print(f"Using device: {device}")

Using device: cuda


In [None]:
from torch.nn import LSTM, Linear
from transformers import DistilBertModel
from tqdm import tqdm, trange
from IPython.display import clear_output
import gc
gc.enable()

class EntityEventModel(torch.nn.Module):
    def __init__(self, hidden_dim, dropout_prob, device):
        super(EntityEventModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.dropout_prob = dropout_prob
        self.device = device
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.lstm = LSTM(768, hidden_dim, num_layers=1, batch_first=True, dropout=dropout_prob, bidirectional=True)
        self.entity_classifier = Linear(hidden_dim * 2, 13)
        self.trigger_classifier = Linear(hidden_dim * 2, 2)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.entity_classifier.bias.data.zero_()
        self.entity_classifier.weight.data.uniform_(-initrange, initrange)
        self.trigger_classifier.bias.data.zero_()
        self.trigger_classifier.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_ids, attention_mask, entity_ids, trigger_ids):
        bert_outputs = self.bert(input_ids, attention_mask=attention_mask)
        outputs, _ = self.lstm(bert_outputs.last_hidden_state)
        # Return raw logits: BCEWithLogitsLoss applies the sigmoid itself.
        entity_outputs = self.entity_classifier(outputs).squeeze()
        trigger_outputs = self.trigger_classifier(outputs).squeeze()

        entity_outputs = entity_outputs[attention_mask == 1][1:-1]
        trigger_outputs = trigger_outputs[attention_mask == 1][1:-1]
        return entity_outputs, trigger_outputs

    def train_model(self, train_loader, valid_loader, num_epochs, lr):
        self.to(self.device)
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        loss_function = torch.nn.BCEWithLogitsLoss()
        for epoch in range(num_epochs):
            # clear_output()
            print(f"Epoch {epoch + 1}")
            self.train()
            total_loss = 0
            for data in tqdm(train_loader):
                input_ids = data['input_ids'].to(self.device)
                attention_mask = data['attention_mask'].to(self.device)
                entity_ids = data['entity_ids'].to(self.device)
                trigger_ids = data['trigger_ids'].to(self.device)
                outputs = self(input_ids, attention_mask, entity_ids, trigger_ids)

                entity_loss = loss_function(outputs[0], torch.nn.functional.one_hot(entity_ids, 13).float())
                trigger_loss = loss_function(outputs[1], torch.nn.functional.one_hot(trigger_ids, 2).float())
                loss = entity_loss + trigger_loss
                total_loss += loss.item()

                # Clear stale gradients, backpropagate the joint loss, then update.
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            avg_train_loss = total_loss / len(train_loader)
            valid_loss = self.evaluate_model(valid_loader)
            print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {avg_train_loss:.6f} | Valid Loss: {valid_loss:.6f}")
            print()


    def evaluate_model(self, valid_loader):
        self.eval()
        total_loss = 0
        loss_function = torch.nn.BCEWithLogitsLoss()

        with torch.no_grad():
            for data in valid_loader:
                input_ids = data['input_ids'].to(self.device)
                attention_mask = data['attention_mask'].to(self.device)
                entity_ids = data['entity_ids'].to(self.device)
                trigger_ids = data['trigger_ids'].to(self.device)
                outputs = self(input_ids, attention_mask, entity_ids, trigger_ids)

                entity_loss = loss_function(outputs[0], torch.nn.functional.one_hot(entity_ids, 13).float())
                trigger_loss = loss_function(outputs[1], torch.nn.functional.one_hot(trigger_ids, 2).float())
                loss = entity_loss + trigger_loss
                total_loss += loss.item()

        avg_valid_loss = total_loss / len(valid_loader)
        return avg_valid_loss


In [None]:
model = EntityEventModel(256, 0.2, device)
model.train_model(train_dataset, dev_dataset, 5, 0.01)

Epoch 1


100%|██████████| 6939/6939 [02:52<00:00, 40.32it/s]


Epoch 1/5 | Train Loss: 1.656332 | Valid Loss: 1.656118

Epoch 2


100%|██████████| 6939/6939 [02:52<00:00, 40.19it/s]


Epoch 2/5 | Train Loss: 1.656320 | Valid Loss: 1.656118

Epoch 3


100%|██████████| 6939/6939 [02:52<00:00, 40.23it/s]


Epoch 3/5 | Train Loss: 1.656289 | Valid Loss: 1.656118

Epoch 4


100%|██████████| 6939/6939 [02:54<00:00, 39.78it/s]


Epoch 4/5 | Train Loss: 1.656287 | Valid Loss: 1.656118

Epoch 5


100%|██████████| 6939/6939 [02:58<00:00, 38.97it/s]


Epoch 5/5 | Train Loss: 1.656327 | Valid Loss: 1.656118
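
The log above reports only loss values. To compare models on trigger detection, token-level precision, recall and F1 for the `EVENT` label could be computed from gold and predicted tags. The function below is a minimal sketch with made-up tag sequences; it does not report results of any model in this notebook.

```python
def precision_recall_f1(gold, pred, positive="EVENT"):
    """Token-level precision, recall and F1 for one positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up gold and predicted tags for a five-token sentence.
gold = ["O", "EVENT", "O", "EVENT", "O"]
pred = ["O", "EVENT", "EVENT", "O", "O"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```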



## Part 5. [1 point] Summary
Write a brief summary of the work done. Compare the results of all the models you developed. What helped you complete the work, and what was missing?

**Part 1.**
A frequency analysis of entities and events was carried out. The events were also grouped into clusters.

**Part 2.**
The links to the corpus and to the paper on named entity extraction helped with this part.

Part 2 covers entity extraction.
All the models take a long time to run but give reasonably good results.

There was not enough time for deeper work with the models, because training takes a long time. Debugging data conversion also took a lot of time.
We would also like the seminars to cover more advanced material than they do now; they did not help with this assignment at all. Alternatively, the assignment could start with more basic models, so that the principles are understood sooner before moving on to more complex ones.

Advantages:
1. CNNBiLSTMCRF gave good quality, since it combines semantic and syntactic representations.
2. After adding BERT the quality remained good, since BERT is one of the strongest models for text classification.

Disadvantages:
1. Memory and time

**Part 3.**
Part 3 covers extraction of event triggers.
The links to the corpus helped with this part.

There were noticeable problems with data conversion during preprocessing.

Advantages:
1. Good model quality when using BERT, since BERT is one of the strongest models for text classification.

Disadvantages:
1. Memory and time

**Part 4.**
In this part we attempted to build a model for detecting entities and triggers using BERT and BiLSTM.

Advantages:

Two separate linear heads on top of the shared encoder give the entity and trigger classifiers their own parameters, which should help the overall performance of the model.

Disadvantages:

The model, including the pretrained BERT model, can be too large to run on devices with limited resources (this is exactly what happened to us).

To improve the model, it would be useful to try more complex architectures, such as transformers with nested masks, which are better able to capture dependencies between tokens in a sentence.
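
The two classifiers are trained through a single objective, the sum of the entity loss and the trigger loss. The sketch below computes such a joint loss from raw logits in plain Python. It uses softmax cross-entropy, which is our assumption (the notebook itself uses `BCEWithLogitsLoss` on one-hot targets), and the `trigger_weight` parameter is hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one token from raw logits (log-softmax inside)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def joint_loss(entity_logits, trigger_logits, entity_ids, trigger_ids, trigger_weight=1.0):
    """Mean per-token entity loss plus weighted mean per-token trigger loss."""
    n = len(entity_ids)
    entity_loss = sum(cross_entropy(l, t) for l, t in zip(entity_logits, entity_ids)) / n
    trigger_loss = sum(cross_entropy(l, t) for l, t in zip(trigger_logits, trigger_ids)) / n
    return entity_loss + trigger_weight * trigger_loss

# Two tokens, 3 entity classes and 2 trigger classes (toy sizes and numbers).
entity_logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]]
trigger_logits = [[1.5, -1.5], [0.0, 0.0]]
print(joint_loss(entity_logits, trigger_logits, [0, 2], [0, 1]))
```

The loss takes logits rather than probabilities: the softmax (or sigmoid) is applied inside the loss, so the heads should not apply it a second time.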