Использовал туториал:
https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f

В этом задании Вам предлагается решить проблему классификации текстов разными методами.

Среди таких методов мы можем предложить Вам:

1) Простой Байесовский классификатор на основе мультиномиальной модели или модели Бернулли

>Достоинства: идейная простота и простота реализации, неплохая интерпретируемость

>Недостатки: относительно слабая предсказательная способность

> Frameworks: `numpy`

2) Логистическая регрессия на основе векторов TF-IDF

>Достоинства: достаточно высокая скорость обучения, простой метод составления эмбеддингов

>Недостатки: также довольно слабая предсказательная способность, слишком высокая размерность задачи

> Frameworks: `sklearn`, `numpy`

3) Логистическая регрессия или нейронная сеть + word2vec embeddings

> Достоинства: оптимальная размерность эмбеддингов, довольно простые модели, сравнительно неплохое качество

> Недостатки: устаревший метод построения эмбеддингов. Эмбеддинги не контекстуальные

> Frameworks: `gensim`, `pytorch`, `sklearn`

4) Рекуррентная нейронная сеть + word2vec:

> Достоинства: Более современная нейронная сеть

> Недостатки: недоступно распараллеливание

> Frameworks: `pytorch`, `gensim`

5) ELMO + любая нейронная сеть

> Достоинства: отличный контекстуальный метод векторизации текстов, мощная модель

> Недостатки: сложность моделей

> Frameworks: `elmo`, `pytorch`

6) Bert + любая нейронная сеть

> Достоинства: отличный контекстуальный метод векторизации текстов, мощная модель

> Недостатки: сложность моделей

> Frameworks: `transformers`, `pytorch`

Вы также можете исследовать любые комбинации методов векторизации и моделей ML, которые сочтете нужными.

Ваша задача: провести сравнительный анализ не менее 3 алгоритмов классификации текстов. Сравнение стоит проводить по следующим параметрам:

- Качество классификации (актуальную метрику выберите самостоятельно)
- Время обучения модели
- Характерное время инференса модели

Данные можно загрузить по ссылке: https://drive.google.com/drive/folders/14hR7Pm2sH28rQttkD906PTLvtwHFLBRm?usp=sharing

Для упрощения Вашей работы предлагаем ряд функций для предобработки текстов.

In [None]:
import re, string
regex = re.compile('[%s]' % re.escape(string.punctuation))
def clear(text: str) -> str:
    text = regex.sub('', text.lower())
    text = re.sub(r'[«»\n]', ' ', text)
    text = text.replace('ё', 'е')
    return text.strip()

In [None]:
import nltk #natural language toolkit
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
eng_stopwords = stopwords.words("english")

remove_stopwords = lambda tokenized_text, stopwords: [w for w in tokenized_text if not w in stopwords]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import numpy

In [None]:

# /content/drive/MyDrive/dev.labels
# /content/drive/MyDrive/dev.texts
# /content/drive/MyDrive/train.labels
# /content/drive/MyDrive/train.texts


In [None]:
def read_file(path, if_text=False):
  data = []
  with open(path) as f:
    if if_text:
      for line in f:
        data.append(' '.join(remove_stopwords(clear(line.strip('\n')).split(' '), eng_stopwords)))
    else:
      for line in f:
        data.append(line.strip())
  return data

test_text = read_file('/content/drive/MyDrive/dev.texts', True)
test_labels = read_file('/content/drive/MyDrive/dev.labels')
train_text = read_file('/content/drive/MyDrive/train.texts', True)
train_labels = read_file('/content/drive/MyDrive/train.labels')

In [None]:
import pandas as pd

df = pd.DataFrame({'category' : train_labels[:1000],
                  'text' : train_text[:1000]})


df_test = pd.DataFrame({'category' : test_labels[:1000],
                  'text' : test_text[:1000]})

df.head()

Unnamed: 0,category,text
0,neg,myth regarding broken mirrors would accurate e...
1,pos,gave movie 10 needed rewarded scary elements a...
2,neg,watching first 20mn blanchesorry couldnt take ...
3,neg,weak plot unlikely car malfunction helpless fu...
4,pos,sidewalk ends 1950br br one ends another begin...


In [None]:
!pip install bert-embedding

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import numpy as np
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy>=1.17
  Using cached numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.14.6
    Uninstalling numpy-1.14.6:
      Successfully uninstalled numpy-1.14.6
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mxnet 1.4.0 requires numpy<1.15.0,>=1.8.2, but you have numpy 1.21.6 which is incompatible.
bert-embedding 1.0.1 requires numpy==1.14.6, but you have numpy 1.21.6 which is incompatible.[0m
Successfully installed numpy-1.21.6


In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

example_text = 'I will watch Memento tonight'
bert_input = tokenizer(example_text, padding='max_length', max_length = 10,
                       truncation=True, return_tensors="pt")

In [None]:
example_text = tokenizer.decode(bert_input.input_ids[0])

print(example_text)

[CLS] I will watch Memento tonight [SEP] [PAD] [PAD]


In [None]:
import torch
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
labels = {'neg':0,
          'pos':1}

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = [labels[label] for label in df['category']]
        self.texts = [tokenizer(text,
                               padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y

In [None]:

from torch import nn
from transformers import BertModel

class BertClassifier(nn.Module):

    def __init__(self, dropout=0.3):

        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 2)
        self.relu = nn.Sigmoid()

    def forward(self, input_id, mask):

        _, pooled_output = self.bert(input_ids= input_id, attention_mask=mask,return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)

        return final_layer

In [None]:
from torch.optim import Adam
from tqdm import tqdm

def train(model, train_data, val_data, learning_rate, epochs):

    train, val = Dataset(train_data), Dataset(val_data)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr= learning_rate)

    if use_cuda:

            model = model.cuda()
            criterion = criterion.cuda()

    for epoch_num in range(epochs):

            total_acc_train = 0
            total_loss_train = 0

            for train_input, train_label in tqdm(train_dataloader):

                train_label = train_label.to(device)

                mask = train_input['attention_mask'].to(device)
                input_id = train_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)

                batch_loss = criterion(output, train_label.long())
                total_loss_train += batch_loss.item()

                acc = (output.argmax(dim=1) == train_label).sum().item()
                total_acc_train += acc

                model.zero_grad()
                batch_loss.backward()
                optimizer.step()

            total_acc_val = 0
            total_loss_val = 0

            with torch.no_grad():

                for val_input, val_label in val_dataloader:

                    val_label = val_label.to(device)
                    mask = val_input['attention_mask'].to(device)
                    input_id = val_input['input_ids'].squeeze(1).to(device)

                    output = model(input_id, mask)

                    batch_loss = criterion(output, val_label.long())
                    total_loss_val += batch_loss.item()

                    acc = (output.argmax(dim=1) == val_label).sum().item()
                    total_acc_val += acc

            print(
                f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_data): .3f} \
                | Train Accuracy: {total_acc_train / len(train_data): .3f} \
                | Val Loss: {total_loss_val / len(val_data): .3f} \
                | Val Accuracy: {total_acc_val / len(val_data): .3f}')

EPOCHS = 10
model = BertClassifier()
LR = 2e-3


np.random.seed(112)
df_train, df_val, _ = np.split(df.sample(frac=1, random_state=42),
                                     [int(.8*len(df)), int(.9*len(df))])

_, _, df_test = np.split(df_test.sample(frac=1, random_state=42),
                                     [int(len(df_test)), int(0*len(df_test))])

train(model, df_train, df_val, LR, EPOCHS)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 400/400 [01:22<00:00,  4.86it/s]


Epochs: 1 | Train Loss:  0.348                 | Train Accuracy:  0.510                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:24<00:00,  4.74it/s]


Epochs: 2 | Train Loss:  0.347                 | Train Accuracy:  0.517                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 3 | Train Loss:  0.347                 | Train Accuracy:  0.519                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 4 | Train Loss:  0.347                 | Train Accuracy:  0.504                 | Val Loss:  0.347                 | Val Accuracy:  0.460


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 5 | Train Loss:  0.347                 | Train Accuracy:  0.521                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:24<00:00,  4.71it/s]


Epochs: 6 | Train Loss:  0.347                 | Train Accuracy:  0.509                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 7 | Train Loss:  0.347                 | Train Accuracy:  0.515                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 8 | Train Loss:  0.347                 | Train Accuracy:  0.511                 | Val Loss:  0.347                 | Val Accuracy:  0.470


100%|██████████| 400/400 [01:24<00:00,  4.71it/s]


Epochs: 9 | Train Loss:  0.347                 | Train Accuracy:  0.516                 | Val Loss:  0.347                 | Val Accuracy:  0.530


100%|██████████| 400/400 [01:25<00:00,  4.70it/s]


Epochs: 10 | Train Loss:  0.347                 | Train Accuracy:  0.484                 | Val Loss:  0.347                 | Val Accuracy:  0.530


In [None]:
def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size=2)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:

        model = model.cuda()

    total_acc_test = 0
    with torch.no_grad():

        for test_input, test_label in test_dataloader:

              test_label = test_label.to(device)
              mask = test_input['attention_mask'].to(device)
              input_id = test_input['input_ids'].squeeze(1).to(device)

              output = model(input_id, mask)

              acc = (output.argmax(dim=1) == test_label).sum().item()
              total_acc_test += acc

    print(f'Test Accuracy: {total_acc_test / len(test_data): .3f}')

evaluate(model, df_test)

Test Accuracy:  0.518
