**Goal:** build and evaluate a classifier of user queries (intent classification).

**The data and its annotations** are taken from the repository accompanying the paper "Slot-Gated Modeling for Joint Slot Filling and Intent Prediction" (https://aclanthology.org/N18-2118/).

For the neural networks we will use the PyTorch framework and an earlier release of torchtext (0.9), which still provides the `torchtext.legacy` API.

In [1]:
!pip3 install -U torch==1.8
!pip3 install -U torchtext==0.9

Collecting torch==1.8
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0+cu113
    Uninstalling torch-1.11.0+cu113:
      Successfully uninstalled torch-1.11.0+cu113
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.12.0+cu113 requires torch==1.11.0, but you have torch 1.8.0 which is incompatible.
torchtext 0.12.0 requires torch==1.11.0, but you have torch 1.8.0 which is incompatible.
torchaudio 0.11.0+cu113 requires torch==1.11.0, but you have torch 1.8.0 which is incompatible.
Successfully installed torch-1.8.0
Collecting torchtext==0.9
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)

In [2]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
import math
from tqdm import tqdm
from torchtext.legacy.data import Field, LabelField, Example, Dataset, BucketIterator

device = torch.device('cpu')

In [3]:
# Download the data by cloning the paper's repository with git.
# We will work with the SNIPS dataset (https://paperswithcode.com/dataset/snips)
!git clone https://github.com/MiuLab/SlotGated-SLU.git

Cloning into 'SlotGated-SLU'...
remote: Enumerating objects: 51, done.
remote: Total 51 (delta 0), reused 0 (delta 0), pack-reused 51
Unpacking objects: 100% (51/51), done.


Load the data into memory.

In [4]:
# Loads one part of the dataset.
# The data is already split into train / valid / test,
# so we call this function once per split.
def load_data_from_disc(base_path):
    input_data_path = os.path.join(base_path, 'seq.in')
    output_data_path = os.path.join(base_path, 'label')
    # The queries contain non-ASCII characters (e.g. 'clásicos'), so the encoding is set explicitly
    with open(input_data_path, "r", encoding="utf-8") as inputs, open(output_data_path, "r", encoding="utf-8") as intents:
        return [(words.strip().split(), intent.strip()) for words, intent in zip(inputs, intents)]
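
The loader expects two parallel files in each split directory: `seq.in`, with one whitespace-separated query per line, and `label`, with the intent of the query on the same line number. The sketch below illustrates that layout on a temporary directory; the two-line miniature split is made up for the example (the queries themselves are taken from the training data), and the function is repeated only to keep the snippet self-contained.

```python
import os
import tempfile


def load_data_from_disc(base_path):
    # Same logic as the notebook's loader: line i of seq.in is paired with line i of label.
    input_data_path = os.path.join(base_path, 'seq.in')
    output_data_path = os.path.join(base_path, 'label')
    with open(input_data_path, "r", encoding="utf-8") as inputs, \
         open(output_data_path, "r", encoding="utf-8") as intents:
        return [(words.strip().split(), intent.strip()) for words, intent in zip(inputs, intents)]


# Hypothetical miniature split written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'seq.in'), 'w', encoding='utf-8') as f:
        f.write('play the song little robin redbreast\nfind fish story\n')
    with open(os.path.join(tmp_dir, 'label'), 'w', encoding='utf-8') as f:
        f.write('PlayMusic\nSearchScreeningEvent\n')
    sample = load_data_from_disc(tmp_dir)

print(sample)
```

Because the two files are combined with `zip`, a missing or extra line in one of them would silently misalign queries and labels, so it is worth checking that both files have the same number of lines.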

In [5]:
# Load the three splits
train_data = load_data_from_disc('SlotGated-SLU/data/snips/train/')
val_data = load_data_from_disc('SlotGated-SLU/data/snips/valid/')
test_data = load_data_from_disc('SlotGated-SLU/data/snips/test/')

In [6]:
# Check how much data we have
len(train_data), len(test_data)

(13084, 700)

In [7]:
# Each example is a query split into words, together with its intent label
train_data[:16]

[(['listen', 'to', 'westbam', 'alumb', 'allergic', 'on', 'google', 'music'],
  'PlayMusic'),
 (['add', 'step', 'to', 'me', 'to', 'the', '50', 'clásicos', 'playlist'],
  'AddToPlaylist'),
 (['i',
   'give',
   'this',
   'current',
   'textbook',
   'a',
   'rating',
   'value',
   'of',
   '1',
   'and',
   'a',
   'best',
   'rating',
   'of',
   '6'],
  'RateBook'),
 (['play', 'the', 'song', 'little', 'robin', 'redbreast'], 'PlayMusic'),
 (['please',
   'add',
   'iris',
   'dement',
   'to',
   'my',
   'playlist',
   'this',
   'is',
   'selena'],
  'AddToPlaylist'),
 (['add',
   'slimm',
   'cutta',
   'calhoun',
   'to',
   'my',
   'this',
   'is',
   'prince',
   'playlist'],
  'AddToPlaylist'),
 (['i', 'want', 'to', 'listen', 'to', 'seventies', 'music'], 'PlayMusic'),
 (['play', 'a', 'popular', 'chant', 'by', 'brian', 'epstein'], 'PlayMusic'),
 (['find', 'fish', 'story'], 'SearchScreeningEvent'),
 (['book', 'a', 'spot', 'for', '3', 'in', 'mt'], 'BookRestaurant'),
 (['i',
   'n

To work with text we will use the torchtext module. A tutorial on it is available here: https://gzwq.github.io/2019/07/16/NLP-TorchText/ or here: https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb

In [9]:
# Build torchtext datasets and the corresponding iterators from the loaded data
tokens_field = Field()
intent_field = LabelField()

fields = [('tokens', tokens_field), ('intent', intent_field)]

train_dataset = Dataset([Example.fromlist(example, fields) for example in train_data], fields)
val_dataset = Dataset([Example.fromlist(example, fields) for example in val_data], fields)
test_dataset = Dataset([Example.fromlist(example, fields) for example in test_data], fields)

tokens_field.build_vocab(train_dataset)
intent_field.build_vocab(train_dataset)

train_iter, val_iter, test_iter = BucketIterator.splits(datasets=(train_dataset, val_dataset, test_dataset), batch_sizes=(32, 128, 128), shuffle=True, sort=False)
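
To make the role of the vocabulary and the iterators more concrete, here is a simplified plain-Python sketch of the two steps they perform: assigning an integer index to every token and padding the queries of a batch to a common length. It is an illustration under stated assumptions, not torchtext's implementation: the helper names `build_vocab` and `numericalize`, the toy queries, and the alphabetical tie-breaking are all hypothetical.

```python
from collections import Counter

# Hypothetical tokenized queries standing in for the training set.
queries = [
    ['play', 'the', 'song', 'little', 'robin', 'redbreast'],
    ['find', 'fish', 'story'],
    ['play', 'seventies', 'music'],
]


def build_vocab(tokenized_queries, specials=('<unk>', '<pad>')):
    # Special tokens get the first indices, followed by tokens in order of
    # descending frequency (ties broken alphabetically, an assumption of this sketch).
    counts = Counter(token for query in tokenized_queries for token in query)
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(list(specials) + ordered)}


def numericalize(batch, vocab):
    # Map tokens to indices and right-pad every query to the longest one in the batch.
    max_len = max(len(query) for query in batch)
    unk, pad = vocab['<unk>'], vocab['<pad>']
    return [[vocab.get(token, unk) for token in query] + [pad] * (max_len - len(query))
            for query in batch]


vocab = build_vocab(queries)
batch = numericalize([['play', 'fish', 'story'], ['find', 'unknownword']], vocab)
print(batch)
```

Tokens that never occurred in the training data map to the `<unk>` index, which is why the vocabularies are built from `train_dataset` only. In the notebook the `Field` is created with default settings, so batches arrive as (sequence length, batch size) tensors and are transposed before being passed to the model.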

Now we describe the neural-network part of the project:
+ The query classification model
+ A class that makes training it convenient
+ Training
+ Testing

In [10]:
# The model is based on a recurrent neural network
class Model(nn.Module):
    def __init__(self, size, labels_count):
        super().__init__()
        # Embedding layer that maps token indices to 64-dimensional vectors
        self.embeddings_layer = nn.Embedding(size, 64)
        # Bidirectional recurrent network (single-layer LSTM)
        self.lstm_layer = nn.LSTM(64, 128, batch_first=True, bidirectional=True, num_layers=1)
        # Output linear layer; 256 = 2 directions x 128 hidden units
        self.out = nn.Linear(256, labels_count)

    # Computes the network output (class logits) for a batch of token indices
    def forward(self, inputs):
        # (batch, seq_len) -> (batch, seq_len, 64)
        words_vectors = self.embeddings_layer(inputs)

        # hidden_state has shape (2, batch, 128): final states of the forward and backward passes
        _, (hidden_state, _) = self.lstm_layer(words_vectors)
        return self.out(torch.cat((hidden_state[0], hidden_state[1]), dim=1))
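
As a rough size check for this architecture, the number of trainable parameters can be worked out by hand. The sketch below does the arithmetic, assuming PyTorch's usual LSTM parameter layout (per direction: an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, each covering four gates). The function `count_parameters` is a hypothetical helper, and `vocab_size=10_000` is a placeholder; the notebook uses `len(tokens_field.vocab)`, whose value is not shown.

```python
def count_parameters(vocab_size, labels_count, embedding_dim=64, hidden_size=128):
    # Embedding table: one vector per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # One LSTM direction: weight_ih (4H x I), weight_hh (4H x H), bias_ih (4H), bias_hh (4H).
    lstm_one_direction = 4 * hidden_size * (embedding_dim + hidden_size + 2)
    # Bidirectional, single layer: two independent directions.
    lstm = 2 * lstm_one_direction
    # Linear layer on the concatenated final states: weights plus biases.
    linear = 2 * hidden_size * labels_count + labels_count
    return {
        'embedding': embedding,
        'lstm': lstm,
        'linear': linear,
        'total': embedding + lstm + linear,
    }


# vocab_size is a placeholder value; labels_count=7 matches the seven SNIPS intents.
sizes = count_parameters(vocab_size=10_000, labels_count=7)
print(sizes)
```

With these numbers the LSTM and the output layer stay fixed in size, so the total is dominated by the embedding table and scales linearly with the vocabulary.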

In [11]:
# Helper class for training the model
class ModelTrainer():
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer
        
    def on_epoch_begin(self, is_train, name, batches_count):
        self.epoch_loss = 0
        self.correct_count, self.total_count = 0, 0
        self.is_train = is_train
        self.name = name
        self.batches_count = batches_count
        self.model.train(is_train)
        
    def on_epoch_end(self):
        return '{:>5s} Loss = {:.5f}, Accuracy = {:.2%}'.format(self.name, self.epoch_loss / self.batches_count, self.correct_count / self.total_count)
        
    def on_batch(self, batch):
        output = self.model(batch.tokens.transpose(0, 1))
        loss = self.criterion(output, batch.intent)
        predicted_class = torch.max(output, axis=1)[1]
        self.total_count += predicted_class.size(0)
        self.correct_count += torch.sum(predicted_class == batch.intent).item()
        if self.is_train:
            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()
        self.epoch_loss += loss.item()

In [12]:
# Commonly used helper functions for training a neural network
def do_epoch(trainer, data_iter, is_train, name=None):
    trainer.on_epoch_begin(is_train, name, batches_count=len(data_iter))
    with torch.autograd.set_grad_enabled(is_train):
        with tqdm(total=trainer.batches_count) as progress_bar:
            for batch in data_iter:
                # on_batch returns nothing, so there is no per-batch description to display
                trainer.on_batch(batch)
                progress_bar.update()
                
            epoch_progress = trainer.on_epoch_end()
            progress_bar.set_description(epoch_progress)
            progress_bar.refresh()

def fit(trainer, train_iter, epochs_count=1, val_iter=None):
    best_val_loss = None
    for epoch in range(epochs_count):
        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)
        do_epoch(trainer, train_iter, is_train=True, name=name_prefix + 'Train:')
        
        if val_iter is not None:
            do_epoch(trainer, val_iter, is_train=False, name=name_prefix + '  Val:')

In [13]:
# Run the standard training procedure
model = Model(len(tokens_field.vocab), len(intent_field.vocab)).to(device)  # Model
criterion = nn.CrossEntropyLoss().to(device)  # Loss function
optimizer = optim.Adam(model.parameters())  # Optimizer
trainer = ModelTrainer(model, criterion, optimizer)
fit(trainer, train_iter=train_iter, epochs_count=16, val_iter=val_iter)  # Train the model

[1 / 16] Train: Loss = 0.37851, Accuracy = 88.63%: 100%|██████████| 409/409 [00:16<00:00, 24.27it/s]
[1 / 16]   Val: Loss = 0.10530, Accuracy = 97.00%: 100%|██████████| 6/6 [00:00<00:00, 26.10it/s]
[2 / 16] Train: Loss = 0.08660, Accuracy = 97.39%: 100%|██████████| 409/409 [00:16<00:00, 24.75it/s]
[2 / 16]   Val: Loss = 0.07908, Accuracy = 97.43%: 100%|██████████| 6/6 [00:00<00:00, 28.15it/s]
[3 / 16] Train: Loss = 0.05040, Accuracy = 98.61%: 100%|██████████| 409/409 [00:18<00:00, 22.65it/s]
[3 / 16]   Val: Loss = 0.10154, Accuracy = 97.43%: 100%|██████████| 6/6 [00:00<00:00, 19.78it/s]
[4 / 16] Train: Loss = 0.03145, Accuracy = 99.10%: 100%|██████████| 409/409 [00:16<00:00, 24.76it/s]
[4 / 16]   Val: Loss = 0.12632, Accuracy = 97.57%: 100%|██████████| 6/6 [00:00<00:00, 26.71it/s]
[5 / 16] Train: Loss = 0.01807, Accuracy = 99.40%: 100%|██████████| 409/409 [00:16<00:00, 24.92it/s]
[5 / 16]   Val: Loss = 0.11476, Accuracy = 97.57%: 100%|██████████| 6/6 [00:00<00:00, 25.75it/s]
[6 / 16] T
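
In the visible part of the log, the validation loss is lowest at epoch 2 and rises afterwards while the training loss keeps falling, which usually indicates overfitting. `fit` declares `best_val_loss` but never uses it. The sketch below shows one possible way such tracking could work, applied to the validation losses of epochs 1 to 5 from the log; `best_epoch`, `should_stop`, and the `patience` value are hypothetical and not part of the notebook.

```python
# Validation losses for epochs 1-5, copied from the training log
# (later epochs are cut off in the log and are not included).
val_losses = [0.10530, 0.07908, 0.10154, 0.12632, 0.11476]


def best_epoch(losses):
    # Return the 1-based epoch with the lowest loss, and that loss.
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    return best_index + 1, losses[best_index]


def should_stop(losses, patience):
    # Hypothetical early-stopping rule: stop once the loss has not improved
    # for `patience` consecutive epochs after the best one.
    best_index = min(range(len(losses)), key=lambda i: losses[i])
    return len(losses) - 1 - best_index >= patience


epoch, loss = best_epoch(val_losses)
print(epoch, loss)
print(should_stop(val_losses, patience=3))
```

To use something like this in the notebook, `do_epoch` would have to return the average loss of the epoch, and `fit` could then save the model weights whenever the validation loss improves.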

In [14]:
# Check the accuracy on the test set
do_epoch(trainer, test_iter, is_train=False, name='Test:')

Test: Loss = 0.23342, Accuracy = 95.86%: 100%|██████████| 6/6 [00:00<00:00, 25.73it/s]
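
On the 700 test queries, 95.86% accuracy corresponds to 671 correct predictions. A single overall number does not show which intents are confused with each other, so a per-intent breakdown can be a useful follow-up. The sketch below computes one from two parallel label lists; the helper `per_intent_accuracy` and the label lists are hypothetical and are not actual model output.

```python
from collections import Counter


def per_intent_accuracy(true_labels, predicted_labels):
    # Count, for every true intent, how many examples there are
    # and how many of them were predicted correctly.
    totals, correct = Counter(), Counter()
    for true, predicted in zip(true_labels, predicted_labels):
        totals[true] += 1
        if true == predicted:
            correct[true] += 1
    return {intent: correct[intent] / totals[intent] for intent in totals}


# Hypothetical labels for illustration only.
true_labels = ['PlayMusic', 'PlayMusic', 'RateBook', 'AddToPlaylist', 'AddToPlaylist']
predicted_labels = ['PlayMusic', 'AddToPlaylist', 'RateBook', 'AddToPlaylist', 'AddToPlaylist']

report = per_intent_accuracy(true_labels, predicted_labels)
print(report)
```

In the notebook, the true labels would come from `batch.intent` and the predictions from the arg-max over the model output, both mapped back to names through `intent_field.vocab.itos`.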
