## Демонстрация классификатора с энкодером FastText и иерархическим custom loss.

Импортируем необходимые библиотеки и модули, в том числе, модули мпровизированной HierarchicalLibrary, в которых содержатся необходимые для работы иерархического классификатора классы.

In [1]:
from time import time
import math
import os
from pathlib import Path
import pickle
import tqdm
import pandas as pd
import numpy as np
import csv
import fasttext
from gensim.utils import simple_preprocess
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import lr_scheduler
from sklearn.preprocessing import LabelEncoder

from HierarchicalLibrary import Classifier, CategoryTree, TextProcessor
from HierarchicalLibrary.Encoders import LdaEncoder, NavecEncoder, FasttextEncoder, BertEncoder

In [30]:
SEED = 1

# Method for increasing the weight of the first words of title
def word_pyramid(string: str, min_n_words: int, max_n_words: int) -> list:
    result = []
    split = string.split(' ')
    for i in range(min_n_words, max_n_words+1):
        result += split[:i]
    return ' '.join(result)

def prepare_data(full_train_data: pd.DataFrame, seed: int, valid_size: int):
    data_full = full_train_data.sample(frac=1, random_state=seed).copy()
    data_full.drop(['rating', 'feedback_quantity'], axis=1, inplace=True)
    data_full.title = data_full.title.astype('string')
    data_full.short_description = data_full.short_description.astype('string')
    data_full.fillna(value='', inplace=True)
    data_full.name_value_characteristics = data_full.name_value_characteristics.astype('string')
    data_full = data_full.assign(Document=[str(x) + ' ' + str(y) + ' ' + str(z) + ' ' + word_pyramid(x, 2, 3) for x, y, z in zip(data_full['title'], data_full['short_description'], data_full['name_value_characteristics'])])
    data_full.drop(['title', 'short_description', 'name_value_characteristics'], axis=1, inplace=True)
    data_full.Document = data_full.Document.astype('string')

    data = data_full[:-valid_size].reset_index(drop=True)
    data_valid = data_full[-valid_size:].reset_index(drop=True)
    return data, data_valid

def set_seeds(seed: int):  
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True


Загружаем данные

In [31]:
cat_tree_df = pd.read_csv('categories_tree.csv', index_col=0)
full_train_data = pd.read_parquet('train.parquet')

Подготавливаем полный, тренировочный и валидационный датасеты:
перемешиваем данные в фрейме,
удаляем колонки рейтинга и кол-ва отзывов,
корректируем типы данных колонок,
заполняем пропущенные значения,
текст из колонок 'title', 'short_description' и 'name_value_characteristics' объединяем в колонку "Document".

In [32]:
set_seeds(SEED)
data, data_valid = prepare_data(full_train_data, seed=SEED, valid_size=4000)

In [33]:
data

Unnamed: 0,id,category_id,Document
0,1181186,12350,Маска Masil для объёма волос 8ml /Корейская ко...
1,304936,12917,Силиконовый дорожный контейнер футляр чехол дл...
2,816714,14125,"Тканевая маска для лица с муцином улитки, 100%..."
3,1437391,11574,Браслеты из бисера Браслеты из бисера. Брасле...
4,1234938,12761,Бальзам HAUTE COUTURE LUXURY BLOND для блондир...
...,...,...,...
279447,564872,11635,"Крем-баттер для рук и тела MS.NAILS, 250 мл Кр..."
279448,1002594,12476,"Цепочка на шею, 50 см Красивые, легкие и очень..."
279449,988538,12302,Обложка на паспорт кожаная Кожаная обложка на ...
279450,1014080,13407,Открытка средняя двойная на татарском языке Р...


### Энкодер

Инициализируем объект процессора текста (это класс, который управляет лемматизацией и расчетами векторов скрытых представлений текстов, "эмбеддингов")

In [34]:
text_processor = TextProcessor(
    add_stop_words=[',', '.', '', '|', ':', '"', '/', ')', '(', 'a', 'х', '(:', '):', ':(', ':)', 'и']
)

Следующий код читает документы из датафрейма, выполняет токенизацию и лемматизацию средствами пакета natasha, затем, сохраняет леммы в собственную переменную TextProcessor.texts. 
Лемматизация выполняется достаточно долго, поэтому сохраняем данные на диск:

In [35]:
text_processor.simple_lemmatize_data(data, document_col='Document', id_col='id')
text_processor.save_lemms_data('full_train_data_s')

Lemmatize: 100%|██████████| 279452/279452 [00:15<00:00, 18034.98it/s]


Загружаем леммы с диска:

In [36]:
text_processor.load_lemms_data('full_train_data_s')

Загружаем обученную модель fasttext (обучена отдельно).

In [37]:
fasttext_encoder = FasttextEncoder()
fasttext_encoder.load_model('fasttext_model_45_s')



In [38]:
fasttext_encoder.transform([['foo']]).shape[1]

45

Используя встроенный метод энкодера, формируем словарь эмбеддингов товаров вида {good_id(int) : embedding(np.array)}. Передаем интересующие нас энкодеры. 

In [39]:
encoders=[fasttext_encoder]

In [40]:
embeddings_dict = text_processor.make_embeddings_dict(encoders=encoders)

Encoding: 0


Проверяем полную размерность составного эмбеддинга:

In [41]:
embeddings_dict[next(iter(embeddings_dict))].shape

(45,)

Сохраняем словарь эмбеддингов при необходимости - загружаем сохранённый:

In [25]:
#path = os.path.join(Path(".").parent, 'Hierarhical_no_catboost', '50000_set_embs_dict.pickle')
with open('embs_dict_all_enc.pickle', 'wb') as f:
    pickle.dump(embeddings_dict, f)

In [26]:
#path = os.path.join(Path(".").parent, 'Hierarhical_no_catboost', '50000_set_embs_dict.pickle')
with open('embs_dict_all_enc.pickle', 'rb') as f:
    embeddings_dict = pickle.load(f)

### Дерево каталога

Инициализируем дерево каталога - CategoryTree() - это класс, который хранит все узлы, необходимую информацию для обучения, а также реализует алгоритмы заполнения дерева, обхода при инференсе для определения категории товара. 
Добавляем узлы из таблицы categories_tree.csv, затем, добавляем товары из тренировочной выборки.

In [42]:
cat_tree = CategoryTree()
cat_tree.add_nodes_from_df(cat_tree_df, parent_id_col='parent_id', title_col='title')
cat_tree.add_goods_from_df(data, category_id_col='category_id', good_id_col='id')

Формируем массив эмбеддингов для тестирования

In [43]:
begin_example = 0
end_example = 4000
valid_documents = data_valid.Document.tolist()[begin_example:end_example]
valid_target = data_valid.category_id.tolist()[begin_example:end_example]
embs_valid = text_processor.get_embeddings(valid_documents, encoders=encoders)

## PyTorch

In [44]:
class KEdataset(Dataset):
    def __init__(self, data: pd.DataFrame, 
                 embeddings_dict: dict = None,
                 document_list: list = None,
                 encoders = None,
                 id_col: str = 'id', 
                 category_col: str = None, 
                 document_col = None,
                 text_processor: object = None,
                 mode: str = 'test',
                 label_encoder: object = None,
                 simple_lemms: bool = False) -> None:
        super().__init__()
        
        if embeddings_dict:
            self.X = torch.tensor(np.array(
                data[id_col].apply(lambda good_id: embeddings_dict[good_id]).tolist()), dtype=torch.float32)
        else:
            self.X = torch.tensor(text_processor.get_embeddings(
                data[document_col].tolist(), 
                encoders=encoders, 
                simple_lemms=simple_lemms), dtype=torch.float32)
        
        self.mode = mode

        if self.mode not in ['train', 'val', 'test']:
            print(f"{self.mode} is not correct; correct modes: {['train', 'val', 'test']}")
            raise NameError
            
        if label_encoder:
            self.label_encoder = label_encoder
        else:
            self.label_encoder = LabelEncoder()

        if self.mode == 'train':
            self.labels = data[category_col].tolist()
            self.label_encoder = LabelEncoder()
            self.label_encoder.fit(self.labels)
        elif self.mode == 'val':
            self.labels = data[category_col].tolist()
            self.label_encoder = label_encoder            
        return
        
    def __len__(self):
        return self.X.shape[0]
  
    def __getitem__(self, index):
        
        x = self.X[index]

        if self.mode == 'test':
            return x
        else:
            label = self.labels[index]
            label_id = self.label_encoder.transform([label])
            y = label_id.item()
            return x, y
    
    @property
    def dim(self):
        return train_dataset.X.shape[1]
        


In [45]:
train_dataset = KEdataset(data=data, 
                          embeddings_dict=embeddings_dict, 
                          id_col='id', 
                          category_col='category_id', 
                          mode='train')

valid_dataset = KEdataset(data=data_valid, 
                          encoders=encoders, 
                          document_col='Document', 
                          category_col='category_id', 
                          mode='val', 
                          text_processor=text_processor,
                          label_encoder=train_dataset.label_encoder, 
                          simple_lemms=True)


Один слой нейросети (аналогично тому, что используется в fasttext)

In [46]:
class Fcnn(nn.Module):
  
    def __init__(self, emb_dim: int, hidden_dim: int, n_classes: int, dropout: float = 0.3):
        super().__init__()
        self.out = nn.Linear(emb_dim, n_classes)
        return
  
    def forward(self, x):
        logits = self.out(x)
        return logits

Рассчитываем матрицу расстояний между листьями

In [189]:
def get_distance_matrix(cat_labels: np.array, encoder: object, power: float = 1.0) -> np.array:
    encode_labels = np.vstack((encoder.transform(cat_labels), cat_labels)).T
    distance_matrix = np.zeros((encode_labels.shape[0], encode_labels.shape[0]))

    for enc_label_1, label_1 in encode_labels:
        for enc_label_2, label_2 in encode_labels:
            path_1 = cat_tree.get_id_path_set(label_1)
            path_2 = cat_tree.get_id_path_set(label_2)
            intersect = path_1.intersection(path_2)
            value = (len(path_1)+len(path_1))/2 - len(intersect) + 1
            distance_matrix[enc_label_1][enc_label_2] = value
    
    mean_value = distance_matrix.mean()

    for enc_label, label in encode_labels:
        distance_matrix[enc_label][enc_label] = mean_value
        
    #distance_matrix = np.log(distance_matrix)
    return torch.tensor(distance_matrix / distance_matrix.mean())

In [190]:
cat_labels = data.category_id.value_counts().index.values
distance_matrix = get_distance_matrix(cat_labels=cat_labels, encoder=train_dataset.label_encoder, power=0.75)

In [191]:
def hierarhical_cross_entropy(input, target):
    log_prob = -1.0 * F.log_softmax(input*distance_matrix[target], 1)
    loss = log_prob.gather(1, target.unsqueeze(1))
    loss = loss.mean()
    return loss

def fit_epoch(model, train_loader, criterion, optimizer, sheduler, device: str = 'cpu'):
    running_loss = 0.0
    running_corrects = 0
    processed_data = 0
  
    for inputs, labels in train_loader:
        inputs = inputs.to(torch.device(device))
        labels = labels.to(torch.device(device))
        optimizer.zero_grad()

        outputs = model(inputs)
        #target = F.one_hot(labels, 1231).float()
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        preds = torch.argmax(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        processed_data += inputs.size(0)
    sheduler.step()
    train_loss = running_loss / processed_data
    train_acc = running_corrects.cpu().numpy() / processed_data
    return train_loss, train_acc

def eval_epoch(model, val_loader, criterion, device: str = 'cpu'):
    model.eval()
    running_loss = 0.0
    running_corrects = 0
    processed_size = 0
    val_preds = []
    val_labels = []

    for inputs, labels in val_loader:
        inputs = inputs.to(torch.device(device))
        labels = labels.to(torch.device(device))

        with torch.set_grad_enabled(False):
            outputs = model(inputs)
            #target = F.one_hot(labels, 1231).float()
            loss = criterion(outputs, labels)
            preds = torch.argmax(outputs, 1)

        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        processed_size += inputs.size(0)
    val_loss = running_loss / processed_size
    val_acc = running_corrects.double() / processed_size
    val_preds = []
    val_labels = []
    return val_loss, val_acc

def train(train_dataset, 
          val_dataset, 
          model, epochs, 
          batch_size, 
          num_workers=0, 
          lr: float = 0.01, lr_mult: float = 0.1,
          weight_decay=0.0):
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    history = []
    log_template = "\nEpoch {ep:03d} train_loss: {t_loss:0.4f} \
    val_loss {v_loss:0.4f} train_acc {t_acc:0.4f} val_acc {v_acc:0.4f}"

    with tqdm.tqdm(desc="epoch", total=epochs) as pbar_outer:
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
        gamma = lr_mult ** (2/epochs)
        sheduler = lr_scheduler.StepLR(opt, step_size=2, gamma=gamma, verbose=True)
        #criterion = nn.CrossEntropyLoss()
        criterion = hierarhical_cross_entropy
        
        for epoch in range(epochs):
            train_loss, train_acc = fit_epoch(model, train_loader, criterion, opt, sheduler)
            print("loss", train_loss)
            
            val_loss, val_acc = eval_epoch(model, val_loader, criterion)
            history.append((train_loss, train_acc, val_loss, val_acc))
            
            pbar_outer.update(1)
            tqdm.tqdm.write(log_template.format(ep=epoch+1, t_loss=train_loss,\
                                           v_loss=val_loss, t_acc=train_acc, v_acc=val_acc))
            
    return history

In [113]:
def predict(model, test_loader, device: str = 'cpu'):
    with torch.no_grad():
        logits = []
    
        for inputs in test_loader:
            inputs = inputs.to(torch.device(device))
            model.eval()
            outputs = model(inputs).cpu()
            logits.append(outputs)
            
    probs = nn.functional.softmax(torch.cat(logits), dim=-1).numpy()
    return probs

In [53]:
n_classes = len(np.unique(data.category_id.values))
simple_cnn = Fcnn(n_classes=n_classes, 
                  emb_dim=train_dataset.dim, 
                  hidden_dim=2048, dropout=0.3).to(torch.device("cpu"))

print("we will classify :{}".format(n_classes))
print(simple_cnn)

we will classify :1231
Fcnn(
  (out): Linear(in_features=45, out_features=1231, bias=True)
)


In [195]:
history = train(train_dataset, 
                 valid_dataset, 
                 model=simple_cnn, 
                 epochs=10, 
                 batch_size=18630, 
                 num_workers=0, 
                 lr=0.015, lr_mult = 0.01, 
                 weight_decay=0.3) 
#

epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Adjusting learning rate of group 0 to 1.5000e-02.
Adjusting learning rate of group 0 to 1.5000e-02.
loss 0.007412626490913459


epoch:  10%|█         | 1/10 [01:03<09:34, 63.88s/it]


Epoch 001 train_loss: 0.0074     val_loss 0.3093 train_acc 0.9774 val_acc 0.8530
Adjusting learning rate of group 0 to 5.9716e-03.
loss 0.00690826336192386


epoch:  20%|██        | 2/10 [01:55<07:32, 56.56s/it]


Epoch 002 train_loss: 0.0069     val_loss 0.3041 train_acc 0.9771 val_acc 0.8575
Adjusting learning rate of group 0 to 5.9716e-03.
loss 0.006287892233859151


epoch:  30%|███       | 3/10 [02:45<06:15, 53.66s/it]


Epoch 003 train_loss: 0.0063     val_loss 0.3045 train_acc 0.9780 val_acc 0.8585
Adjusting learning rate of group 0 to 2.3773e-03.
loss 0.006100818360407995


epoch:  40%|████      | 4/10 [03:48<05:43, 57.23s/it]


Epoch 004 train_loss: 0.0061     val_loss 0.3046 train_acc 0.9795 val_acc 0.8585
Adjusting learning rate of group 0 to 2.3773e-03.
loss 0.0060397277149275555


epoch:  50%|█████     | 5/10 [04:55<05:03, 60.68s/it]


Epoch 005 train_loss: 0.0060     val_loss 0.3038 train_acc 0.9798 val_acc 0.8605
Adjusting learning rate of group 0 to 9.4644e-04.
loss 0.006020331431680154


epoch:  60%|██████    | 6/10 [06:02<04:12, 63.07s/it]


Epoch 006 train_loss: 0.0060     val_loss 0.3033 train_acc 0.9799 val_acc 0.8602
Adjusting learning rate of group 0 to 9.4644e-04.
loss 0.005965916720878712


epoch:  70%|███████   | 7/10 [07:09<03:13, 64.40s/it]


Epoch 007 train_loss: 0.0060     val_loss 0.3031 train_acc 0.9799 val_acc 0.8608
Adjusting learning rate of group 0 to 3.7678e-04.
loss 0.005965177740283898


epoch:  80%|████████  | 8/10 [08:17<02:10, 65.49s/it]


Epoch 008 train_loss: 0.0060     val_loss 0.3031 train_acc 0.9797 val_acc 0.8605
Adjusting learning rate of group 0 to 3.7678e-04.
loss 0.005941477542317348


epoch:  90%|█████████ | 9/10 [09:26<01:06, 66.47s/it]


Epoch 009 train_loss: 0.0059     val_loss 0.3029 train_acc 0.9799 val_acc 0.8608
Adjusting learning rate of group 0 to 1.5000e-04.
loss 0.005941414092445585


epoch: 100%|██████████| 10/10 [10:34<00:00, 63.46s/it]


Epoch 010 train_loss: 0.0059     val_loss 0.3029 train_acc 0.9799 val_acc 0.8608





### Тестирование модели

In [196]:
test_dataset = KEdataset(data=data_valid, 
                         encoders=encoders, 
                         document_col='Document', 
                         category_col='category_id', 
                         mode='test', 
                         text_processor=text_processor,
                         label_encoder=train_dataset.label_encoder, 
                         simple_lemms=True)
val_preds = predict(simple_cnn, 
                    DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=0)
                    )

In [197]:
pred_torch_valid = list(train_dataset.label_encoder.inverse_transform(val_preds.argmax(axis=1)))
print(f'Validation hF1={cat_tree.hF1_score(valid_target, pred_torch_valid):.3f}')

Validation hF1=0.916
