## Demonstration of the hierarchical classifier without CatBoost

Import the required libraries and modules, including the modules of the improvised KazanExpressLibrary (the `HierarhicalLibrary` module in the import cell), which contain the classes the hierarchical classifier relies on.

In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import pickle
import tqdm

from HierarhicalLibrary import Classifier, CategoryTree, Encoder

Load the data.

In [2]:
cat_tree_df = pd.read_csv('categories_tree.csv', index_col=0)
full_train_data = pd.read_parquet('train.parquet')

Prepare the full, training and validation datasets:
- shuffle the rows of the frame,
- drop the rating and feedback-count columns,
- fix the column data types,
- fill in missing values,
- concatenate the text of the 'title', 'short_description' and 'name_value_characteristics' columns into a "Document" column, taking 'title' twice to increase its weight.

In [3]:
data_full = full_train_data.sample(frac=1, random_state=1).copy()
data_full.drop(['rating', 'feedback_quantity'], axis=1, inplace=True)
data_full.title = data_full.title.astype('string')
data_full.short_description = data_full.short_description.astype('string')
data_full.fillna(value='', inplace=True)
data_full.name_value_characteristics = data_full.name_value_characteristics.astype('string')
data_full = data_full.assign(Document=[str(x) + ' ' + str(y) + ' ' + str(z) + ' ' + str(x) for x, y, z in zip(data_full['title'], data_full['short_description'], data_full['name_value_characteristics'])])
data_full.drop(['title', 'short_description', 'name_value_characteristics'], axis=1, inplace=True)
data_full.Document = data_full.Document.astype('string')

data = data_full[:50000].reset_index(drop=True)
data_valid = data_full[-4000:].reset_index(drop=True)

To speed up the computation, we keep only 50,000 records for training; on the full set everything takes too long.
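
The slices `data_full[:50000]` and `data_full[-4000:]` are disjoint only if the shuffled frame has at least 54,000 rows. A minimal sketch of that check, using a hypothetical helper `head_tail_overlap` that is not part of the notebook and assumes `n_tail > 0`:

```python
def head_tail_overlap(n_rows, n_head, n_tail):
    """Count rows shared by the slices [:n_head] and [-n_tail:] (n_tail > 0)."""
    head_end = min(n_head, n_rows)          # exclusive end of the head slice
    tail_start = max(n_rows - n_tail, 0)    # inclusive start of the tail slice
    return max(head_end - tail_start, 0)

no_leak = head_tail_overlap(60_000, 50_000, 4_000)   # 0 shared rows
leak = head_tail_overlap(52_000, 50_000, 4_000)      # 2000 shared rows
```

Calling it with `len(data_full)` before trusting the validation score would catch accidental leakage between the two subsets.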

In [4]:
data

| | id | category_id | Document |
|---|---|---|---|
| 0 | 1181186 | 12350 | Маска Masil для объёма волос 8ml /Корейская ко... |
| 1 | 304936 | 12917 | Силиконовый дорожный контейнер футляр чехол дл... |
| 2 | 816714 | 14125 | Тканевая маска для лица с муцином улитки, 100%... |
| 3 | 1437391 | 11574 | Браслеты из бисера Браслеты из бисера. Брасле... |
| 4 | 1234938 | 12761 | Бальзам HAUTE COUTURE LUXURY BLOND для блондир... |
| ... | ... | ... | ... |
| 49995 | 1291099 | 12488 | Комплект постельного белья Считалочка, 1.5 сп,... |
| 49996 | 992089 | 13816 | Патчи гля глаз кружевные LOVE Beauty Fox с му... |
| 49997 | 529715 | 13613 | Пресс для чеснока MODERNO, прорезиненная ручка... |
| 49998 | 750317 | 12228 | Косметичка полиэстер/ПВХ розовая 19,5*11,5*11,... |


### Encoder

Initialize the encoder object (the class that manages the computation of latent vector representations of texts, i.e. embeddings).

In [5]:
encoder = Encoder()

The following code reads the documents from the dataframe, tokenizes and lemmatizes them with the natasha package, then stores the lemmas in the encoder's own attribute, Encoder.texts. Lemmatization takes quite a while, so we save the result to disk:

In [6]:
encoder.lemmatize_data(data, document_col='Document', id_col='id')
encoder.save_lemms_data('50000_set_lemm', directory='Hierarhical_no_catboost')

Lemmatize: 100%|██████████| 50000/50000 [05:07<00:00, 162.49it/s]


Load the lemmas from disk:

In [7]:
encoder.load_lemms_data('50000_set_lemm', directory='Hierarhical_no_catboost')
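
The notebook repeats the same pattern for every slow step: compute, save to disk, load again later. A minimal sketch of a compute-once cache built on the standard library; `load_or_compute` and `slow_step` are hypothetical names and are not part of the `Encoder` class:

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it on first use."""
    path = Path(path)
    if path.exists():
        with path.open('rb') as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_step():
    calls.append(1)                      # record that the expensive step ran
    return {'lemmas': ['mask', 'hair']}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'cache' / 'lemmas.pickle'
    first = load_or_compute(cache_file, slow_step)    # computes and saves
    second = load_or_compute(cache_file, slow_step)   # loads from disk
```

With such a helper, rerunning the notebook skips the expensive step whenever the cached file already exists.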

Train a gensim LDA model (say, with 64 topics so that it runs faster) and save it to disk right away, because training takes a long time:

In [8]:
encoder.fit_lda_model(num_topics=64, passes=5, iterations=2)
encoder.save_lda_model('50000_set_model_64', directory='Hierarhical_no_catboost')

Load the model from disk:

In [9]:
encoder.load_lda_model('50000_set_model_64', directory='Hierarhical_no_catboost')

Load the pretrained navec model (downloaded from its official repository).

In [10]:
encoder.load_navec_model('navec_hudlit_v1_12B_500K_300d_100q.tar')

If needed, compute and save the matrix that reduces the dimensionality of the word2vec embeddings (for example, to 128 vectors).

In [11]:
encoder.calc_PCA_matrix(dim=128, sample_size=10000)
encoder.save_pca_matrix('50000_set_PCA_128.pickle', directory='Hierarhical_no_catboost')

Load the matrix for reducing the dimensionality of the word2vec embeddings. The reduction is there to improve performance; to switch it off, pass the encoder a 300x300 identity matrix.

In [12]:
encoder.load_pca_matrix('50000_set_PCA_128.pickle', directory='Hierarhical_no_catboost')

Alternatively, pass an identity matrix to disable dimensionality reduction.

In [13]:
#encoder.PCA_matrix=np.eye(300)

Check the shape:

In [14]:
encoder.PCA_matrix.shape

(128, 300)
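
A matrix of shape (128, 300) maps a 300-dimensional word vector to 128 dimensions when applied as a matrix-vector product, while a 300x300 identity matrix returns the vector unchanged. The sketch below assumes this is how the encoder applies `PCA_matrix`. It uses toy sizes and plain Python lists to stay dependency-free; with NumPy the product is simply `matrix @ vector`. The names `project` and `identity` are hypothetical.

```python
def project(matrix, vector):
    """Multiply a (k, d) matrix, given as a list of rows, by a d-dimensional vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def identity(n):
    """Build an n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

vector = [0.5, -1.0, 2.0, 4.0]             # toy 4-dimensional "word vector"
reduce_4_to_2 = [[1.0, 0.0, 0.0, 0.0],     # toy (2, 4) projection matrix
                 [0.0, 0.5, 0.5, 0.0]]

reduced = project(reduce_4_to_2, vector)   # length 2
unchanged = project(identity(4), vector)   # equal to the input vector
```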

Using the encoder's built-in method, build a dictionary of product embeddings of the form {good_id(int) : embedding(np.array)}. We pass in the encoder functions we are interested in: LDA and word2vec. The exponential weighting parameter for the word2vec embeddings is alpha=0.25.

In [15]:
encoders=[lambda texts: encoder.lda_encoder(texts),
          lambda texts: encoder.doc2vec_encoder(texts, alpha=0.25)]
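
The notebook does not show how `alpha` enters `doc2vec_encoder`. One possible reading, stated here purely as an assumption, is that word vectors are averaged with weights that decay exponentially with word position, so that the first words of a document count more. A minimal sketch of that idea with a hypothetical function `exp_weighted_mean`:

```python
import math

def exp_weighted_mean(word_vectors, alpha):
    """Average vectors with weights exp(-alpha * position); alpha=0 is a plain mean."""
    weights = [math.exp(-alpha * i) for i in range(len(word_vectors))]
    total = sum(weights)
    dim = len(word_vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, word_vectors)) / total
            for k in range(dim)]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # toy 2-d vectors for three words
plain = exp_weighted_mean(vectors, alpha=0.0)    # every word weighs the same
decayed = exp_weighted_mean(vectors, alpha=0.25) # the first word weighs most
```

Under this assumption, a larger `alpha` shifts the document vector toward its opening words.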

In [16]:
embeddings_dict = encoder.make_embeddings_dict(encoders=encoders)
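
How `make_embeddings_dict` combines the outputs of several encoders is not shown in the notebook. A natural choice, assumed in this sketch, is to concatenate them per document; with 64 LDA topics and 128 reduced word-vector dimensions that would give 192-dimensional embeddings. `build_embeddings` and the toy encoders are hypothetical stand-ins.

```python
def build_embeddings(texts_by_id, encoders):
    """Map each id to the concatenation of every encoder's vector for its text."""
    ids = list(texts_by_id)
    texts = [texts_by_id[i] for i in ids]
    # Each encoder takes the full list of texts and returns one vector per text.
    per_encoder = [enc(texts) for enc in encoders]
    return {
        good_id: [x for vectors in per_encoder for x in vectors[row]]
        for row, good_id in enumerate(ids)
    }

toy_encoders = [
    lambda texts: [[float(len(t))] for t in texts],    # 1-d: text length
    lambda texts: [[float(t.count('a')), float(t.count('b'))] for t in texts],
]
embeddings = build_embeddings({101: 'abba', 202: 'bar'}, toy_encoders)
```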

Save the embeddings dictionary and, when needed, load the saved copy:

In [17]:
path = os.path.join(Path(".").parent, 'Hierarhical_no_catboost', '50000_set_embs_dict.pickle')
with open(path, 'wb') as f:
    pickle.dump(embeddings_dict, f)

In [18]:
path = os.path.join(Path(".").parent, 'Hierarhical_no_catboost', '50000_set_embs_dict.pickle')
with open(path, 'rb') as f:
    embeddings_dict = pickle.load(f)

### Category tree

Initialize the category tree, CategoryTree(). This class stores all nodes and the information needed for training, and it implements the algorithms for populating the tree and for traversing it at inference time to determine a product's category.
Add the nodes from the categories_tree.csv table, then add the products from the training set.

In [19]:
cat_tree = CategoryTree()
cat_tree.add_nodes_from_df(cat_tree_df, parent_id_col='parent_id', title_col='title')
cat_tree.add_goods_from_df(data, category_id_col='category_id', good_id_col='id')

Write the embeddings into the category tree (each node's embedding is computed as the average of the embeddings of the documents that fall into that node):

In [20]:
cat_tree.update_embeddings(embeddings_dict)
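
The averaging step can be sketched as follows. The sketch assumes that an inner node collects the documents of its whole subtree, which the notebook does not state explicitly; `parent_of`, `goods_in_leaf` and `node_embeddings` are hypothetical names, not the `CategoryTree` API.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def node_embeddings(parent_of, goods_in_leaf, embeddings):
    """Average, for every node, the embeddings of the goods in its subtree."""
    collected = {}
    for leaf, goods in goods_in_leaf.items():
        node = leaf
        while node is not None:                   # walk up to the root
            collected.setdefault(node, []).extend(embeddings[g] for g in goods)
            node = parent_of.get(node)
    return {node: mean_vector(vecs) for node, vecs in collected.items()}

parent_of = {1: None, 10: 1, 11: 1}               # root 1 with leaves 10 and 11
goods_in_leaf = {10: ['a', 'b'], 11: ['c']}
embeddings = {'a': [1.0, 0.0], 'b': [3.0, 0.0], 'c': [2.0, 6.0]}
result = node_embeddings(parent_of, goods_in_leaf, embeddings)
```

Nodes that receive no goods get no entry in this sketch, which is one reason to also use the category descriptions.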

Mix the embeddings of the nodes' own descriptions (titles) into the node embeddings.

In [21]:
cat_tree.mix_in_description_embs(lambda titles: encoder.get_embeddings(titles, encoders=encoders), weight=5)
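
The exact meaning of `weight=5` is not documented here. One plausible interpretation, used in this sketch as an explicit assumption, is that the description embedding counts as `weight` extra documents in the node average. `mix_in` is a hypothetical helper.

```python
def mix_in(node_mean, n_docs, description_vec, weight):
    """Weighted mean where the description counts as `weight` extra documents."""
    total = n_docs + weight
    return [(n_docs * m + weight * d) / total
            for m, d in zip(node_mean, description_vec)]

# A node with one document is dominated by its description ...
sparse_node = mix_in([1.0, 0.0], n_docs=1, description_vec=[0.0, 1.0], weight=5)
# ... while a node with many documents barely moves.
busy_node = mix_in([1.0, 0.0], n_docs=995, description_vec=[0.0, 1.0], weight=5)
```

Under this reading, the description acts as a prior that matters mostly for categories with few training products.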

### Classifier

Initialize the classifier object, which manages the computation of the probabilities that a product belongs to a node (predict_proba). CatBoost is no longer needed, so we simply do not initialize it.

In [22]:
classifier = Classifier(tol=0.1, max_iter=50)
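
With `max_iter=50`, a per-node solver can stop before converging and emit one warning per fitted node, which floods the cell output. One way to keep the output readable is to record the warnings and report a count instead. The sketch uses only the standard `warnings` module; `fit_node` is a hypothetical stand-in for a per-node fit, not the library's code.

```python
import warnings

def fit_node(node_id, max_iter):
    """Stand-in for a per-node fit that warns when it hits the iteration limit."""
    if max_iter < 100:
        warnings.warn(f'node {node_id}: iteration limit reached', RuntimeWarning)
    return node_id

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')      # record every warning, no deduplication
    fitted = [fit_node(n, max_iter=50) for n in range(5)]

n_unconverged = sum(issubclass(w.category, RuntimeWarning) for w in caught)
print(f'{n_unconverged} of {len(fitted)} node fits hit the iteration limit')
```

With scikit-learn the relevant category is `sklearn.exceptions.ConvergenceWarning`. Counting the warnings only tidies the output; the remedies the message itself suggests are raising `max_iter` or scaling the features.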

Train the local weights (a logistic regression model in each node of the tree). Afterwards we save the tree, because this step takes very long, and load it back when needed.

In [23]:
cat_tree.fit_local_weights(classifier, embeddings_dict, C=0.05, reg_count_power=0.5)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Save the tree with the computed weights and, when needed, load it back.

In [24]:
cat_tree.save_tree('50000_set_tree.pickle', directory='Hierarhical_no_catboost')

In [25]:
cat_tree.load_tree('50000_set_tree.pickle', directory='Hierarhical_no_catboost')

### Testing the model

#### Testing on the training set

Build the array of embeddings for testing.

In [26]:
begin_example = 0
end_example = 3000
train_documents = data.Document.tolist()[begin_example:end_example]
train_target = data.category_id.tolist()[begin_example:end_example]
embs = encoder.get_embeddings(train_documents, encoders=encoders)

Search the catalog for a category for each test example.

In [27]:
pred_leafs = []
for i in tqdm.tqdm(range(len(embs)), total=len(embs)):
    pred_leafs.append(cat_tree.choose_leaf(embs[i], classifier))

100%|██████████| 3000/3000 [00:11<00:00, 259.95it/s]
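
`choose_leaf` walks the catalog for a single embedding. Its internals are not shown in the notebook; a common scheme, assumed here, is greedy top-down descent: at each node, score the children and move to the best one until a leaf is reached. `children`, `centroids`, `dot_score` and `greedy_leaf` are hypothetical, and the dot-product score stands in for the per-node classifier probabilities.

```python
def greedy_leaf(root, children, score, embedding):
    """Descend from the root, always moving to the highest-scoring child."""
    node = root
    while children.get(node):             # stop when the node has no children
        node = max(children[node], key=lambda child: score(child, embedding))
    return node

children = {1: [10, 11], 10: [100, 101]}  # toy catalog; 11, 100, 101 are leaves
centroids = {10: [1.0, 0.0], 11: [0.0, 1.0], 100: [1.0, 0.2], 101: [1.0, -0.2]}

def dot_score(node, embedding):
    """Toy score: dot product between the node centroid and the embedding."""
    return sum(c * e for c, e in zip(centroids[node], embedding))

leaf_a = greedy_leaf(1, children, dot_score, [0.9, 0.3])   # goes 1 -> 10 -> 100
leaf_b = greedy_leaf(1, children, dot_score, [0.1, 0.8])   # goes 1 -> 11
```

A greedy walk is fast, but a wrong turn near the root cannot be corrected further down the tree.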


In [28]:
print(f'Train set hF1={cat_tree.hF1_score(train_target, pred_leafs):.3f}') 

Train set hF1=0.838
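
hF1 is a hierarchical F1 score. The exact variant implemented by `hF1_score` is not shown; the sketch below follows a common definition in which every category is expanded to the set containing itself and all of its ancestors, and precision and recall are micro-averaged over these sets. The function names and the toy tree are hypothetical.

```python
def ancestors(node, parent_of):
    """Return the node together with all of its ancestors."""
    path = set()
    while node is not None:
        path.add(node)
        node = parent_of.get(node)
    return path

def hierarchical_f1(true_leaves, pred_leaves, parent_of):
    """Micro-averaged hierarchical F1 over ancestor sets."""
    overlap = n_pred = n_true = 0
    for t, p in zip(true_leaves, pred_leaves):
        t_set, p_set = ancestors(t, parent_of), ancestors(p, parent_of)
        overlap += len(t_set & p_set)
        n_pred += len(p_set)
        n_true += len(t_set)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_pred, overlap / n_true
    return 2 * precision * recall / (precision + recall)

# Top-level categories 10 and 20 have no parent.
parent_of = {10: None, 20: None, 100: 10, 101: 10, 200: 20}
score = hierarchical_f1([100, 100, 200], [100, 101, 100], parent_of)
```

In the toy data, the second prediction lands in the right top-level category but the wrong leaf and still earns partial credit, which is the point of a hierarchical metric.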


#### Testing on the held-out set

Build the array of embeddings for testing.

In [29]:
begin_example = 0
end_example = 3000
valid_documents = data_valid.Document.tolist()[begin_example:end_example]
valid_target = data_valid.category_id.tolist()[begin_example:end_example]
embs_valid = encoder.get_embeddings(valid_documents, encoders=encoders)

Search the catalog for a category for each test example.

In [30]:
pred_leafs_valid = []
for i in tqdm.tqdm(range(len(embs_valid)), total=len(embs_valid)):
    pred_leafs_valid.append(cat_tree.choose_leaf(embs_valid[i], classifier))

100%|██████████| 3000/3000 [00:11<00:00, 258.07it/s]


In [31]:
print(f'Validation hF1={cat_tree.hF1_score(valid_target, pred_leafs_valid):.3f}') 

Validation hF1=0.809


In this notebook the hyperparameters and the sample size were chosen so that the computations run relatively fast. With good hyperparameters and the full dataset, we managed to reach hF1=0.86, which is significantly below the baseline.