## Tools
- Python
- approximate KNN (nmslib/faiss/scann)
- models from Hugging Face (Transformers)

[Ranking of Russian-language sentence encoders](https://habr.com/ru/articles/669674/)

In [5]:
import pandas as pd
from transformers import AutoTokenizer, AutoModel
# ipywidgets is also needed here:
# pip install ipywidgets
import torch
import nmslib
# works on Python 3.8.17
# https://stackoverflow.com/questions/71242919/pip-install-results-in-this-error-cl-exe-failed-with-exit-code-2

## Goal
- implement text classification based on a BERT-like model and KNN on the [Russian Intents Dataset](https://www.kaggle.com/datasets/constantinwerner/qa-intents-dataset-university-domain) from Kaggle
- learn to build text classifiers for a large number of small classes made up of short texts

## Expected result
- code for building a vector search index
- logic for determining the class from proximity to training examples (by the nearest neighbor, by the top-N nearest, etc.)

## Dataset

In [6]:
train = pd.read_table(
    'qa-intents-dataset-university-domain/dataset_train.tsv',
    header=None,names=['txt','c'])

In [7]:
train

| | txt | c |
|---|---|---|
| 0 | мне нужна справка | statement_general |
| 1 | оформить справку | statement_general |
| 2 | взять справку | statement_general |
| 3 | справку как получить | statement_general |
| 4 | справку ммф где получаться | statement_general |
| ... | ... | ... |
| 13225 | тупой | smalltalk_abuse |
| 13226 | робот бестолковый | smalltalk_abuse |
| 13227 | несообразительный | smalltalk_abuse |
| 13228 | ты бестолковый | smalltalk_abuse |


In [8]:
test = pd.read_table(
    'qa-intents-dataset-university-domain/dataset_test.tsv',
    header=None,names=['txt','c'])

## Tokenizer

In [11]:
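def mean_pooling(model_output, attention_mask):
    # per-token vectors, shape (batch, seq_len, hidden)
    token_embeddings = model_output['last_hidden_state']
    # broadcast the mask so that padding positions contribute zeros
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # number of real tokens per sequence, clamped to avoid division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask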
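
To see what `mean_pooling` computes, here is a minimal sketch that mirrors the same masked average in NumPy on a toy batch. The name `masked_mean` and the array values are made up for illustration; the notebook itself uses the PyTorch version above.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """NumPy mirror of mean_pooling: average token vectors, ignoring padding."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 2-dimensional embeddings.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean(token_embeddings, attention_mask)
print(pooled)  # the padded vector [100, 100] does not affect the first row
```

The first row averages only the two real tokens, so the large values in the padded position are ignored.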

## KNN
### Code for building the vector search index

In [90]:
def get_embeding(MODEL_NAME, train, test):
    # load the tokenizer and the encoder for the given Hugging Face model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)

    # tokenize all texts at once; inputs longer than 32 tokens are truncated
    train_token = tokenizer(train['txt'].tolist(), padding=True, return_tensors='pt', max_length=32, truncation=True)
    test_token = tokenizer(test['txt'].tolist(), padding=True, return_tensors='pt', max_length=32, truncation=True)

    # encode without tracking gradients, then mean-pool token vectors into sentence embeddings
    with torch.no_grad():
        train_embeding = mean_pooling(model(**train_token), train_token['attention_mask'])
        test_embeding = mean_pooling(model(**test_token), test_token['attention_mask'])
    return train_embeding, test_embeding
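
`get_embeding` tokenizes and encodes each split in a single forward pass, which may run out of memory with larger datasets or bigger models. One possible workaround is to encode the texts in chunks and concatenate the results. The sketch below shows only the chunking step in plain Python; `iter_batches` and the batch size of 4 are hypothetical and not part of the notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for train['txt'].tolist()
texts = [f"text {i}" for i in range(10)]
batches = list(iter_batches(texts, batch_size=4))
print([len(b) for b in batches])
```

Inside `get_embeding`, each batch would go through `tokenizer(...)` and `mean_pooling(model(**tokens), tokens['attention_mask'])`, and the per-batch tensors could then be joined with `torch.cat`.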

In [86]:
def get_index(train_embeding):
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(train_embeding, ids=list(range(len(train_embeding))))
    index.createIndex({'post': 2}, print_progress=True)
    return index
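
HNSW is an approximate index, so it can be useful to compare it with an exact search on a small sample. The following sketch finds exact nearest neighbors by cosine distance, defined as 1 minus cosine similarity, which is assumed here to match what nmslib's `cosinesimil` space reports. The function name `exact_cosine_knn` and the toy vectors are illustrative only.

```python
import numpy as np

def exact_cosine_knn(train_vectors, query, k=1):
    """Return (ids, distances) of the k training vectors closest to `query`.

    Assumes no vector has zero norm.
    """
    train = np.asarray(train_vectors, dtype=float)
    q = np.asarray(query, dtype=float)
    # Normalize so that the dot product equals cosine similarity.
    train_norm = train / np.linalg.norm(train, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    distances = 1.0 - train_norm @ q_norm
    ids = np.argsort(distances)[:k]
    return ids, distances[ids]

train_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids, dists = exact_cosine_knn(train_vectors, [0.9, 0.1], k=2)
print(ids, dists)
```

If the ids returned by `index.knnQuery` for the same query differ often from the exact ones, the index parameters may need tuning.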

### Class prediction based on proximity to training examples, here by the single nearest neighbor

In [88]:
def get_result(index, test_embeding, test, train):

    right = 0
    # number of correct predictions

    for n, test_item in enumerate(test_embeding):
        target = test.iloc[n].c
        # take the single nearest training example
        id_, dist = index.knnQuery(test_item, k=1)
        # accumulate a similarity score (1 - distance) per class
        dict_ = {}
        for i, d in zip(id_, dist):
            label = train.iloc[i].c
            dict_[label] = dict_.get(label, 0) + 1 - d
        predicted = max(dict_, key=dict_.get)

        if target == predicted:
            right += 1

    # share of test examples whose predicted class matches the target
    return right / len(test_embeding)
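
`get_result` queries with `k=1`, so the similarity-weighted vote inside it always has a single candidate. The expected result also mentions deciding by the top-N nearest neighbors. A minimal sketch of that voting step in isolation, using the same `1 - distance` weight, might look like this; `vote_by_similarity` and the sample labels and distances are hypothetical.

```python
from collections import defaultdict

def vote_by_similarity(labels, distances):
    """Pick the label whose neighbors have the largest total (1 - distance)."""
    scores = defaultdict(float)
    for label, dist in zip(labels, distances):
        scores[label] += 1.0 - dist
    return max(scores, key=scores.get)

# Labels and cosine distances of the 3 nearest neighbors of one query.
labels = ["statement_general", "smalltalk_abuse", "smalltalk_abuse"]
distances = [0.10, 0.40, 0.45]

predicted = vote_by_similarity(labels, distances)
print(predicted)
```

In this toy case the nearest neighbor alone would give `statement_general`, while the two slightly farther neighbors outvote it. To use this with the index, one would call `index.knnQuery(test_item, k=N)` and map the returned ids to `train` labels before voting.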

## sentence-transformers/paraphrase-xlm-r-multilingual-v1

In [91]:
MODEL_NAME = 'sentence-transformers/paraphrase-xlm-r-multilingual-v1'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [92]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8856172140430351

## sentence-transformers/all-MiniLM-L6-v2

In [93]:
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [94]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.9003397508493771

## intfloat/multilingual-e5-large

In [62]:
MODEL_NAME = 'intfloat/multilingual-e5-large'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [64]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.942242355605889


## inkoziev/sbert_synonymy

In [65]:
MODEL_NAME = 'inkoziev/sbert_synonymy'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [67]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.9161947904869762


## cointegrated/rubert-tiny2

In [95]:
MODEL_NAME = 'cointegrated/rubert-tiny2'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [96]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8844847112117781

## ai-forever/sbert_large_nlu_ru

In [72]:
MODEL_NAME = 'ai-forever/sbert_large_nlu_ru'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [75]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8573046432616082


## xlm-roberta-base

In [76]:
MODEL_NAME = 'xlm-roberta-base'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [78]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8731596828992072


## DeepPavlov/rubert-base-cased-sentence

In [79]:
MODEL_NAME = 'DeepPavlov/rubert-base-cased-sentence'
train_embeding, test_embeding = get_embeding(MODEL_NAME, train, test)

In [81]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8958097395243488


## bert-base-multilingual-cased

In [83]:
from transformers import BertTokenizer, BertModel

MODEL_NAME = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertModel.from_pretrained(MODEL_NAME)

train_token = tokenizer(train['txt'].tolist(), padding=True, return_tensors='pt', max_length = 32, truncation=True) 
test_token = tokenizer(test['txt'].tolist(), padding=True, return_tensors='pt', max_length = 32, truncation=True) 

with torch.no_grad():
    train_embeding = mean_pooling(model(**train_token), train_token['attention_mask'])
    test_embeding = mean_pooling(model(**test_token), test_token['attention_mask'])

In [85]:
index = get_index(train_embeding)
result = get_result(index, test_embeding, test, train)
result

0.8720271800679502


## Summary
- the base models (bert-base-multilingual-cased, xlm-roberta-base) give a middling result (accuracy about 0.87)
- models built specifically for Russian (ai-forever/sbert_large_nlu_ru, cointegrated/rubert-tiny2) are not among the best either (0.857 and 0.884)
- a good result (0.916) came from an independently authored model (inkoziev/sbert_synonymy) built specifically for comparing the similarity of Russian sentences
- the best result (accuracy 0.942) came from the model with a large number of parameters (intfloat/multilingual-e5-large)
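
For a quick side-by-side comparison, the accuracies reported in the sections above can be collected and sorted. The numbers are copied from the cell outputs and rounded to three decimals; the variable names are illustrative.

```python
accuracy = {
    "sentence-transformers/paraphrase-xlm-r-multilingual-v1": 0.886,
    "sentence-transformers/all-MiniLM-L6-v2": 0.900,
    "intfloat/multilingual-e5-large": 0.942,
    "inkoziev/sbert_synonymy": 0.916,
    "cointegrated/rubert-tiny2": 0.884,
    "ai-forever/sbert_large_nlu_ru": 0.857,
    "xlm-roberta-base": 0.873,
    "DeepPavlov/rubert-base-cased-sentence": 0.896,
    "bert-base-multilingual-cased": 0.872,
}

# Sort from the most to the least accurate model.
ranking = sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{acc:.3f}  {name}")
```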