**Project for "Wikishop" with ST and BERT**

**Project description:** 
The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation. 
Train a model to classify comments as positive or negative. A dataset labeled for edit toxicity is available.
Build a model with an F1 score of at least 0.75. 


**Research goal:** Train a model to classify comments as positive or negative. 

**Research plan:**
- data preprocessing
- creating embeddings
- training models: LR, RFC, LGBMC
- selecting the best model and computing its F1 score


In [1]:
%pip install scikit-learn -q -U 
%pip install sentence-transformers lightgbm optuna transformers tqdm -q
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import warnings
import torch
import optuna

from typing import List

from tqdm import tqdm

from sentence_transformers import SentenceTransformer
from transformers.tokenization_utils_base import BatchEncoding
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from lightgbm import LGBMClassifier
from lightgbm import early_stopping

In [3]:
TEST_SIZE = 0.2
RANDOM_STATE = 42
BATCH_SIZE = 32

In [4]:
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

## Initial data preprocessing

In [5]:
data = pd.read_csv(r'../datasets/toxic_comments.csv')

data.info()
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [6]:
data = data.drop('Unnamed: 0', axis=1)

In [7]:
data.head()

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [8]:
data.duplicated().sum()

0

In [9]:
print(data['toxic'].value_counts())
print()
print(data['toxic'].value_counts(normalize=True))

0    143106
1     16186
Name: toxic, dtype: int64

0    0.898388
1    0.101612
Name: toxic, dtype: float64


We checked the data for duplicates and dropped the problematic `Unnamed: 0` column, which duplicated the index (its values are not all in sequential order).<br>
There is also a strong class imbalance: one class makes up almost 90% of the whole sample.
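
To put the imbalance in perspective, here is a rough sketch (not part of the original notebook) of the F1 score a trivial classifier would get on these class counts if it labeled every comment as toxic. The helper name `f1_from_counts` is introduced here purely for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Binary F1 for the positive class from confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Class counts reported by value_counts() above
n_negative, n_positive = 143106, 16186

# "Always toxic" baseline: every toxic comment is found (no false negatives),
# and every non-toxic comment becomes a false positive.
baseline_f1 = f1_from_counts(tp=n_positive, fp=n_negative, fn=0)
print(f"{baseline_f1:.4f}")
```

The baseline lands at roughly 0.18, so the 0.75 target cannot be reached by exploiting the class ratio alone.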

In [10]:
def stratified_sample(df: pd.DataFrame, target_col: str, total_n: int) -> pd.DataFrame:
    return (
        df
        .groupby(target_col, group_keys=False)
        .apply(lambda x: x.sample(int(len(x) / len(df) * total_n), random_state=RANDOM_STATE))
        .reset_index(drop=True)
    )

In [11]:
mini_df = stratified_sample(data, 'toxic', 2_501)
print(mini_df['toxic'].value_counts(normalize=True))
mini_df.shape

0    0.8984
1    0.1016
Name: toxic, dtype: float64


(2500, 2)

Since this is an educational project, we shrink the sample while preserving the class proportions so that the resource-intensive models run faster.
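
The sample size is requested as `2_501` while the resulting frame has 2500 rows. A minimal sketch of the arithmetic, assuming each class share is truncated with `int()` exactly as in `stratified_sample`; the helper `stratum_sizes` is a hypothetical name used only here.

```python
from typing import Dict


def stratum_sizes(class_counts: Dict[int, int], total_n: int) -> Dict[int, int]:
    """Rows drawn per class when each proportional share is truncated with int()."""
    n_rows = sum(class_counts.values())
    return {
        label: int(count / n_rows * total_n)
        for label, count in class_counts.items()
    }


# Class counts from the full dataset
counts = {0: 143106, 1: 16186}

sizes = stratum_sizes(counts, 2_501)
print(sizes, sum(sizes.values()))
```

Because truncation always rounds down, the per-class sizes can add up to slightly less than the requested total; under the same rule, asking for `2_500` would yield 2499 rows.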

## Applying the SentenceTransformer model

In [12]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device in use: {device}')

Device in use: cuda


In [13]:
model_st = SentenceTransformer("all-MiniLM-L6-v2", device=device)

embeddings = model_st.encode(
                    mini_df['text'].to_list(),
                    convert_to_numpy=True,
                    normalize_embeddings=True,
                    show_progress_bar=True
                    )

print(f'Embeddings shape: {embeddings.shape}')

Batches:   0%|          | 0/79 [00:00<?, ?it/s]

Embeddings shape: (2500, 384)
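
With `normalize_embeddings=True`, `encode` returns vectors scaled to unit L2 norm, so the dot product of two embeddings equals their cosine similarity. The sketch below illustrates that property on toy two-dimensional vectors in plain Python; the helpers `l2_normalize` and `dot` are illustrative and are not part of the sentence-transformers API.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

# Both vectors now have length 1, so dot(u, v) is their cosine similarity.
print(dot(u, u), dot(u, v))
```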


In [14]:
temp_df = pd.DataFrame(
    np.hstack((embeddings, mini_df['toxic'].to_numpy() \
                                        .reshape(-1, 1)))
    )
print(f'Number of duplicates: {temp_df.duplicated().sum()}')
temp_df.drop_duplicates(inplace=True)
print(f'Number of duplicates after removal: {temp_df.duplicated().sum()}')

X_clean = temp_df.iloc[:, :-1].to_numpy()
y_clean = temp_df.iloc[:, -1].to_numpy()

Number of duplicates: 0
Number of duplicates after removal: 0


In [15]:
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X_clean, y_clean, 
    random_state=RANDOM_STATE, 
    stratify=y_clean, test_size=TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, 
    random_state=RANDOM_STATE, 
    stratify=y_train_full, test_size=TEST_SIZE)

In [16]:
print(X_train.shape, X_test.shape, X_val.shape)

(1600, 384) (500, 384) (400, 384)
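
The shapes above follow from applying the same `TEST_SIZE` twice: first to the whole sample, then to the remaining training part. A small sketch of that arithmetic, assuming the held-out part is rounded up (which is how I understand `train_test_split` to treat a fractional `test_size`); `two_stage_sizes` is a hypothetical helper.

```python
import math
from typing import Tuple


def two_stage_sizes(n_rows: int, test_size: float) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes after two consecutive hold-out splits."""
    n_test = math.ceil(n_rows * test_size)        # first split: hold out the test set
    n_train_full = n_rows - n_test
    n_val = math.ceil(n_train_full * test_size)   # second split: hold out validation
    return n_train_full - n_val, n_val, n_test


n_train, n_val, n_test = two_stage_sizes(2500, 0.2)
print(n_train, n_val, n_test)
print(n_train / 2500, n_val / 2500, n_test / 2500)
```

In effect the data is divided 64% / 16% / 20% between training, validation, and test.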


### Model training

#### LR

In [17]:
def obj_logreg(trial):
    params = {
        'random_state': RANDOM_STATE,
        'C': trial.suggest_float('C', 0.1, 2)
    }

    model = LogisticRegression(**params).fit(X_train, y_train)

    y_preds = model.predict(X_val)
    f1 = f1_score(y_val, y_preds, average='binary')
    return f1

In [18]:
study_logreg = optuna.create_study(direction='maximize')
study_logreg.optimize(obj_logreg, n_trials=10)

[I 2025-07-05 21:36:50,580] A new study created in memory with name: no-name-b2ee461f-7d97-48b4-b652-7c10ab9e5c4e
[I 2025-07-05 21:36:50,602] Trial 0 finished with value: 0.6333333333333333 and parameters: {'C': 1.480688432722877}. Best is trial 0 with value: 0.6333333333333333.
[I 2025-07-05 21:36:50,626] Trial 1 finished with value: 0.6101694915254238 and parameters: {'C': 1.305990380470312}. Best is trial 0 with value: 0.6333333333333333.
[I 2025-07-05 21:36:50,642] Trial 2 finished with value: 0.5614035087719298 and parameters: {'C': 0.9102678223617086}. Best is trial 0 with value: 0.6333333333333333.
[I 2025-07-05 21:36:50,658] Trial 3 finished with value: 0.09302325581395349 and parameters: {'C': 0.24600445978795052}. Best is trial 0 with value: 0.6333333333333333.
[I 2025-07-05 21:36:50,671] Trial 4 finished with value: 0.4230769230769231 and parameters: {'C': 0.5656964163128269}. Best is trial 0 with value: 0.6333333333333333.
[I 2025-07-05 21:36:50,688] Trial 5 finished with v

#### RFC

In [19]:
def obj_rfc(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 200),
        'class_weight': 'balanced',
        'criterion': trial.suggest_categorical('criterion', ['log_loss', 'gini', 'entropy']),
        'random_state': RANDOM_STATE,
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 25),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
        'max_depth': trial.suggest_int('max_depth', 10, 40),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    }

    model = RandomForestClassifier(**params).fit(X_train, y_train)

    preds = model.predict(X_val)

    return f1_score(y_val, preds, average='binary')

In [20]:
study_rfc = optuna.create_study(direction='maximize')
study_rfc.optimize(obj_rfc, n_trials=25)

[I 2025-07-05 21:36:50,824] A new study created in memory with name: no-name-65a8a808-22ae-4781-a2c5-a13f67d8d84c
[I 2025-07-05 21:36:51,919] Trial 0 finished with value: 0.17777777777777778 and parameters: {'n_estimators': 101, 'criterion': 'log_loss', 'min_samples_split': 5, 'min_samples_leaf': 5, 'max_depth': 40, 'max_features': 'log2'}. Best is trial 0 with value: 0.17777777777777778.
[I 2025-07-05 21:36:54,475] Trial 1 finished with value: 0.38461538461538464 and parameters: {'n_estimators': 124, 'criterion': 'log_loss', 'min_samples_split': 14, 'min_samples_leaf': 8, 'max_depth': 20, 'max_features': 'sqrt'}. Best is trial 1 with value: 0.38461538461538464.
[I 2025-07-05 21:36:56,628] Trial 2 finished with value: 0.2857142857142857 and parameters: {'n_estimators': 102, 'criterion': 'entropy', 'min_samples_split': 15, 'min_samples_leaf': 6, 'max_depth': 35, 'max_features': 'sqrt'}. Best is trial 1 with value: 0.38461538461538464.
[I 2025-07-05 21:36:59,490] Trial 3 finished with va

#### LGBMC

In [21]:
def obj_lgbmc(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'random_state': RANDOM_STATE,
        'n_estimators': 10000, 
        'verbose': -1,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 20, 3000),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.4, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.4, 1.0)
    }

    model = LGBMClassifier(**params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[early_stopping(stopping_rounds=100)],
    )
    preds = model.predict(X_val)

    return f1_score(y_val, preds, average='binary')
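
The objective sets `n_estimators` to 10000 and relies on `early_stopping(stopping_rounds=100)` to pick the actual number of boosting rounds. Note that stopping is driven by validation AUC (`'metric': 'auc'`), while the value returned to Optuna is F1. Below is a simplified, library-free sketch of patience-based stopping; it illustrates the idea and is not LightGBM's implementation, and `best_iteration` is a name made up for this example.

```python
from typing import Sequence, Tuple


def best_iteration(scores: Sequence[float], patience: int) -> Tuple[int, float]:
    """Return (1-based best round, best score) under patience-based early stopping."""
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(scores, start=1):
        if score > best_score:
            # New best validation score: remember it and reset the wait counter
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop training
            break
    return best_round, best_score


# Toy validation AUC per boosting round
toy_auc = [0.80, 0.85, 0.90, 0.89, 0.88, 0.91, 0.90, 0.90, 0.90, 0.95]
print(best_iteration(toy_auc, patience=3))
```

In the toy run, training stops at round 9 and keeps round 6; the higher score at round 10 is never reached, which is the usual trade-off of a patience window.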



In [22]:
study_lgbmc = optuna.create_study(direction='maximize')
study_lgbmc.optimize(obj_lgbmc, n_trials=25)

[I 2025-07-05 21:37:33,854] A new study created in memory with name: no-name-21034ecf-73ff-483d-9062-f0a4b5d84eca


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:34,736] Trial 0 finished with value: 0.6129032258064516 and parameters: {'learning_rate': 0.2964428483809943, 'num_leaves': 887, 'max_depth': 13, 'min_child_samples': 70, 'subsample': 0.46878563849778887, 'colsample_bytree': 0.4087034875590403}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[57]	valid_0's auc: 0.920443
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:36,147] Trial 1 finished with value: 0.5517241379310345 and parameters: {'learning_rate': 0.044079906937748506, 'num_leaves': 173, 'max_depth': 13, 'min_child_samples': 99, 'subsample': 0.9094544712763762, 'colsample_bytree': 0.5293593591015506}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[275]	valid_0's auc: 0.914532
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[76]	valid_0's auc: 0.908689


[I 2025-07-05 21:37:37,828] Trial 2 finished with value: 0.41509433962264153 and parameters: {'learning_rate': 0.17502362324414514, 'num_leaves': 1884, 'max_depth': 5, 'min_child_samples': 40, 'subsample': 0.5630939256781816, 'colsample_bytree': 0.9326520075572211}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[111]	valid_0's auc: 0.909097


[I 2025-07-05 21:37:39,716] Trial 3 finished with value: 0.5517241379310345 and parameters: {'learning_rate': 0.14822877748572488, 'num_leaves': 1875, 'max_depth': 15, 'min_child_samples': 72, 'subsample': 0.7253926420193648, 'colsample_bytree': 0.6286142733308644}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:42,195] Trial 4 finished with value: 0.5573770491803278 and parameters: {'learning_rate': 0.13855291015198, 'num_leaves': 1077, 'max_depth': 13, 'min_child_samples': 84, 'subsample': 0.7095237558825863, 'colsample_bytree': 0.9923123796632726}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[258]	valid_0's auc: 0.900265
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:44,921] Trial 5 finished with value: 0.5517241379310345 and parameters: {'learning_rate': 0.0503503619257042, 'num_leaves': 706, 'max_depth': 15, 'min_child_samples': 76, 'subsample': 0.6497296511410293, 'colsample_bytree': 0.8550140052814973}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[255]	valid_0's auc: 0.91032
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[91]	valid_0's auc: 0.902915


[I 2025-07-05 21:37:46,299] Trial 6 finished with value: 0.6 and parameters: {'learning_rate': 0.2681333201878062, 'num_leaves': 1817, 'max_depth': 12, 'min_child_samples': 39, 'subsample': 0.9663723584977324, 'colsample_bytree': 0.4098093076701013}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[65]	valid_0's auc: 0.895102


[I 2025-07-05 21:37:47,636] Trial 7 finished with value: 0.5263157894736842 and parameters: {'learning_rate': 0.29572093838316466, 'num_leaves': 2288, 'max_depth': 13, 'min_child_samples': 57, 'subsample': 0.8457470609841795, 'colsample_bytree': 0.5214554866669106}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[151]	valid_0's auc: 0.913921


[I 2025-07-05 21:37:49,116] Trial 8 finished with value: 0.5263157894736842 and parameters: {'learning_rate': 0.06370456387266822, 'num_leaves': 2832, 'max_depth': 4, 'min_child_samples': 64, 'subsample': 0.44426374321088435, 'colsample_bytree': 0.40357262280082923}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[203]	valid_0's auc: 0.90604


[I 2025-07-05 21:37:50,811] Trial 9 finished with value: 0.5263157894736842 and parameters: {'learning_rate': 0.04054319889821671, 'num_leaves': 2904, 'max_depth': 3, 'min_child_samples': 8, 'subsample': 0.7966184743614002, 'colsample_bytree': 0.6103514411607986}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:52,738] Trial 10 finished with value: 0.4528301886792453 and parameters: {'learning_rate': 0.22585191399913523, 'num_leaves': 992, 'max_depth': 8, 'min_child_samples': 14, 'subsample': 0.40432526922380013, 'colsample_bytree': 0.7796282301807699}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[207]	valid_0's auc: 0.894558
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:53,875] Trial 11 finished with value: 0.4482758620689655 and parameters: {'learning_rate': 0.2872040768897032, 'num_leaves': 1442, 'max_depth': 10, 'min_child_samples': 33, 'subsample': 0.9873714516744122, 'colsample_bytree': 0.4018069297950196}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[44]	valid_0's auc: 0.89877
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:55,105] Trial 12 finished with value: 0.576271186440678 and parameters: {'learning_rate': 0.24582020581488417, 'num_leaves': 349, 'max_depth': 10, 'min_child_samples': 40, 'subsample': 0.5608299214460762, 'colsample_bytree': 0.4859313225480934}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[339]	valid_0's auc: 0.920171
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:56,984] Trial 13 finished with value: 0.5357142857142857 and parameters: {'learning_rate': 0.23567085993632098, 'num_leaves': 1419, 'max_depth': 11, 'min_child_samples': 27, 'subsample': 0.5278834163513705, 'colsample_bytree': 0.6614047962157218}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[259]	valid_0's auc: 0.913377
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[88]	valid_0's auc: 0.925199


[I 2025-07-05 21:37:58,610] Trial 14 finished with value: 0.576271186440678 and parameters: {'learning_rate': 0.1961752787384082, 'num_leaves': 2125, 'max_depth': 12, 'min_child_samples': 51, 'subsample': 0.9735506952853414, 'colsample_bytree': 0.4634243817052158}. Best is trial 0 with value: 0.6129032258064516.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:37:59,771] Trial 15 finished with value: 0.5517241379310345 and parameters: {'learning_rate': 0.2767682600822392, 'num_leaves': 658, 'max_depth': 8, 'min_child_samples': 49, 'subsample': 0.6305470605977068, 'colsample_bytree': 0.5756072923188211}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[124]	valid_0's auc: 0.897683
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:00,915] Trial 16 finished with value: 0.6129032258064516 and parameters: {'learning_rate': 0.2605051157323687, 'num_leaves': 1129, 'max_depth': 7, 'min_child_samples': 91, 'subsample': 0.7828003096355203, 'colsample_bytree': 0.6959398134263958}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[114]	valid_0's auc: 0.922277
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:02,133] Trial 17 finished with value: 0.5666666666666667 and parameters: {'learning_rate': 0.11539643755193328, 'num_leaves': 1064, 'max_depth': 6, 'min_child_samples': 100, 'subsample': 0.7964311146636417, 'colsample_bytree': 0.7816785561299159}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[52]	valid_0's auc: 0.903458
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:03,389] Trial 18 finished with value: 0.5901639344262295 and parameters: {'learning_rate': 0.2023893956894213, 'num_leaves': 621, 'max_depth': 6, 'min_child_samples': 86, 'subsample': 0.5026635240653048, 'colsample_bytree': 0.7499142016704148}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[117]	valid_0's auc: 0.90407
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:04,147] Trial 19 finished with value: 0.49122807017543857 and parameters: {'learning_rate': 0.2544658814864953, 'num_leaves': 57, 'max_depth': 8, 'min_child_samples': 87, 'subsample': 0.6213249740525749, 'colsample_bytree': 0.7173605890384489}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[22]	valid_0's auc: 0.931177
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:05,439] Trial 20 finished with value: 0.5901639344262295 and parameters: {'learning_rate': 0.29891183083290357, 'num_leaves': 1105, 'max_depth': 7, 'min_child_samples': 69, 'subsample': 0.7703372380106982, 'colsample_bytree': 0.8265218072582425}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[139]	valid_0's auc: 0.901896
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:38:06,927] Trial 21 finished with value: 0.5263157894736842 and parameters: {'learning_rate': 0.2684364038403322, 'num_leaves': 1632, 'max_depth': 11, 'min_child_samples': 23, 'subsample': 0.8670403258214722, 'colsample_bytree': 0.45270324306146625}. Best is trial 0 with value: 0.6129032258064516.


Early stopping, best iteration is:
[89]	valid_0's auc: 0.902099
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[187]	valid_0's auc: 0.913173


[I 2025-07-05 21:38:08,810] Trial 22 finished with value: 0.6229508196721312 and parameters: {'learning_rate': 0.2143938987851304, 'num_leaves': 2561, 'max_depth': 14, 'min_child_samples': 60, 'subsample': 0.9276473302618113, 'colsample_bytree': 0.5581082025258487}. Best is trial 22 with value: 0.6229508196721312.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[78]	valid_0's auc: 0.904545


[I 2025-07-05 21:38:10,647] Trial 23 finished with value: 0.5 and parameters: {'learning_rate': 0.20698451633742382, 'num_leaves': 2624, 'max_depth': 14, 'min_child_samples': 61, 'subsample': 0.8790194412382781, 'colsample_bytree': 0.5746419481325565}. Best is trial 22 with value: 0.6229508196721312.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[110]	valid_0's auc: 0.912018


[I 2025-07-05 21:38:12,844] Trial 24 finished with value: 0.5806451612903226 and parameters: {'learning_rate': 0.23143639751001902, 'num_leaves': 2367, 'max_depth': 14, 'min_child_samples': 76, 'subsample': 0.9249220355981248, 'colsample_bytree': 0.7037134496964397}. Best is trial 22 with value: 0.6229508196721312.


In [23]:
print(f'LR: {study_logreg.best_value:.4f}')
print(f'RFC: {study_rfc.best_value:.4f}')
print(f'LGBM: {study_lgbmc.best_value:.4f}')

LR: 0.6333
RFC: 0.6364
LGBM: 0.6230


As we can see, the models trained on SentenceTransformer embeddings did not perform well: their best validation F1 scores (0.62 to 0.64) are below the required 0.75. Most likely this is because the model was trained on tasks geared toward semantic similarity of sentences (Natural Language Inference, Semantic Textual Similarity), which makes its embeddings better suited to comparing sentences than to classifying the tone of messages.<br><br>
Therefore, we create embeddings with a fine-tuned BERT model.

## Tokenization and applying the BERT model

In [24]:
# If a GPU is available, create the device object and set it to cuda
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device in use:', device)

Device in use: cuda


In [25]:
tokenizer = AutoTokenizer.from_pretrained('unitary/toxic-bert')
model_bert = AutoModel.from_pretrained('unitary/toxic-bert')

model_bert.to(device) # move the model to the GPU if one is available
model_bert.eval() # switch the model to inference mode

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [26]:
# Dataset class that wraps a list of strings so DataLoader can batch it
class TextDataset(Dataset):
    def __init__(self, text: List[str]) -> None:
        self.text = text

    def __len__(self) -> int:
        return len(self.text)

    def __getitem__(self, index: int) -> str:
        return self.text[index]

def tokenize(batch: List[str]) -> BatchEncoding:
    # Pad to the longest text in the batch and truncate to the model's maximum input length
    return tokenizer(
        batch,
        return_tensors='pt',
        padding=True,
        truncation=True
    )

def get_embeddings(text: List[str], batch_size: int = BATCH_SIZE) -> np.ndarray:
    dataset = TextDataset(text)
    dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=tokenize)

    all_embeddings = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc='Extracting embeddings'):
            batch = batch.to(device)
            # Hidden state of the first ([CLS]) token of every sequence
            embeddings = model_bert(**batch).last_hidden_state[:, 0, :]

            # .cpu() is a no-op for tensors that are already on the CPU
            all_embeddings.append(embeddings.cpu())

    return torch.cat(all_embeddings, dim=0).numpy()

In [27]:
embeddings_bert = get_embeddings(mini_df['text'].to_list())

print('BERT embeddings shape:', embeddings_bert.shape)

Extracting embeddings: 100%|██████████| 79/79 [02:15<00:00,  1.71s/it]

BERT embeddings shape: (2500, 768)
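
The indexing `last_hidden_state[:, 0, :]` keeps, for every text in the batch, only the hidden vector of the first token (the `[CLS]` token for BERT tokenizers), which is why each comment ends up as a single 768-dimensional vector. A toy illustration with nested lists in place of tensors; `cls_pool` is a hypothetical helper and the numbers are made up.

```python
from typing import List

# Layout mirrors the tensor: [batch][token][hidden]
Batch = List[List[List[float]]]


def cls_pool(last_hidden_state: Batch) -> List[List[float]]:
    """Keep only the first token's vector of every sequence."""
    return [sequence[0] for sequence in last_hidden_state]


# Two sequences, three tokens each, hidden size 2
toy_hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]

pooled = cls_pool(toy_hidden)
print(pooled)
```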





In [28]:
X_train_full, X_test, y_train_full, y_test = train_test_split(
                                                embeddings_bert, mini_df['toxic'],
                                                random_state=RANDOM_STATE, 
                                                stratify=mini_df['toxic'], test_size=TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
                                    X_train_full, y_train_full,
                                    random_state=RANDOM_STATE,
                                    stratify=y_train_full, test_size=TEST_SIZE)

### Model training

In [29]:
study_logreg_bert = optuna.create_study(direction='maximize')
study_logreg_bert.optimize(obj_logreg, n_trials=10)

study_rfc_bert = optuna.create_study(direction='maximize')
study_rfc_bert.optimize(obj_rfc, n_trials=40)

study_lgbmc_bert = optuna.create_study(direction='maximize')
study_lgbmc_bert.optimize(obj_lgbmc, n_trials=40)

[I 2025-07-05 21:40:29,497] A new study created in memory with name: no-name-8f21abcb-8555-4343-80ff-0a0d26ee0ce5
[I 2025-07-05 21:40:29,597] Trial 0 finished with value: 0.9411764705882353 and parameters: {'C': 0.37826947792983645}. Best is trial 0 with value: 0.9411764705882353.
[I 2025-07-05 21:40:29,648] Trial 1 finished with value: 0.9411764705882353 and parameters: {'C': 0.4157276879938434}. Best is trial 0 with value: 0.9411764705882353.
[I 2025-07-05 21:40:29,701] Trial 2 finished with value: 0.9411764705882353 and parameters: {'C': 1.3279524711005537}. Best is trial 0 with value: 0.9411764705882353.
[I 2025-07-05 21:40:29,762] Trial 3 finished with value: 0.9411764705882353 and parameters: {'C': 1.8456196387850619}. Best is trial 0 with value: 0.9411764705882353.
[I 2025-07-05 21:40:29,821] Trial 4 finished with value: 0.9411764705882353 and parameters: {'C': 1.9531960411725946}. Best is trial 0 with value: 0.9411764705882353.
[I 2025-07-05 21:40:29,888] Trial 5 finished with 

Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[19]	valid_0's auc: 0.998098


[I 2025-07-05 21:41:09,015] Trial 0 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.2634543850813775, 'num_leaves': 1347, 'max_depth': 10, 'min_child_samples': 58, 'subsample': 0.9501748218594166, 'colsample_bytree': 0.9557901608980008}. Best is trial 0 with value: 0.9523809523809523.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[43]	valid_0's auc: 0.998369


[I 2025-07-05 21:41:11,550] Trial 1 finished with value: 0.9411764705882353 and parameters: {'learning_rate': 0.1263274979911734, 'num_leaves': 2161, 'max_depth': 9, 'min_child_samples': 86, 'subsample': 0.8038516859766771, 'colsample_bytree': 0.5693906357726732}. Best is trial 0 with value: 0.9523809523809523.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:12,452] Trial 2 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.2925611958442956, 'num_leaves': 737, 'max_depth': 3, 'min_child_samples': 67, 'subsample': 0.9666588498906995, 'colsample_bytree': 0.888240428312266}. Best is trial 0 with value: 0.9523809523809523.


Early stopping, best iteration is:
[10]	valid_0's auc: 0.998098
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:13,745] Trial 3 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.1892321658487759, 'num_leaves': 83, 'max_depth': 11, 'min_child_samples': 49, 'subsample': 0.7955882158837679, 'colsample_bytree': 0.640836909990516}. Best is trial 0 with value: 0.9523809523809523.


Early stopping, best iteration is:
[21]	valid_0's auc: 0.998302
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:15,572] Trial 4 finished with value: 0.0 and parameters: {'learning_rate': 0.017788200393072957, 'num_leaves': 813, 'max_depth': 10, 'min_child_samples': 91, 'subsample': 0.9438166955992039, 'colsample_bytree': 0.8889905617447076}. Best is trial 0 with value: 0.9523809523809523.


Early stopping, best iteration is:
[17]	valid_0's auc: 0.998471
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:18,753] Trial 5 finished with value: 0.9382716049382716 and parameters: {'learning_rate': 0.02104832460024622, 'num_leaves': 110, 'max_depth': 14, 'min_child_samples': 49, 'subsample': 0.5672415414226113, 'colsample_bytree': 0.8747062166329755}. Best is trial 0 with value: 0.9523809523809523.


Early stopping, best iteration is:
[69]	valid_0's auc: 0.998505
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:20,185] Trial 6 finished with value: 0.8947368421052632 and parameters: {'learning_rate': 0.09434657001322222, 'num_leaves': 352, 'max_depth': 8, 'min_child_samples': 12, 'subsample': 0.642601516808554, 'colsample_bytree': 0.4306121333106867}. Best is trial 0 with value: 0.9523809523809523.


Early stopping, best iteration is:
[8]	valid_0's auc: 0.998641
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[60]	valid_0's auc: 0.998369


[I 2025-07-05 21:41:23,721] Trial 7 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.11097156431537976, 'num_leaves': 2458, 'max_depth': 10, 'min_child_samples': 88, 'subsample': 0.5164292144507693, 'colsample_bytree': 0.8919736717475896}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:25,020] Trial 8 finished with value: 0.926829268292683 and parameters: {'learning_rate': 0.09000844686263894, 'num_leaves': 416, 'max_depth': 10, 'min_child_samples': 65, 'subsample': 0.47310307400665214, 'colsample_bytree': 0.4448716482760691}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[25]	valid_0's auc: 0.998166
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:41:26,049] Trial 9 finished with value: 0.9285714285714286 and parameters: {'learning_rate': 0.22628079177504973, 'num_leaves': 770, 'max_depth': 3, 'min_child_samples': 23, 'subsample': 0.5994819152837563, 'colsample_bytree': 0.9034499776671926}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[21]	valid_0's auc: 0.997894
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[101]	valid_0's auc: 0.998302


[I 2025-07-05 21:41:29,043] Trial 10 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.17239766356584701, 'num_leaves': 2857, 'max_depth': 6, 'min_child_samples': 99, 'subsample': 0.47711515405895, 'colsample_bytree': 0.786290902603123}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[27]	valid_0's auc: 0.99803


[I 2025-07-05 21:41:32,268] Trial 11 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.15955604326232883, 'num_leaves': 2941, 'max_depth': 6, 'min_child_samples': 98, 'subsample': 0.40208580936400806, 'colsample_bytree': 0.7646105444020879}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.998234


[I 2025-07-05 21:41:35,276] Trial 12 finished with value: 0.9285714285714286 and parameters: {'learning_rate': 0.1877374808266274, 'num_leaves': 2975, 'max_depth': 6, 'min_child_samples': 80, 'subsample': 0.515769487713045, 'colsample_bytree': 0.7765362757825128}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[81]	valid_0's auc: 0.998437


[I 2025-07-05 21:41:38,226] Trial 13 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.08420084449861258, 'num_leaves': 2371, 'max_depth': 6, 'min_child_samples': 100, 'subsample': 0.40942188991914125, 'colsample_bytree': 0.784316628267664}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[23]	valid_0's auc: 0.998709


[I 2025-07-05 21:41:41,409] Trial 14 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.15590599664319665, 'num_leaves': 2374, 'max_depth': 13, 'min_child_samples': 75, 'subsample': 0.6881885304550613, 'colsample_bytree': 0.9831445282541971}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[23]	valid_0's auc: 0.998234


[I 2025-07-05 21:41:43,681] Trial 15 finished with value: 0.9382716049382716 and parameters: {'learning_rate': 0.061885947026937706, 'num_leaves': 1888, 'max_depth': 7, 'min_child_samples': 41, 'subsample': 0.4957129068838566, 'colsample_bytree': 0.6818158031526547}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[26]	valid_0's auc: 0.998166


[I 2025-07-05 21:41:46,545] Trial 16 finished with value: 0.9285714285714286 and parameters: {'learning_rate': 0.21376963543156158, 'num_leaves': 2652, 'max_depth': 12, 'min_child_samples': 76, 'subsample': 0.5446979561774591, 'colsample_bytree': 0.8195223118239529}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[27]	valid_0's auc: 0.998403


[I 2025-07-05 21:41:49,222] Trial 17 finished with value: 0.9411764705882353 and parameters: {'learning_rate': 0.12201993871809648, 'num_leaves': 1664, 'max_depth': 15, 'min_child_samples': 92, 'subsample': 0.7582234138068462, 'colsample_bytree': 0.598489030103935}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[21]	valid_0's auc: 0.998437


[I 2025-07-05 21:41:51,209] Trial 18 finished with value: 0.9411764705882353 and parameters: {'learning_rate': 0.12522184923450466, 'num_leaves': 2630, 'max_depth': 4, 'min_child_samples': 85, 'subsample': 0.4652497438250021, 'colsample_bytree': 0.708975667554347}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[12]	valid_0's auc: 0.998369


[I 2025-07-05 21:41:53,509] Trial 19 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.240190800275787, 'num_leaves': 2012, 'max_depth': 8, 'min_child_samples': 35, 'subsample': 0.5976095011119135, 'colsample_bytree': 0.8325524884043297}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[14]	valid_0's auc: 0.997962


[I 2025-07-05 21:41:54,697] Trial 20 finished with value: 0.9285714285714286 and parameters: {'learning_rate': 0.1824362188030948, 'num_leaves': 1284, 'max_depth': 4, 'min_child_samples': 66, 'subsample': 0.6738687807799387, 'colsample_bytree': 0.5107917339531232}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[4]	valid_0's auc: 0.997588


[I 2025-07-05 21:41:56,701] Trial 21 finished with value: 0.926829268292683 and parameters: {'learning_rate': 0.2790522300397588, 'num_leaves': 1328, 'max_depth': 11, 'min_child_samples': 58, 'subsample': 0.905507452132824, 'colsample_bytree': 0.993472402018228}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[10]	valid_0's auc: 0.998268


[I 2025-07-05 21:41:58,866] Trial 22 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.2603957878880654, 'num_leaves': 1644, 'max_depth': 9, 'min_child_samples': 93, 'subsample': 0.8619119460720579, 'colsample_bytree': 0.9450397539276231}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[45]	valid_0's auc: 0.99769


[I 2025-07-05 21:42:00,851] Trial 23 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.2539504168696469, 'num_leaves': 1091, 'max_depth': 12, 'min_child_samples': 31, 'subsample': 0.7310408642631182, 'colsample_bytree': 0.9363864239141154}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[20]	valid_0's auc: 0.998505


[I 2025-07-05 21:42:03,948] Trial 24 finished with value: 0.925 and parameters: {'learning_rate': 0.05506779335705626, 'num_leaves': 1094, 'max_depth': 12, 'min_child_samples': 28, 'subsample': 0.7362915150144906, 'colsample_bytree': 0.8340235936704397}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[7]	valid_0's auc: 0.998369


[I 2025-07-05 21:42:06,943] Trial 25 finished with value: 0.926829268292683 and parameters: {'learning_rate': 0.21624298670474895, 'num_leaves': 2615, 'max_depth': 13, 'min_child_samples': 27, 'subsample': 0.6281556498028049, 'colsample_bytree': 0.7167241819230419}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[16]	valid_0's auc: 0.998302


[I 2025-07-05 21:42:10,885] Trial 26 finished with value: 0.963855421686747 and parameters: {'learning_rate': 0.14107110663479738, 'num_leaves': 2781, 'max_depth': 8, 'min_child_samples': 6, 'subsample': 0.44138164541607694, 'colsample_bytree': 0.8660421143184558}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[26]	valid_0's auc: 0.998369


[I 2025-07-05 21:42:13,353] Trial 27 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.17655801706391885, 'num_leaves': 2373, 'max_depth': 5, 'min_child_samples': 40, 'subsample': 0.5199128482132904, 'colsample_bytree': 0.9217579432236903}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[11]	valid_0's auc: 0.999117


[I 2025-07-05 21:42:15,346] Trial 28 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.24046109099701377, 'num_leaves': 1072, 'max_depth': 11, 'min_child_samples': 18, 'subsample': 0.7238067807348587, 'colsample_bytree': 0.7497801273932366}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[39]	valid_0's auc: 0.998098


[I 2025-07-05 21:42:17,693] Trial 29 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.2614559573059055, 'num_leaves': 2138, 'max_depth': 9, 'min_child_samples': 55, 'subsample': 0.5674503208111924, 'colsample_bytree': 0.9524697521916953}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[71]	valid_0's auc: 0.99769


[I 2025-07-05 21:42:19,587] Trial 30 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.2992936717509548, 'num_leaves': 1750, 'max_depth': 12, 'min_child_samples': 72, 'subsample': 0.8543437458926757, 'colsample_bytree': 0.8053736498288943}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:42:21,518] Trial 31 finished with value: 0.9534883720930233 and parameters: {'learning_rate': 0.24274112273498674, 'num_leaves': 1033, 'max_depth': 11, 'min_child_samples': 19, 'subsample': 0.7239020904719183, 'colsample_bytree': 0.7385988702917484}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[16]	valid_0's auc: 0.998302
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[18]	valid_0's auc: 0.998302


[I 2025-07-05 21:42:23,514] Trial 32 finished with value: 0.9647058823529412 and parameters: {'learning_rate': 0.24380322733148835, 'num_leaves': 1455, 'max_depth': 10, 'min_child_samples': 17, 'subsample': 0.7866419163424446, 'colsample_bytree': 0.6571944164916933}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[17]	valid_0's auc: 0.99803


[I 2025-07-05 21:42:25,704] Trial 33 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.27572631873968784, 'num_leaves': 1079, 'max_depth': 13, 'min_child_samples': 5, 'subsample': 0.654297800478449, 'colsample_bytree': 0.7586063100336431}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:42:28,188] Trial 34 finished with value: 0.9382716049382716 and parameters: {'learning_rate': 0.11166616222947393, 'num_leaves': 548, 'max_depth': 11, 'min_child_samples': 34, 'subsample': 0.7040215384395291, 'colsample_bytree': 0.8510549209561489}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[11]	valid_0's auc: 0.998369
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[12]	valid_0's auc: 0.998166


[I 2025-07-05 21:42:30,532] Trial 35 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.2050966977695905, 'num_leaves': 1218, 'max_depth': 10, 'min_child_samples': 13, 'subsample': 0.8289086016937657, 'colsample_bytree': 0.9262358689121787}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[33]	valid_0's auc: 0.998505


[I 2025-07-05 21:42:33,219] Trial 36 finished with value: 0.9411764705882353 and parameters: {'learning_rate': 0.1652613497658827, 'num_leaves': 1523, 'max_depth': 14, 'min_child_samples': 87, 'subsample': 0.7857787195802586, 'colsample_bytree': 0.8000529029795209}. Best is trial 7 with value: 0.9647058823529412.


Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:42:34,688] Trial 37 finished with value: 0.9523809523809523 and parameters: {'learning_rate': 0.20104907631313518, 'num_leaves': 947, 'max_depth': 9, 'min_child_samples': 43, 'subsample': 0.7590267477674737, 'colsample_bytree': 0.6192484446797765}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[14]	valid_0's auc: 0.997962
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:42:36,677] Trial 38 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.1428942141058671, 'num_leaves': 843, 'max_depth': 7, 'min_child_samples': 83, 'subsample': 0.9909313486730564, 'colsample_bytree': 0.8865733273281082}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[32]	valid_0's auc: 0.998641
Training until validation scores don't improve for 100 rounds


[I 2025-07-05 21:42:38,319] Trial 39 finished with value: 0.9397590361445783 and parameters: {'learning_rate': 0.27793877039612064, 'num_leaves': 650, 'max_depth': 11, 'min_child_samples': 61, 'subsample': 0.6115989006527659, 'colsample_bytree': 0.9091416657044236}. Best is trial 7 with value: 0.9647058823529412.


Early stopping, best iteration is:
[21]	valid_0's auc: 0.997282


In [30]:
# Best validation F1 found by each Optuna study (BERT embeddings)
print(f'Validation F1 for LR: {study_logreg_bert.best_value:.4f}')
print(f'Validation F1 for RFC: {study_rfc_bert.best_value:.4f}')
print(f'Validation F1 for LGBMC: {study_lgbmc_bert.best_value:.4f}')

Validation F1 for LR: 0.9412
Validation F1 for RFC: 0.9762
Validation F1 for LGBMC: 0.9647


Random Forest performed best on the validation set, with an F1 of 0.9762.
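The comparison can also be made programmatically rather than by eye. The snippet below is a minimal sketch: the `val_f1` dictionary is a hypothetical helper holding the three scores printed above, and in the notebook it could presumably be filled from each study's `best_value` instead of hard-coded numbers.

```python
# Validation F1 scores copied from the printed output above.
# `val_f1` is an illustrative name, not part of the original notebook.
val_f1 = {"LR": 0.9412, "RFC": 0.9762, "LGBMC": 0.9647}

# Pick the model whose validation F1 is highest.
best_name = max(val_f1, key=val_f1.get)
best_score = val_f1[best_name]

print(f"Best model on validation: {best_name} (F1 = {best_score:.4f})")
```

Keeping the scores in one mapping makes it harder for the stated winner to drift out of sync with the numbers if the studies are rerun.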

## Prediction on the test set

In [32]:
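# Refit Random Forest on the training set with the best hyperparameters found by Optuna
model_final = RandomForestClassifier(**study_rfc_bert.best_params).fit(X_train, y_train)

# Predict hard class labels for the held-out test set
test_preds = model_final.predict(X_test)

# F1 for the positive class (label 1)
f1_final = f1_score(y_test, test_preds, average='binary')
print(f'F1 on the test set: {f1_final:.4f}')

F1 on the test set: 0.9434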


The F1 score on the test set exceeds the required 0.75, so the task can be considered complete. The best model's test F1 is 0.94.
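For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below recomputes it from confusion-matrix counts and checks it against the project threshold. The function name `f1_from_counts`, the constant `F1_TARGET`, and the counts themselves are hypothetical and chosen only for illustration; they are not taken from the notebook's data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true/false positive and false negative counts."""
    if tp == 0:
        # With no true positives F1 is 0 (or undefined, treated as 0 by convention).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


F1_TARGET = 0.75  # quality threshold from the project description

# Hypothetical counts, for illustration only.
example_f1 = f1_from_counts(tp=20, fp=1, fn=1)
print(f"F1 = {example_f1:.4f}, meets target: {example_f1 >= F1_TARGET}")
```

Because F1 ignores true negatives, it is a stricter measure than accuracy when the positive class is rare, which is one reason to prefer it for a moderation task like this one.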

## Conclusion

- The data were loaded and preprocessed.
- The texts were converted to embeddings with SentenceTransformer, and three models were trained: LR, RFC, and LGBMC. The F1 score on the validation set was unsatisfactory.
- The data were then tokenized and converted to embeddings with a BERT model, and the same three models were trained. All three scored above 0.75 F1 on the validation set; Random Forest performed best.
- Random Forest was evaluated on the test set: F1 is 0.94, which exceeds the required 0.75.