<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for Wikishop with BERT

The Wikishop («Викишоп») online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
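For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion-matrix counts; the function name `f1_from_counts` and the counts are hypothetical and serve only to illustrate the 0.75 target.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0  # without true positives, precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed.
score = f1_from_counts(tp=80, fp=20, fn=30)
print(f"F1 = {score:.4f}, meets target: {score >= 0.75}")
```

Because true negatives do not enter the formula, a model that labels every comment as non-toxic scores 0 on this metric even though its accuracy would be high on imbalanced data.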

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

### Imports

In [None]:
# !pip install pymystem3 -q
# !pip install transformers -q
# !pip install emoji -q
# !pip install imblearn -q
# !pip install sentence_transformers -q

In [4]:
# Standard library
import logging
import warnings
import re
import random
import os

# Scientific and analytics libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Gradient boosting libraries
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Text processing
import emoji

# PyTorch and Transformers
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Subset
from transformers import BertTokenizer, BertConfig, BertModel
import transformers
from sentence_transformers import SentenceTransformer

# Hyperparameter optimization
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Other useful tools
from tqdm import tqdm, notebook
tqdm.pandas()

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

# Suppress warnings
warnings.filterwarnings("ignore")


In [5]:
# # --- BERT ---
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# config = BertConfig.from_pretrained("bert-base-uncased")
# model = BertModel.from_pretrained("bert-base-uncased", config=config)

# # device
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)
# model.eval()

# # torch.compile: ONLY if supported
# if device.type == "cuda":
#     try:
#         model = torch.compile(model)
#         logger.info("torch.compile активен")
#     except RuntimeError as e:
#         logger.warning(f"torch.compile отключен: {e}")

# logger.info(device)

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device=device
)

logger.info(f"SentenceTransformer on {device}")

2026-01-07 22:43:19 [INFO] sentence_transformers.SentenceTransformer: Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2026-01-07 22:43:34 [INFO] __main__: SentenceTransformer on cuda


### Constants

In [None]:
MAX_LENGTH = 256
BATCH_SIZE = 64
CV = 3
N_OPTUNA = 2
TEST_SIZE = 0.2
RANDOM_STATE = 20
N_EPOCHS = 10000
EARLY_STOP = 5

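def seed_everything(seed=RANDOM_STATE):
    """Seed the Python, NumPy and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Benchmark mode may pick different cuDNN algorithms between runs,
    # so it is disabled in favor of deterministic kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True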

seed_everything()

In [7]:
# data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv', index_col=[0])
df = pd.read_csv('./data/toxic_comments.csv', index_col=[0])
display(df.head())

|   | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


### Text preparation

In [8]:
def bert_clean(text: str) -> str:
    """
    Clean text before encoding:
    - convert to lowercase,
    - remove links,
    - remove @mentions and #hashtags,
    - collapse runs of whitespace into a single space,
    - strip leading and trailing whitespace.

    Emoji are not removed here.
    """
    if not isinstance(text, str):
        return ""
    
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# apply to the dataframe
df['text_clean'] = df['text'].progress_apply(bert_clean)

100%|██████████| 159292/159292 [00:02<00:00, 54486.27it/s]
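To see what the cleaning step does to one comment, here is a small self-contained sketch that applies the same regular expressions to a made-up string. The name `clean_comment` and the sample text are hypothetical; the patterns are copied from `bert_clean`.

```python
import re


def clean_comment(text: str) -> str:
    """Standalone copy of the cleaning rules used in bert_clean."""
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # links
    text = re.sub(r"@\w+", "", text)      # @mentions
    text = re.sub(r"#\w+", "", text)      # #hashtags
    text = re.sub(r"\s+", " ", text)      # newlines, tabs, repeated spaces
    return text.strip()


sample = "Hey @Admin,\nsee https://example.com/page  #vandalism   NOW!"
cleaned = clean_comment(sample)
print(cleaned)  # hey , see now!
```

Note the stray space before the comma: removing a mention leaves the surrounding punctuation in place, which is usually harmless for a subword tokenizer but worth knowing about when inspecting cleaned text.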


In [9]:
display(df.sample(5))

|   | text | toxic | text_clean |
|---|---|---|---|
| 74297 | Adoption?\nHi, Katie. I am also interested in ... | 0 | adoption? hi, katie. i am also interested in a... |
| 53437 | "\n\n WP:APPLE's Backlog Elimination Drive is ... | 0 | " wp:apple's backlog elimination drive is over... |
| 68821 | Your recent removal of some text at Global Pos... | 0 | your recent removal of some text at global pos... |
| 14003 | No, we don't need anyone to say explicitly tha... | 0 | no, we don't need anyone to say explicitly tha... |
| 155348 | Piss off, she is an ignorant bitch. | 1 | piss off, she is an ignorant bitch. |


In [10]:
df_train, df_test = train_test_split(
    df,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df["toxic"]
)

# reset the indices
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

logger.info(f"Train size: {len(df_train)}, Test size: {len(df_test)}")

2026-01-07 22:43:38 [INFO] __main__: Train size: 127433, Test size: 31859


In [11]:
# def bert_mean_embeddings(df_part, tokenizer, model, device, batch_size, max_length, desc):
#     all_embeddings = []

#     model.eval()

#     for i in tqdm(range(0, len(df_part), batch_size), desc=desc):
#         batch_texts = df_part['text_clean'].iloc[i:i + batch_size].tolist()

#         encoded = tokenizer(
#             batch_texts,
#             padding=True,          # pad to the longest sequence in the batch, not to max_length
#             truncation=True,
#             max_length=max_length,
#             return_tensors='pt'
#         )

#         input_ids = encoded["input_ids"].to(device)
#         attention_mask = encoded["attention_mask"].to(device)

#         with torch.no_grad():
#             outputs = model(
#                 input_ids=input_ids,
#                 attention_mask=attention_mask,
#                 return_dict=True
#             )

#             # CLS embedding
#             embeddings = outputs.last_hidden_state[:, 0, :]

#             # L2 normalization (important for linear models)
#             embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

#         all_embeddings.append(embeddings.cpu().numpy())

#     return np.vstack(all_embeddings)

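def get_sentence_embeddings(df_part, model, batch_size, desc):
    """
    Encode the `text_clean` column into L2-normalized sentence embeddings.

    `desc` is accepted for compatibility with the earlier BERT helper
    but is not used: `model.encode` shows its own progress bar.
    """
    return model.encode(
        df_part["text_clean"].tolist(),
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True
    )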

In [12]:
# paths for caching the embeddings
train_emb_path = "./data/X_train_emb.npy"
test_emb_path = "./data/X_test_emb.npy"
train_labels_path = "./data/y_train.npy"
test_labels_path = "./data/y_test.npy"

# ===== TRAIN =====
if os.path.exists(train_emb_path) and os.path.exists(train_labels_path):
    X_train = np.load(train_emb_path)
    y_train = np.load(train_labels_path)
    logger.info(f"Loaded cached TRAIN embeddings: {X_train.shape}")
else:
    X_train = get_sentence_embeddings(
        df_train,
        model,
        BATCH_SIZE,
        desc="Train embeddings"
    )
    y_train = df_train["toxic"].values
    np.save(train_emb_path, X_train)
    np.save(train_labels_path, y_train)
    logger.info(f"Generated and saved TRAIN embeddings: {X_train.shape}")

# ===== TEST =====
if os.path.exists(test_emb_path) and os.path.exists(test_labels_path):
    X_test = np.load(test_emb_path)
    y_test = np.load(test_labels_path)
    logger.info(f"Loaded cached TEST embeddings: {X_test.shape}")
else:
    X_test = get_sentence_embeddings(
        df_test,
        model,
        BATCH_SIZE,
        desc="Test embeddings"
    )
    y_test = df_test["toxic"].values
    np.save(test_emb_path, X_test)
    np.save(test_labels_path, y_test)
    logger.info(f"Generated and saved TEST embeddings: {X_test.shape}")

Batches: 100%|██████████| 1992/1992 [00:39<00:00, 50.56it/s] 
2026-01-07 22:44:21 [INFO] __main__: Generated and saved TRAIN embeddings: (127433, 384)
Batches: 100%|██████████| 498/498 [00:09<00:00, 49.87it/s] 
2026-01-07 22:44:31 [INFO] __main__: Generated and saved TEST embeddings: (31859, 384)
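The embeddings above are requested with `normalize_embeddings=True`, so each 384-dimensional vector has unit L2 norm. A minimal pure-Python sketch of what that implies, using two made-up 2-dimensional vectors (the helper names `l2_normalize` and `dot` are hypothetical):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(dot(a, a))  # squared length of a normalized vector, approximately 1
print(dot(a, b))  # for unit vectors the dot product is the cosine similarity
```

With all rows on the same scale, the regularization strength of a linear classifier acts evenly across samples, which is one reason the commented-out `StandardScaler` step may be unnecessary here.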


## Training

In [13]:
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

In [14]:
# Check the class balance
logger.info(f"Class distribution in train: {np.bincount(y_train)}")
logger.info(f"Share of the positive class: {y_train.mean():.3f}")

2026-01-07 22:44:31 [INFO] __main__: Class distribution in train: [114484  12949]
2026-01-07 22:44:31 [INFO] __main__: Share of the positive class: 0.102
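With roughly one toxic comment in ten, the models below are trained with `class_weight="balanced"`. scikit-learn documents this heuristic as `n_samples / (n_classes * np.bincount(y))`; the sketch below applies that formula to the class counts from the log above, in plain Python.

```python
# Class counts taken from the training log: [non-toxic, toxic]
counts = [114484, 12949]

n_samples = sum(counts)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = [n_samples / (n_classes * c) for c in counts]

for label, (count, weight) in enumerate(zip(counts, weights)):
    print(f"class {label}: count={count}, weight={weight:.4f}")

# Each class contributes the same total weight to the loss.
print([round(c * w, 1) for c, w in zip(counts, weights)])
```

The toxic class ends up weighted about nine times more heavily than the non-toxic class, which tends to raise recall at the cost of precision.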


In [18]:
best_models = {}

models_to_run = ["LogisticRegression", "RidgeClassifier", "LGBM"]

for model_name in models_to_run:

    logger.info(f"===== START MODEL: {model_name} =====")

    def objective(trial):

        # ===== model and hyperparameters =====
        if model_name == "LogisticRegression":
            clf = LogisticRegression(
                C=trial.suggest_float("C", 1e-3, 10.0, log=True),
                max_iter=2000,
                class_weight="balanced",
                n_jobs=-1,
                solver="lbfgs"
            )

        elif model_name == "RidgeClassifier":
            clf = RidgeClassifier(
                alpha=trial.suggest_float("alpha", 1e-3, 10.0, log=True),
                class_weight="balanced"
            )

        elif model_name == "LGBM":
            clf = LGBMClassifier(
                n_estimators=trial.suggest_int("n_estimators", 100, 600),
                learning_rate=trial.suggest_float("learning_rate", 1e-2, 0.2, log=True),
                num_leaves=trial.suggest_int("num_leaves", 31, 255),
                min_child_samples=trial.suggest_int("min_child_samples", 5, 50),
                subsample=trial.suggest_float("subsample", 0.7, 1.0),
                colsample_bytree=trial.suggest_float("colsample_bytree", 0.7, 1.0),
                class_weight="balanced",
                n_jobs=-1,
                random_state=RANDOM_STATE,
                verbose=-1
            )

        cv = StratifiedKFold(
            n_splits=CV,
            shuffle=True,
            random_state=RANDOM_STATE
        )

        f1_scores = []

        for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
            X_tr, X_val = X_train[train_idx], X_train[val_idx]
            y_tr, y_val = y_train[train_idx], y_train[val_idx]

            clf.fit(X_tr, y_tr)

            if hasattr(clf, "predict_proba"):
                y_val_proba = clf.predict_proba(X_val)[:, 1]
                prec, rec, _ = precision_recall_curve(y_val, y_val_proba)
                f1 = np.max(2 * prec * rec / (prec + rec + 1e-9))
            else:
                f1 = f1_score(y_val, clf.predict(X_val))

            f1_scores.append(float(f1))

            # ===== PRUNER (MUST BE INSIDE THE FOLD LOOP) =====
            trial.report(np.mean(f1_scores), step=fold_idx)
            if trial.should_prune():
                logger.info(
                    f"[{model_name}] Trial {trial.number} PRUNED | "
                    f"Mean F1={np.mean(f1_scores):.4f}"
                )
                raise optuna.TrialPruned()

        mean_f1 = float(np.mean(f1_scores))

        logger.info(
            f"[{model_name}] Trial {trial.number} | "
            f"Mean F1={mean_f1:.4f}"
        )

        return mean_f1

    study = optuna.create_study(
        direction="maximize",
        sampler=TPESampler(seed=RANDOM_STATE),
        pruner=MedianPruner(
            n_startup_trials=2,
            n_warmup_steps=1,
            interval_steps=1
        )
    )

    study.optimize(objective, n_trials=N_OPTUNA)

    logger.info(
        f"===== END MODEL: {model_name} | "
        f"Best F1={study.best_value:.4f} ====="
    )

    best_models[model_name] = {
        "best_trial": study.best_trial,
        "best_value": study.best_value
    }


2026-01-07 22:53:44 [INFO] __main__: ===== START MODEL: LogisticRegression =====
2026-01-07 22:53:52 [INFO] __main__: [LogisticRegression] Trial 0 | Mean F1=0.7175
2026-01-07 22:53:59 [INFO] __main__: [LogisticRegression] Trial 1 | Mean F1=0.7283
2026-01-07 22:54:08 [INFO] __main__: [LogisticRegression] Trial 2 | Mean F1=0.7274
2026-01-07 22:54:15 [INFO] __main__: [LogisticRegression] Trial 3 | Mean F1=0.7263
2026-01-07 22:54:24 [INFO] __main__: [LogisticRegression] Trial 4 PRUNED | Mean F1=0.6385
2026-01-07 22:54:29 [INFO] __main__: [LogisticRegression] Trial 5 PRUNED | Mean F1=0.7219
2026-01-07 22:54:33 [INFO] __main__: [LogisticRegression] Trial 6 PRUNED | Mean F1=0.6952
2026-01-07 22:54:37 [INFO] __main__: [LogisticRegression] Trial 7 PRUNED | Mean F1=0.7109
2026-01-07 22:54:42 [INFO] __main__: [LogisticRegression] Trial 8 PRUNED | Mean F1=0.7195
2026-01-07 22:54:46 [INFO] __main__: [LogisticRegression] Trial 9 PRUNED | Mean F1=0.6652
2026-01-07 22:54:46 [INFO] __main__: ===== END 

KeyboardInterrupt: 
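For models with `predict_proba`, the objective above reports the maximum F1 along the precision-recall curve, that is, the F1 at the best threshold for each validation fold. The pure-Python sketch below shows the same idea on made-up labels and scores; the function names and data are hypothetical and the code is an illustration, not the notebook's implementation.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold and return (best F1, threshold)."""
    return max((f1_at_threshold(y_true, scores, t), t) for t in sorted(set(scores)))


# Made-up validation labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.55, 0.90]

best_f1, best_t = best_f1_threshold(y_true, scores)
default_f1 = f1_at_threshold(y_true, scores, 0.5)
print(f"F1 at 0.5: {default_f1:.3f}; best F1: {best_f1:.3f} at threshold {best_t}")
```

Since the threshold is tuned on the same fold that is scored, this number is an optimistic estimate, and it is not directly comparable with the fixed-threshold F1 used for `RidgeClassifier`. One option is to pick the threshold on validation data and apply that fixed value when scoring the held-out test set.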

In [None]:
# # neural network
# # --- data for CV (train ONLY) ---
# X_cv = X_train.astype(np.float32)
# y_cv = y_train.astype(np.int64)

# # --- simple MLP ---
# class MLP(nn.Module):
#     def __init__(self, input_dim, hidden_dim1, hidden_dim2, dropout):
#         super().__init__()
#         self.model = nn.Sequential(
#             nn.Linear(input_dim, hidden_dim1),
#             nn.ReLU(),
#             nn.Dropout(dropout),
#             nn.Linear(hidden_dim1, 2)
#         )

#     def forward(self, x):
#         return self.model(x)

# # --- train a single model with early stopping ---
# def train_model(
#     model,
#     train_dataset,
#     val_dataset,
#     lr,
#     batch_size,
#     n_epochs=N_EPOCHS,
#     patience=EARLY_STOP
# ):
#     train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
#     val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

#     # IMPORTANT: class weights are computed from the train part ONLY
#     y_train_local = train_dataset.tensors[1].numpy()
#     class_counts = np.bincount(y_train_local)
#     class_weights = class_counts.sum() / (2 * class_counts)
#     class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

#     criterion = nn.CrossEntropyLoss(weight=class_weights)
#     optimizer = optim.Adam(model.parameters(), lr=lr)

#     best_val_f1 = 0.0
#     best_state = None
#     trigger = 0

#     model.to(device)

#     for epoch in range(n_epochs):
#         # --- train ---
#         model.train()
#         for xb, yb in train_loader:
#             xb, yb = xb.to(device), yb.to(device)
#             optimizer.zero_grad()
#             logits = model(xb)
#             loss = criterion(logits, yb)
#             loss.backward()
#             optimizer.step()

#         # --- validation ---
#         model.eval()
#         all_preds, all_labels = [], []

#         with torch.no_grad():
#             for xb, yb in val_loader:
#                 xb, yb = xb.to(device), yb.to(device)
#                 logits = model(xb)
#                 preds = torch.argmax(logits, dim=1)
#                 all_preds.append(preds.cpu())
#                 all_labels.append(yb.cpu())

#         all_preds = torch.cat(all_preds)
#         all_labels = torch.cat(all_labels)
#         val_f1 = f1_score(all_labels, all_preds)

#         if val_f1 > best_val_f1:
#             best_val_f1 = val_f1
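#             # clone the tensors: state_dict() returns references that later training steps overwrite
#             best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}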
#             trigger = 0
#         else:
#             trigger += 1
#             if trigger >= patience:
#                 break

#     model.load_state_dict(best_state)
#     return model, best_val_f1

# # --- Optuna objective with CV (NO LEAKAGE) ---
# def objective_mlp(trial):
#     hidden_dim1 = trial.suggest_int("hidden_dim1", 512, 1024)
#     hidden_dim2 = trial.suggest_int("hidden_dim2", 256, 512)
#     dropout = trial.suggest_float("dropout", 0.05, 0.15)
#     lr = trial.suggest_float("lr", 1e-3, 1e-2, log=True)
#     batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])

#     cv = StratifiedKFold(n_splits=CV, shuffle=True, random_state=RANDOM_STATE)
#     f1_scores = []

#     for train_idx, val_idx in cv.split(X_cv, y_cv):
#         train_ds = TensorDataset(
#             torch.from_numpy(X_cv[train_idx]),
#             torch.from_numpy(y_cv[train_idx])
#         )
#         val_ds = TensorDataset(
#             torch.from_numpy(X_cv[val_idx]),
#             torch.from_numpy(y_cv[val_idx])
#         )

#         model = MLP(
#             input_dim=X_cv.shape[1],
#             hidden_dim1=hidden_dim1,
#             hidden_dim2=hidden_dim2,
#             dropout=dropout
#         )

#         _, val_f1 = train_model(
#             model,
#             train_ds,
#             val_ds,
#             lr=lr,
#             batch_size=batch_size
#         )

#         f1_scores.append(val_f1)

#     return float(np.mean(f1_scores))

# # --- run Optuna ---
# study = optuna.create_study(direction="maximize")
# study.optimize(objective_mlp, n_trials=N_OPTUNA)

# best_params = study.best_trial.params
# logger.info(f"Best parameters: {best_params}")
# logger.info(f"F1 CV (train): {study.best_value:.4f}")

# # --- final model: training on the FULL train set ---
# final_model = MLP(
#     input_dim=X_train.shape[1],
#     hidden_dim1=best_params["hidden_dim1"],
#     hidden_dim2=best_params["hidden_dim2"],
#     dropout=best_params["dropout"]
# )

# full_train_dataset = TensorDataset(
#     torch.from_numpy(X_train.astype(np.float32)),
#     torch.from_numpy(y_train.astype(np.int64))
# )

# # train is reused as the validation set, for early stopping only
# final_model, _ = train_model(
#     final_model,
#     full_train_dataset,
#     full_train_dataset,
#     lr=best_params["lr"],
#     batch_size=best_params["batch_size"]
# )

# # --- evaluation on test (holdout) ---
# test_dataset = TensorDataset(
#     torch.from_numpy(X_test.astype(np.float32)),
#     torch.from_numpy(y_test.astype(np.int64))
# )

# test_loader = DataLoader(
#     test_dataset,
#     batch_size=best_params["batch_size"],
#     shuffle=False
# )

# final_model.eval()
# all_preds, all_labels = [], []

# with torch.no_grad():
#     for xb, yb in test_loader:
#         xb, yb = xb.to(device), yb.to(device)
#         logits = final_model(xb)
#         preds = torch.argmax(logits, dim=1)
#         all_preds.append(preds.cpu())
#         all_labels.append(yb.cpu())

# all_preds = torch.cat(all_preds)
# all_labels = torch.cat(all_labels)

# test_f1 = f1_score(all_labels, all_preds)

# logger.info(f"F1 test: {test_f1:.4f}")

## Conclusions

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are arranged in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written