<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for «Викишоп» with BERT

The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.
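
For reference, *F1* is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from raw labels in plain Python. The function name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the project; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 for the positive class from two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # no true positives: precision or recall is zero, so F1 is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# hypothetical toy labels: 3 toxic comments, 2 of them found, 1 false alarm
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Because toxic comments are a minority class, a model that predicts "not toxic" everywhere can score high accuracy while its F1 is zero, which is one reason the task is stated in terms of F1.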

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

### Imports

In [1]:
# !pip install pymystem3 -q
# !pip install transformers -q
# !pip install emoji -q

In [2]:
# Standard library
import logging
import warnings
import re
import random
import os

# Scientific and analytics libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Gradient boosting libraries
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Text processing
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from pymystem3 import Mystem
import emoji

# PyTorch and Transformers
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Subset
from transformers import BertTokenizer, BertConfig, BertModel
import transformers

# Hyperparameter optimization
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Other useful tools
from tqdm import tqdm, notebook
tqdm.pandas()

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

# Suppress warnings
warnings.filterwarnings("ignore")




In [3]:
# --- BERT ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
config = BertConfig.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)

# stopwords
nltk.download("stopwords")
english_stopwords = nltk_stopwords.words("english")

# device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# torch.compile: enable ONLY if supported
if device.type == "cuda":
    try:
        model = torch.compile(model)
        logger.info("torch.compile enabled")
    except RuntimeError as e:
        logger.warning(f"torch.compile disabled: {e}")

logger.info(device)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
2026-01-07 20:07:56 [INFO] __main__: cuda


### Constants

In [None]:
MAX_LENGTH = 512
BATCH_SIZE = 64
CV = 3
N_OPTUNA = 10
TEST_SIZE = 0.2
RANDOM_STATE = 20
N_EPOCHS = 10000
EARLY_STOP = 10

def seed_everything(seed=RANDOM_STATE):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything()

In [5]:
# data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv', index_col=[0])
df = pd.read_csv('./data/toxic_comments.csv', index_col=[0])
display(df.head())

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


### Text preparation

In [6]:
m = Mystem()
corpus = df['text'].values.astype('U')

In [7]:
def bert_clean(text: str) -> str:
    """
    Clean text for BERT:
    - remove links,
    - collapse runs of whitespace into a single space,
    - remove emoji,
    - convert to lowercase,
    - strip leading and trailing whitespace.
    """
    if not isinstance(text, str):
        return ""
    
    # remove links
    text = re.sub(r"http\S+", "", text)
    
    # collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r"\s+", " ", text)
    
    # remove emoji
    text = emoji.replace_emoji(text, replace="")
    
    # convert to lowercase
    text = text.lower()
    
    return text.strip()

# apply to the dataframe
df['text_clean'] = df['text'].progress_apply(bert_clean)

100%|██████████| 159292/159292 [00:42<00:00, 3780.41it/s]
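
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch applied to one made-up string. It is an approximation under stated assumptions: `basic_clean` is a hypothetical name, the sample text and URL are invented, and the emoji-removal step is omitted because it needs the third-party `emoji` package.

```python
import re


def basic_clean(text):
    """Stdlib-only approximation of the cleaning steps (emoji removal omitted)."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)     # collapse whitespace, including newlines
    return text.lower().strip()


# hypothetical input with a newline, repeated spaces, a URL and upper case
sample = "Explanation\nWhy the edits   see http://example.com/page NOW "
cleaned = basic_clean(sample)
print(cleaned)
```

Non-string inputs such as `NaN` cells come back as an empty string, so the function is safe to apply to a whole dataframe column.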


In [8]:
display(df.sample(5))

Unnamed: 0,text,toxic,text_clean
74297,"Adoption?\nHi, Katie. I am also interested in ...",0,"adoption? hi, katie. i am also interested in a..."
53437,"""\n\n WP:APPLE's Backlog Elimination Drive is ...",0,""" wp:apple's backlog elimination drive is over..."
68821,Your recent removal of some text at Global Pos...,0,your recent removal of some text at global pos...
14003,"No, we don't need anyone to say explicitly tha...",0,"no, we don't need anyone to say explicitly tha..."
155348,"Piss off, she is an ignorant bitch.",1,"piss off, she is an ignorant bitch."


In [9]:
df_train, df_test = train_test_split(
    df,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=df["toxic"]
)

# reset indices
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

logger.info(f"Train size: {len(df_train)}, Test size: {len(df_test)}")

2026-01-07 20:08:40 [INFO] __main__: Train size: 127433, Test size: 31859
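
The `stratify` argument keeps the share of toxic comments the same in both parts of the split. The idea can be sketched in plain Python as splitting each class separately. This is a simplified illustration, not scikit-learn's implementation; `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so that every class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_size)   # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# hypothetical imbalanced labels: 10% positive
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=20)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```

With a plain random split, a small test set could by chance contain very few toxic comments, which would make the F1 estimate noisy.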


In [10]:
def bert_mean_embeddings(df_part, tokenizer, model, device, batch_size, max_length, desc):
    all_embeddings = []

    for i in tqdm(range(0, len(df_part), batch_size), desc=desc):
        batch_texts = df_part['text_clean'].iloc[i:i + batch_size].tolist()

        encoded = tokenizer(
            batch_texts,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )

        input_ids = encoded['input_ids'].to(device)
        attention_mask = encoded['attention_mask'].to(device)

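        # torch.autocast replaces the deprecated torch.cuda.amp.autocast
        with torch.no_grad(), torch.autocast(device_type=device.type, enabled=(device.type == "cuda")):
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        # pool in float32: summing many half-precision vectors loses accuracy
        last_hidden = outputs.last_hidden_state.float()
        mask = attention_mask.unsqueeze(-1).float()

        embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)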
        all_embeddings.append(embeddings.cpu().numpy())

    return np.concatenate(all_embeddings, axis=0)
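
The pooling step in `bert_mean_embeddings` averages token vectors while ignoring padding positions. A minimal pure-Python sketch of that masked mean for a single sequence is shown below. `masked_mean` and the toy vectors are illustrative assumptions, and the sketch assumes the mask contains at least one `1`.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    # assumes count > 0, i.e. the sequence has at least one real token
    return [s / count for s in total]


# two real tokens and one padding position with arbitrary values
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would be averaged in and short comments would get embeddings dominated by padding.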

In [12]:
# paths for cached embeddings
train_emb_path = "./data/X_train_emb.npy"
test_emb_path = "./data/X_test_emb.npy"
train_labels_path = "./data/y_train.npy"
test_labels_path = "./data/y_test.npy"

# ===== TRAIN =====
if os.path.exists(train_emb_path) and os.path.exists(train_labels_path):
    X_train = np.load(train_emb_path)
    y_train = np.load(train_labels_path)
    logger.info(f"Loaded cached TRAIN embeddings: {X_train.shape}")
else:
    X_train = bert_mean_embeddings(
        df_train,
        tokenizer,
        model,
        device,
        BATCH_SIZE,
        MAX_LENGTH,
        desc="Train embeddings"
    )
    y_train = df_train["toxic"].values
    np.save(train_emb_path, X_train)
    np.save(train_labels_path, y_train)
    logger.info(f"Generated and saved TRAIN embeddings: {X_train.shape}")

# ===== TEST =====
if os.path.exists(test_emb_path) and os.path.exists(test_labels_path):
    X_test = np.load(test_emb_path)
    y_test = np.load(test_labels_path)
    logger.info(f"Loaded cached TEST embeddings: {X_test.shape}")
else:
    X_test = bert_mean_embeddings(
        df_test,
        tokenizer,
        model,
        device,
        BATCH_SIZE,
        MAX_LENGTH,
        desc="Test embeddings"
    )
    y_test = df_test["toxic"].values
    np.save(test_emb_path, X_test)
    np.save(test_labels_path, y_test)
    logger.info(f"Generated and saved TEST embeddings: {X_test.shape}")

Train embeddings: 100%|██████████| 1992/1992 [09:03<00:00,  3.66it/s]
2026-01-07 20:18:01 [INFO] __main__: Generated and saved TRAIN embeddings: (127433, 768)
Test embeddings: 100%|██████████| 498/498 [02:12<00:00,  3.75it/s]
2026-01-07 20:20:14 [INFO] __main__: Generated and saved TEST embeddings: (31859, 768)


## Training

In [13]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
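
The scaler is fitted on the training embeddings only and then reused for the test embeddings, so no test statistics leak into preprocessing. A single-feature sketch of that pattern in plain Python follows. `fit_scaler` and `apply_scaler` are hypothetical helper names, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from the train data only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    return mu, sigma if sigma > 0 else 1.0  # avoid division by zero for constant features


def apply_scaler(column, mu, sigma):
    """Standardize any column with statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


train_feature = [1.0, 2.0, 3.0]
test_feature = [2.0, 4.0]

mu, sigma = fit_scaler(train_feature)
scaled_train = apply_scaler(train_feature, mu, sigma)
scaled_test = apply_scaler(test_feature, mu, sigma)  # reuses the train statistics
print(scaled_train)
print(scaled_test)
```

The scaled test values are not centered at zero, and that is expected: they are expressed in units learned from the training set.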

In [14]:
best_models = {}
models_to_run = ["LogisticRegression", "RidgeClassifier", "LGBM", "XGB"]

for model_name in models_to_run:

    def objective(trial):
        # --- choose the model and its parameters ---
        if model_name == "LogisticRegression":
            C = trial.suggest_float("C", 1e-3, 10.0, log=True)
            clf = LogisticRegression(C=C, max_iter=1000, class_weight="balanced", n_jobs=-1)
        elif model_name == "RidgeClassifier":
            alpha = trial.suggest_float("alpha", 1e-3, 10.0, log=True)
            clf = RidgeClassifier(alpha=alpha, class_weight="balanced")
        elif model_name == "LGBM":
            n_estimators = trial.suggest_int("n_estimators", 50, 500)
            max_depth = trial.suggest_int("max_depth", 2, 10)
            learning_rate = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
            clf = LGBMClassifier(
                n_estimators=n_estimators,
                max_depth=max_depth,
                learning_rate=learning_rate,
                class_weight="balanced",
                n_jobs=-1,
                verbose=-1
            )
        elif model_name == "XGB":
            n_estimators = trial.suggest_int("n_estimators", 50, 500)
            max_depth = trial.suggest_int("max_depth", 2, 10)
            learning_rate = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
            # use_label_encoder is deprecated in recent XGBoost releases and is omitted
            clf = XGBClassifier(
                n_estimators=n_estimators,
                max_depth=max_depth,
                learning_rate=learning_rate,
                eval_metric="logloss",
                n_jobs=-1,
                verbosity=0
            )

        # --- cross-validation with pruning ---
        cv = StratifiedKFold(n_splits=CV, shuffle=True, random_state=RANDOM_STATE)
        f1_scores = []

        for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
            X_tr, X_val = X_train[train_idx], X_train[val_idx]
            y_tr, y_val = y_train[train_idx], y_train[val_idx]

            clf.fit(X_tr, y_tr)
            y_pred = clf.predict(X_val)
            f1 = f1_score(y_val, y_pred)
            f1_scores.append(f1)

            # pruning: drop the trial if its running mean over the folds so far is below the median
            intermediate_value = np.mean(f1_scores)
            trial.report(intermediate_value, step=fold_idx)
            if trial.should_prune():
                raise optuna.TrialPruned()

        mean_f1 = np.mean(f1_scores)
        logger.info(f"{model_name}: Trial {trial.number}, mean F1 = {mean_f1:.4f}")
        return mean_f1

    study = optuna.create_study(
        direction="maximize",
        pruner=MedianPruner(n_startup_trials=2, n_warmup_steps=1, interval_steps=1),
        sampler=TPESampler(seed=RANDOM_STATE)
    )
    study.optimize(objective, n_trials=N_OPTUNA)

    best_models[model_name] = {
        "best_trial": study.best_trial,
        "best_value": study.best_value
    }

# --- pick the best model across all studies ---
best_model_name = max(best_models, key=lambda k: best_models[k]["best_value"])
best_trial = best_models[best_model_name]["best_trial"]
params = best_trial.params

# --- final model ---
if best_model_name == "LogisticRegression":
    final_model = LogisticRegression(**params, max_iter=1000, class_weight="balanced", n_jobs=-1)
elif best_model_name == "RidgeClassifier":
    final_model = RidgeClassifier(**params, class_weight="balanced")
elif best_model_name == "LGBM":
    final_model = LGBMClassifier(**params, class_weight="balanced", n_jobs=-1, verbose=-1)
elif best_model_name == "XGB":
    final_model = XGBClassifier(**params, eval_metric="logloss", n_jobs=-1, verbosity=0)

# fit on the whole train set
final_model.fit(X_train, y_train)

# evaluate on test
y_pred_test = final_model.predict(X_test)
test_f1 = f1_score(y_test, y_pred_test)

logger.info(f"Best model: {best_model_name}")
logger.info(f"Parameters: {params}")
logger.info(f"F1 on test: {test_f1:.4f}")


2026-01-07 20:20:49 [INFO] __main__: LogisticRegression: Trial 0, mean F1 = 0.6840
2026-01-07 20:21:17 [INFO] __main__: LogisticRegression: Trial 1, mean F1 = 0.6834
2026-01-07 20:21:19 [INFO] __main__: RidgeClassifier: Trial 0, mean F1 = 0.6529
2026-01-07 20:21:20 [INFO] __main__: RidgeClassifier: Trial 1, mean F1 = 0.6529
2026-01-07 20:22:55 [INFO] __main__: LGBM: Trial 0, mean F1 = 0.7412
2026-01-07 20:23:48 [INFO] __main__: LGBM: Trial 1, mean F1 = 0.6423
2026-01-07 20:34:38 [INFO] __main__: XGB: Trial 0, mean F1 = 0.7175
2026-01-07 20:38:24 [INFO] __main__: XGB: Trial 1, mean F1 = 0.6928
2026-01-07 20:39:34 [INFO] __main__: Best model: LGBM
2026-01-07 20:39:34 [INFO] __main__: Parameters: {'n_estimators': 315, 'max_depth': 10, 'learning_rate': 0.1615956698742364}
2026-01-07 20:39:34 [INFO] __main__: F1 on test: 0.7445
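
Several of the models above use `class_weight="balanced"`. In scikit-learn this sets each class weight to `n_samples / (n_classes * class_count)`, so rare classes get proportionally larger weights. A plain-Python sketch of that formula is below. `balanced_class_weights` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# hypothetical labels with 10% positives
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

With these weights an error on the rare class costs about nine times more than an error on the frequent one, which pushes the classifier toward higher recall on toxic comments.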


In [None]:
# neural network
# --- data for CV (train ONLY) ---
X_cv = X_train.astype(np.float32)
y_cv = y_train.astype(np.int64)

# --- simple MLP with two hidden layers ---
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, dropout):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim1),
            nn.ReLU(),
            nn.Dropout(dropout),
            # second hidden layer, sized by hidden_dim2
            nn.Linear(hidden_dim1, hidden_dim2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim2, 2)
        )

    def forward(self, x):
        return self.model(x)

# --- train a single model with early stopping ---
def train_model(
    model,
    train_dataset,
    val_dataset,
    lr,
    batch_size,
    n_epochs=N_EPOCHS,
    patience=EARLY_STOP
):
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    # IMPORTANT: class weights are computed from the train part ONLY
    y_train_local = train_dataset.tensors[1].numpy()
    class_counts = np.bincount(y_train_local)
    class_weights = class_counts.sum() / (2 * class_counts)
    class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)

    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = optim.Adam(model.parameters(), lr=lr)

    best_val_f1 = -1.0  # below any real F1, so the first epoch always sets best_state
    best_state = None
    trigger = 0

    model.to(device)

    for epoch in range(n_epochs):
        # --- train ---
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            optimizer.step()

        # --- validation ---
        model.eval()
        all_preds, all_labels = [], []

        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                logits = model(xb)
                preds = torch.argmax(logits, dim=1)
                all_preds.append(preds.cpu())
                all_labels.append(yb.cpu())

        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
        val_f1 = f1_score(all_labels, all_preds)

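        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            # snapshot the weights: state_dict() returns references that later steps overwrite
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
            trigger = 0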
        else:
            trigger += 1
            if trigger >= patience:
                break

    model.load_state_dict(best_state)
    return model, best_val_f1

# --- Optuna objective with CV (NO LEAKAGE) ---
def objective_mlp(trial):
    hidden_dim1 = trial.suggest_int("hidden_dim1", 512, 1024)
    hidden_dim2 = trial.suggest_int("hidden_dim2", 256, 512)
    dropout = trial.suggest_float("dropout", 0.05, 0.15)
    lr = trial.suggest_float("lr", 1e-3, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])

    cv = StratifiedKFold(n_splits=CV, shuffle=True, random_state=RANDOM_STATE)
    f1_scores = []

    for train_idx, val_idx in cv.split(X_cv, y_cv):
        train_ds = TensorDataset(
            torch.from_numpy(X_cv[train_idx]),
            torch.from_numpy(y_cv[train_idx])
        )
        val_ds = TensorDataset(
            torch.from_numpy(X_cv[val_idx]),
            torch.from_numpy(y_cv[val_idx])
        )

        model = MLP(
            input_dim=X_cv.shape[1],
            hidden_dim1=hidden_dim1,
            hidden_dim2=hidden_dim2,
            dropout=dropout
        )

        _, val_f1 = train_model(
            model,
            train_ds,
            val_ds,
            lr=lr,
            batch_size=batch_size
        )

        f1_scores.append(val_f1)

    return float(np.mean(f1_scores))

# --- run Optuna ---
study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=RANDOM_STATE))
study.optimize(objective_mlp, n_trials=N_OPTUNA)

best_params = study.best_trial.params
logger.info(f"Best parameters: {best_params}")
logger.info(f"F1 CV (train): {study.best_value:.4f}")

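# --- final model: retrain on train with the best parameters ---
final_model = MLP(
    input_dim=X_train.shape[1],
    hidden_dim1=best_params["hidden_dim1"],
    hidden_dim2=best_params["hidden_dim2"],
    dropout=best_params["dropout"]
)

# hold out part of train for early stopping (the 10% share is an arbitrary choice);
# validating on the training data itself would reward overfitting
X_fit, X_stop, y_fit, y_stop = train_test_split(
    X_train.astype(np.float32),
    y_train.astype(np.int64),
    test_size=0.1,
    stratify=y_train,
    random_state=RANDOM_STATE
)
fit_dataset = TensorDataset(torch.from_numpy(X_fit), torch.from_numpy(y_fit))
stop_dataset = TensorDataset(torch.from_numpy(X_stop), torch.from_numpy(y_stop))

final_model, _ = train_model(
    final_model,
    fit_dataset,
    stop_dataset,
    lr=best_params["lr"],
    batch_size=best_params["batch_size"]
)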

# --- evaluation on test (holdout) ---
test_dataset = TensorDataset(
    torch.from_numpy(X_test.astype(np.float32)),
    torch.from_numpy(y_test.astype(np.int64))
)

test_loader = DataLoader(
    test_dataset,
    batch_size=best_params["batch_size"],
    shuffle=False
)

final_model.eval()
all_preds, all_labels = [], []

with torch.no_grad():
    for xb, yb in test_loader:
        xb, yb = xb.to(device), yb.to(device)
        logits = final_model(xb)
        preds = torch.argmax(logits, dim=1)
        all_preds.append(preds.cpu())
        all_labels.append(yb.cpu())

all_preds = torch.cat(all_preds)
all_labels = torch.cat(all_labels)

test_f1 = f1_score(all_labels, all_preds)

logger.info(f"F1 test: {test_f1:.4f}")
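
The early-stopping rule inside `train_model` can be traced without any training: keep the best validation score, count epochs without improvement, and stop once the counter reaches `patience`. The sketch below replays that logic over a hypothetical list of per-epoch validation F1 values. `early_stopping` and the numbers are assumptions made for illustration.

```python
def early_stopping(val_scores, patience):
    """Return (best_epoch, best_score, epochs_run) for a sequence of validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    epochs_run = 0
    for epoch, score in enumerate(val_scores):
        epochs_run = epoch + 1
        if score > best_score:
            # improvement: remember it and reset the counter
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score, epochs_run


# hypothetical validation F1 per epoch
val_scores = [0.60, 0.68, 0.71, 0.70, 0.69, 0.72, 0.71, 0.70, 0.70]
patient = early_stopping(val_scores, patience=3)
impatient = early_stopping(val_scores, patience=2)
print(patient)
print(impatient)
```

A smaller `patience` stops sooner but can miss a later improvement, as the second call shows: it ends at epoch index 2 and never sees the higher score at index 5.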

## Conclusions

## Review checklist

- [x]  Jupyter Notebook is open
- [ ]  All code runs without errors
- [ ]  Code cells are in execution order
- [ ]  Data is loaded and prepared
- [ ]  Models are trained
- [ ]  The *F1* score is at least 0.75
- [ ]  Conclusions are written