# **DistilBERT**

As transfer learning from large pretrained models becomes more widespread, running those models under tight resource constraints becomes an increasingly pressing problem. The [paper](https://arxiv.org/pdf/1910.01108) under review presents DistilBERT, a smaller version of BERT.

The authors show that applying knowledge distillation during pre-training reduces the size of the original BERT model by 40% and makes it 60% faster, while retaining 97% of its language understanding capabilities.
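
To make the idea of knowledge distillation more concrete, here is a minimal sketch of the soft-target term only: the student is trained to match the teacher's temperature-softened output distribution. The logits, the temperature value, and the function names are illustrative assumptions. The paper combines this term with a masked language modeling loss and a cosine embedding loss, which are not shown here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student whose logits resemble the teacher's gets the smaller loss, which is the signal that pushes the student toward the teacher's behavior.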

#### **What is [BERT](https://arxiv.org/pdf/1810.04805)?**

The key innovation of this model when it appeared (2018) was the "bidirectional" training of transformers: earlier models were trained either in one direction or as a combination of unidirectional passes. A bidirectionally trained language model can therefore capture context and the flow of a text more deeply than unidirectional models.
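
The difference between unidirectional and bidirectional context can be illustrated with attention masks. This is a toy sketch, not BERT's actual implementation: `causal_mask` and `bidirectional_mask` are hypothetical helpers in which 1 means that position i may attend to position j.

```python
def causal_mask(n):
    """Left-to-right model: token i sees only tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style encoder: every token sees the whole sequence."""
    return [[1] * n for _ in range(n)]

tokens = ["the", "movie", "was", "great"]
for name, mask in [("causal", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6}: {row}")
```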


#### **How is DistilBERT optimized?**

First, the architecture is simplified:
*   6 transformer layers instead of 12 (the student is initialized by taking one out of every two pretrained BERT encoder blocks)
*   the token-type embeddings and the pooler are removed

As a result, DistilBERT has 66M parameters instead of the original model's 110M.
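
A rough sketch of the two points above: choosing every second encoder layer to initialize the student, and the resulting size reduction. The exact layer indices are an assumption (the text only says one block out of every two is taken), and the parameter counts are the rounded figures quoted above.

```python
TEACHER_LAYERS = 12

# Assumption: keep layers 0, 2, 4, ...; which layer of each pair is kept is not stated here.
student_init_layers = list(range(0, TEACHER_LAYERS, 2))
print(student_init_layers)  # [0, 2, 4, 6, 8, 10]

bert_params = 110e6
distilbert_params = 66e6
reduction = 1 - distilbert_params / bert_params
print(f"size reduction: {reduction:.0%}")  # size reduction: 40%
```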

#### How is the distilled model limited compared with the original?

* The multilingual BERT checkpoint supports 104 languages, while the `distilbert-base-uncased` checkpoint used below works only with English (a Russian version also exists, but the point stands: it handles a single language)
* BERT is pretrained with two objectives, masked language modeling (MLM) and next sentence prediction (NSP); DistilBERT drops the second
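
The MLM objective mentioned above can be sketched as follows. This is a simplified illustration of the masking recipe from the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The function `mask_tokens`, the toy vocabulary, and the seed are hypothetical and not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)       # kept as-is, but still predicted
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

vocab = ["the", "film", "was", "good", "bad", "plot", "acting"]
sentence = "the film was good but the plot was bad".split()
# mask_prob is raised above 15% here only so that a short sentence shows some masking.
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```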


Install the required libraries

In [None]:
!pip install transformers torch datasets scikit-learn -q

In [None]:
import numpy as np

Load the DistilBERT model

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The IMDb dataset was recommended for this task, so we use it

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
train_data = dataset["train"]
test_data = dataset["test"]

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_test = test_data.map(tokenize_function, batched=True)

In [None]:
tokenized_train.set_format("torch")
tokenized_test.set_format("torch")

In [None]:
import transformers
print(transformers.__version__)

4.51.1


In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": np.mean(predictions == labels)}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    eval_strategy="epoch",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1459 | 0.297203 | 0.92088 |
| 2 | 0.0757 | 0.310264 | 0.92992 |
| 3 | 0.0246 | 0.369375 | 0.93072 |


TrainOutput(global_step=4689, training_loss=0.08585143787160716, metrics={'train_runtime': 4450.1738, 'train_samples_per_second': 16.853, 'train_steps_per_second': 1.054, 'total_flos': 1.0067522297856e+16, 'train_loss': 0.08585143787160716, 'epoch': 3.0})

The model has overfit: training loss drops to 0.02 while validation loss climbs from 0.30 to 0.37, even though accuracy still creeps up slightly.
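
To see the trade-off in the training log above, the logged metrics can be compared directly. This small sketch copies the per-epoch numbers from that log; the variable names are illustrative.

```python
# (epoch, training loss, validation loss, accuracy) copied from the training log
history = [
    (1, 0.1459, 0.297203, 0.92088),
    (2, 0.0757, 0.310264, 0.92992),
    (3, 0.0246, 0.369375, 0.93072),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_acc = max(history, key=lambda row: row[3])

print(f"lowest validation loss: epoch {best_by_loss[0]}")   # epoch 1
print(f"highest accuracy:       epoch {best_by_acc[0]}")    # epoch 3
```

Because `metric_for_best_model="accuracy"` is set, `load_best_model_at_end` keeps the epoch-3 checkpoint even though validation loss was lowest after epoch 1.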

In [None]:
results = trainer.evaluate()
print(f"accuracy: {results['eval_accuracy']:.2f}")

accuracy: 0.93


Preparing the data for the LSTM

In [None]:
import torch  # needed below for torch.FloatTensor / torch.LongTensor
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import TensorDataset, DataLoader

vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(train_data["text"])
X_test = vectorizer.transform(test_data["text"])
y_train = train_data["label"]
y_test = test_data["label"]

train_dataset = TensorDataset(torch.FloatTensor(X_train.toarray()), torch.LongTensor(y_train))
test_dataset = TensorDataset(torch.FloatTensor(X_test.toarray()), torch.LongTensor(y_test))

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

LSTM model

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def prepare_data(texts, labels, max_len=256):
    tokenized = [tokenizer.encode(text, truncation=True, max_length=max_len, add_special_tokens=False) for text in texts]
    padded = [seq + [0]*(max_len - len(seq)) for seq in tokenized]
    return torch.LongTensor(padded), torch.FloatTensor(labels)

X_train, y_train = prepare_data(dataset['train']['text'], dataset['train']['label'])
X_test, y_test = prepare_data(dataset['test']['text'], dataset['test']['label'])

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train,
    test_size=0.2,
    random_state=42,
    shuffle=True
)

batch_size = 64
train_loader = DataLoader(
    TensorDataset(X_train, y_train),
    batch_size=batch_size,
    shuffle=True
)
val_loader = DataLoader(
    TensorDataset(X_val, y_val),
    batch_size=batch_size
)



In [None]:
criterion = nn.BCEWithLogitsLoss()

class IMDB_LSTM(nn.Module):
    def __init__(self):
        super().__init__()

        self.embedding = nn.Embedding(tokenizer.vocab_size, 300)
        self.embedding.weight.data.uniform_(-0.1, 0.1)

        self.lstm = nn.LSTM(
            input_size=300,
            hidden_size=256,
            num_layers=2,
            dropout=0.3,
            batch_first=True,
            bidirectional=True
        )

        self.fc = nn.Linear(256 * 2, 1)
        self.dropout = nn.Dropout(0.5)

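    def forward(self, x):
        x = self.dropout(self.embedding(x))
        lstm_out, _ = self.lstm(x)
        # Forward direction: output at the last time step (first 256 features).
        # Backward direction: output at the first time step (last 256 features).
        last_out = torch.cat((lstm_out[:, -1, :256], lstm_out[:, 0, 256:]), dim=1)
        return self.fc(self.dropout(last_out)).squeeze(1)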


model = IMDB_LSTM().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,
    steps_per_epoch=len(train_loader),
    epochs=10
)

for epoch in range(10):
    model.train()
    train_loss = 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0
    correct = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()
            preds = torch.round(torch.sigmoid(outputs))
            correct += (preds == labels).sum().item()

    print(f"Epoch {epoch+1}")
    print(f"Train Loss: {train_loss/len(train_loader):.4f} | Val Loss: {val_loss/len(val_loader):.4f} | Val Acc: {correct/len(X_val):.4f}")
    print("-" * 50)

Epoch 1
Train Loss: 0.6932 | Val Loss: 0.6915 | Val Acc: 0.5082
--------------------------------------------------
Epoch 2
Train Loss: 0.6775 | Val Loss: 0.6698 | Val Acc: 0.5982
--------------------------------------------------
Epoch 3
Train Loss: 0.6529 | Val Loss: 0.5472 | Val Acc: 0.7088
--------------------------------------------------
Epoch 4
Train Loss: 0.5010 | Val Loss: 0.4729 | Val Acc: 0.7834
--------------------------------------------------
Epoch 5
Train Loss: 0.3897 | Val Loss: 0.4765 | Val Acc: 0.7778
--------------------------------------------------
Epoch 6
Train Loss: 0.2854 | Val Loss: 0.4001 | Val Acc: 0.8322
--------------------------------------------------
Epoch 7
Train Loss: 0.2118 | Val Loss: 0.4163 | Val Acc: 0.8300
--------------------------------------------------
Epoch 8
Train Loss: 0.1504 | Val Loss: 0.4310 | Val Acc: 0.8512
--------------------------------------------------
Epoch 9
Train Loss: 0.1178 | Val Loss: 0.4456 | Val Acc: 0.8446
----------------

Validation loss is lowest at epoch 6 and validation accuracy peaks at epoch 8; after that the model appears to be overfitting

In [None]:
model.eval()
test_correct = 0
with torch.no_grad():
    for inputs, labels in DataLoader(TensorDataset(X_test, y_test), batch_size=batch_size):
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = torch.round(torch.sigmoid(model(inputs)))
        test_correct += (outputs == labels).sum().item()


In [None]:
print(f"LSTM accuracy: {test_correct/len(X_test):.4f}")


LSTM accuracy: 0.8125


Overall the result is reasonably good. I will not tune it further: the free Google Colab tier has limited resources, and I still need to improve the DistilBERT results.

Below is an attempt to improve the DistilBERT results by preventing overfitting

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
import numpy as np
import torch

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    dropout=0.2,
    attention_dropout=0.2
)


In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

tokenized_data = dataset.map(tokenize_function, batched=True)
tokenized_data.set_format("torch")

split_data = tokenized_data["train"].train_test_split(test_size=0.1)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": np.mean(predictions == labels)}



In [None]:
training_args = TrainingArguments(
    output_dir="./results1",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    logging_steps=50,
    logging_dir="./logs1",
    report_to="none",
    warmup_steps = 500
)



In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_data["train"],
    eval_dataset=split_data["test"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=2),
    ]
)



In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.283 | 0.254584 | 0.8964 |
| 2 | 0.2385 | 0.247469 | 0.904 |
| 3 | 0.194 | 0.256379 | 0.9068 |
| 4 | 0.1314 | 0.294512 | 0.908 |
| 5 | 0.1152 | 0.374657 | 0.9088 |
| 6 | 0.0677 | 0.414088 | 0.9064 |
| 7 | 0.0523 | 0.482222 | 0.9096 |


KeyboardInterrupt: 

Validation loss rises while training loss keeps falling, so the model is clearly overfitting. Disappointing.
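
To understand why early stopping did not end this run, the patience logic can be replayed on the logged metrics. This is a simplified re-implementation for illustration, not the code of `EarlyStoppingCallback`; the helper `stop_epoch` is hypothetical, and the numbers are copied from the training log above.

```python
def stop_epoch(values, patience, greater_is_better):
    """Return the epoch at which training would stop, or None if it never does."""
    best = None
    bad_epochs = 0
    for epoch, value in enumerate(values, start=1):
        improved = best is None or (value > best if greater_is_better else value < best)
        if improved:
            best, bad_epochs = value, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Per-epoch metrics copied from the training log (epochs 1-7).
accuracy = [0.8964, 0.904, 0.9068, 0.908, 0.9088, 0.9064, 0.9096]
val_loss = [0.254584, 0.247469, 0.256379, 0.294512, 0.374657, 0.414088, 0.482222]

print(stop_epoch(accuracy, patience=2, greater_is_better=True))   # None
print(stop_epoch(val_loss, patience=2, greater_is_better=False))  # 4
```

With accuracy as the monitored metric, the stopping condition is never met in these seven epochs, since accuracy keeps inching upward. Monitoring validation loss instead (for example `metric_for_best_model="eval_loss"` with `greater_is_better=False`) would likely have stopped the run after epoch 4.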

In [None]:
final_results = trainer.evaluate(tokenized_data["test"])
print(f"Final Test Accuracy: {final_results['eval_accuracy']:.4f}")

In [None]:
def tokenize_function1(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_data1 = dataset.map(tokenize_function1, batched=True)
tokenized_data1.set_format("torch")

split_data1 = tokenized_data1["train"].train_test_split(test_size=0.1)

In [None]:
model1 = DistilBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    dropout=0.3,  # increased dropout
    attention_dropout=0.3  # attention dropout increased as well
)

In [None]:
training_args = TrainingArguments(
    output_dir="./results2",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    logging_steps=50,
    logging_dir="./logs2",
    report_to="none",
    warmup_steps = 500
)

In [None]:
from transformers import EarlyStoppingCallback
trainer1 = Trainer(
    model=model1,
    args=training_args,
    train_dataset=split_data1["train"],
    eval_dataset=split_data1["test"],
    compute_metrics=compute_metrics,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=2),
    ]
)


In [None]:
trainer1.train()

KeyboardInterrupt: 

The attempt above failed: training took too long
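
A quick way to gauge how long a run will take is to count optimizer steps before launching it. The sketch below assumes a single device and no gradient accumulation; `total_steps` is a hypothetical helper, and the dataset sizes and batch sizes are taken from the cells in this notebook.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation (an assumption)."""
    return math.ceil(num_examples / batch_size) * epochs

# First run: full 25,000-example train split, batch 16, 3 epochs.
print(total_steps(25_000, 16, 3))    # 4689, matches global_step in the first TrainOutput
# Interrupted runs: 90% of the train split (22,500 examples), batch 32, up to 10 epochs.
print(total_steps(22_500, 32, 10))   # 7040
# Reduced setup tried next: 2,000 examples, batch 32, 3 epochs.
print(total_steps(2_000, 32, 3))     # 189
```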

In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import load_dataset

# 1. Load the data and take a smaller subset
dataset = load_dataset("imdb")
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(2000))

# 2. Tokenization
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenized_train = small_train.map(tokenize_function, batched=True)
tokenized_test = small_test.map(tokenize_function, batched=True)


In [None]:
from collections import Counter
print(Counter(small_test["label"]))

Counter({1: 1000, 0: 1000})


In [None]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=3e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    logging_steps=50,
    logging_dir="./logs",
    report_to="none",
    warmup_steps = 500
)

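trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    # Needed because metric_for_best_model="accuracy"; reuses compute_metrics defined earlier.
    compute_metrics=compute_metrics,
)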


In [None]:
trainer.train()

KeyboardInterrupt: 

In [None]:
!pip install -q transformers datasets scikit-learn

In [None]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np


dataset = load_dataset("imdb")
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(1000))


In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_test = small_test.map(tokenize_function, batched=True)
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Load a model already fine-tuned on SST-2
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)


In [None]:
def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=32,
    do_train=False,
    do_eval=True,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)



In [None]:
eval_result = trainer.evaluate()



In [None]:
print(eval_result)

{'eval_loss': 0.48373621702194214, 'eval_model_preparation_time': 0.0015, 'eval_accuracy': 0.881, 'eval_f1': 0.8759124087591241, 'eval_runtime': 13.7596, 'eval_samples_per_second': 72.677, 'eval_steps_per_second': 2.326}


In [None]:
print('accuracy of the pretrained model: ', eval_result['eval_accuracy'])

accuracy of the pretrained model:  0.881


In [None]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")



In [None]:

dataset = load_dataset("imdb")
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(1000))

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_train = small_train.map(tokenize_function, batched=True)
tokenized_test = small_test.map(tokenize_function, batched=True)

tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "label"])
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="no",
    logging_strategy="epoch",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=False,
    save_total_limit=1,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)


In [None]:
trainer.train()

eval_result = trainer.evaluate()
print('accuracy: ', eval_result['eval_accuracy'])

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.509 | 0.335725 | 0.857 | 0.865221 |
| 2 | 0.2103 | 0.319948 | 0.893 | 0.888425 |


accuracy:  0.893


In [None]:
dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42).select(range(8000))
test_data = dataset["test"].shuffle(seed=42).select(range(2000))

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_train = train_data.map(tokenize, batched=True)
tokenized_test = test_data.map(tokenize, batched=True)

tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "label"])
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased").to(device)

def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds)
    }


In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="no",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=False,
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.4101 | 0.249708 | 0.9005 | 0.894987 |
| 2 | 0.1965 | 0.214701 | 0.9145 | 0.916707 |
| 3 | 0.1225 | 0.23512 | 0.921 | 0.922014 |


TrainOutput(global_step=750, training_loss=0.24302073923746745, metrics={'train_runtime': 300.6707, 'train_samples_per_second': 79.822, 'train_steps_per_second': 2.494, 'total_flos': 3179217567744000.0, 'train_loss': 0.24302073923746745, 'epoch': 3.0})

In [None]:
eval_result = trainer.evaluate()
print('accuracy: ', eval_result['eval_accuracy'])

accuracy:  0.921


### The results now look reasonable


In [None]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

from transformers import EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    greater_is_better=True,
    num_train_epochs=5,  # set to 5, but early stopping should end training sooner
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    fp16=torch.cuda.is_available(),
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)


In [None]:
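# Note: these full shuffled splits are never tokenized or passed to the trainer,
# so the run below still uses the 8,000-example subset.
train_data = dataset["train"].shuffle(seed=42)
test_data = dataset["test"].shuffle(seed=42)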

In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3304 | 0.224008 | 0.9105 | 0.907874 |
| 2 | 0.1821 | 0.210415 | 0.9205 | 0.920221 |
| 3 | 0.1127 | 0.285599 | 0.9155 | 0.917279 |
| 4 | 0.0651 | 0.329322 | 0.9115 | 0.913616 |


TrainOutput(global_step=1000, training_loss=0.1725749454498291, metrics={'train_runtime': 426.3174, 'train_samples_per_second': 93.827, 'train_steps_per_second': 2.932, 'total_flos': 4238956756992000.0, 'train_loss': 0.1725749454498291, 'epoch': 4.0})

In [None]:
eval_result = trainer.evaluate()
print('accuracy: ', eval_result['eval_accuracy'])

accuracy:  0.9205
