Tokenization can significantly affect model quality. In this task, you will shrink the tokenizer's vocabulary to reduce the size of the model while maintaining quality. The downstream problem is sentiment analysis. The first part of the notebook implements a baseline that measures the quality of the original model. In the second part, you need to remove at least 50% of the tokens from the tokenizer, reinitialize the embedding layer of the BERT model, and retrain the model. Classification quality, measured by the F-measure, should drop by no more than 2% compared to the original model.
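
The two requirements can be written down as a small check. This is only a sketch: the helper name `meets_task_constraints` is hypothetical, the numbers in the usage lines are made up, and it assumes that "2%" means an absolute drop of 0.02 in the score (a relative reading would need a different comparison).

```python
def meets_task_constraints(orig_vocab_size, new_vocab_size, orig_score, new_score,
                           min_removed=0.5, max_drop=0.02):
    """Return (removed_fraction, score_drop, ok) for the two task constraints.

    Assumption: the allowed quality drop is an absolute difference in score.
    """
    removed_fraction = (orig_vocab_size - new_vocab_size) / orig_vocab_size
    score_drop = orig_score - new_score
    ok = removed_fraction >= min_removed and score_drop <= max_drop
    return removed_fraction, score_drop, ok


# Hypothetical numbers, for illustration only
removed, drop, ok = meets_task_constraints(30000, 14000, 0.90, 0.89)
print(f"removed {removed:.1%} of the vocabulary, score drop {drop:.3f}, ok={ok}")
```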

### Baseline

In [2]:
!pip install transformers datasets evaluate accelerate tokenizers

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [3]:
import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    BertTokenizerFast
)
from datasets import load_dataset
import evaluate
import os
from tokenizers import BertWordPieceTokenizer

In [4]:
# 1. Load the data
dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42).select(range(5000))
test_data = dataset["test"].shuffle(seed=42).select(range(1000))


In [71]:
# 2. Original tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

train_dataset = train_data.map(tokenize_function, batched=True)
test_dataset = test_data.map(tokenize_function, batched=True)

In [6]:
# 3. Train the original model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

model.cuda()

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to=['tensorboard'],
    bf16=True,
)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.





| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.339915 | 0.863 |
| 2 | No log | 0.292434 | 0.881 |
| 3 | No log | 0.296731 | 0.895 |




TrainOutput(global_step=471, training_loss=0.2479311544171311, metrics={'train_runtime': 180.9323, 'train_samples_per_second': 82.904, 'train_steps_per_second': 2.603, 'total_flos': 1973332915200000.0, 'train_loss': 0.2479311544171311, 'epoch': 3.0})
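
The task statement is phrased in terms of the F-measure, while `compute_metrics` above reports accuracy only. A minimal sketch of a variant that reports both for binary labels, using plain NumPy; the names `binary_f1` and `compute_metrics_with_f1` are hypothetical and not part of the notebook.

```python
import numpy as np


def binary_f1(predictions, labels, positive=1):
    """F1 score for one positive class: 2*TP / (2*TP + FP + FN)."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    tp = np.sum((predictions == positive) & (labels == positive))
    fp = np.sum((predictions == positive) & (labels != positive))
    fn = np.sum((predictions != positive) & (labels == positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def compute_metrics_with_f1(eval_pred):
    """Same unpacking as compute_metrics, but returns accuracy and F1."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": float(np.mean(predictions == labels)),
        "f1": binary_f1(predictions, labels),
    }


# Toy check on four made-up examples
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 0, 1])
scores = compute_metrics_with_f1((logits, labels))
print(scores)
```

Passing such a function as `compute_metrics` to `Trainer` would add an `eval_f1` entry next to `eval_accuracy` in the evaluation output.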

In [7]:
# 4. Evaluate the original model
original_accuracy = trainer.evaluate()["eval_accuracy"]
print(f"Original accuracy: {original_accuracy:.4f}")


Original accuracy: 0.8950


### Task 1

Remove at least 50% of the tokens from the original tokenizer, reinitialize it, and name the resulting tokenizer `new_tokenizer`.

In [8]:
from collections import Counter

def get_top_k_tokens(tokenizer, texts, k: int = 10000) -> Counter:
    """
    Count how often each token id occurs in a list of texts under the given tokenizer.

    1) Iterate over the texts in batches and tokenize each batch with the provided tokenizer.
    2) Accumulate the number of occurrences of every token id.

    Despite its name, the function returns the full counter rather than a top-`k` list;
    call `.most_common(k)` on the result to get the `k` most frequent token ids.

    Parameters:
    - tokenizer (Tokenizer): The tokenizer object.
    - texts (list): The list of texts to count tokens in.
    - k (int): Number of most frequent tokens of interest (currently unused).

    Returns:
    - Counter: Mapping from token id to its number of occurrences in `texts`.
    """
    token_counts = Counter()
    batch_size = 1000

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        encodings = tokenizer.batch_encode_plus(batch_texts, padding=False, truncation=False)
        for tokens in encodings['input_ids']:
            token_counts.update(tokens)

    return token_counts

In [9]:
texts = train_data['text'] + test_data['text']

In [10]:
top_tokens = get_top_k_tokens(tokenizer, texts)

Token indices sequence length is longer than the specified maximum sequence length for this model (936 > 512). Running this sequence through the model will result in indexing errors


In [160]:
# Token counts sorted by ascending frequency
sorted_counts = sorted(top_tokens.items(), key=lambda x: x[1])
top_tokens_ids = dict(sorted_counts)
top_tokens_keys = {tokenizer.decode([k]): v for k, v in sorted_counts}
val_median = np.median(list(top_tokens_keys.values()))
# Tokens seen at most 20 times are removed; more frequent tokens are kept.
# Tokens that never occur in `texts` are absent from the counter and therefore stay in the vocabulary.
top_tokens_to_delete = {tokenizer.decode([k]): v for k, v in sorted_counts if v <= 20}
top_tokens_to_stay = {tokenizer.decode([k]): v for k, v in sorted_counts if v > 20}
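
The cutoff of 20 occurrences used above is fixed by hand. As a rough sketch, the smallest cutoff that reaches a target removal fraction could instead be derived from the counts. The helper `smallest_threshold` and the toy counts are hypothetical; the sketch follows the same convention as the cell above, where only tokens that occur at least once in the corpus are candidates for removal.

```python
from collections import Counter


def smallest_threshold(token_counts, vocab_size, target_fraction=0.5):
    """Smallest t such that dropping tokens with 1 <= count <= t removes
    at least target_fraction of the vocabulary; None if that is unreachable."""
    needed = target_fraction * vocab_size
    # Histogram: how many distinct tokens share each count value
    tokens_per_count = Counter(token_counts.values())
    removed = 0
    for count in sorted(tokens_per_count):
        removed += tokens_per_count[count]
        if removed >= needed:
            return count
    return None


# Toy counts for a vocabulary of 8 tokens, 2 of which never occur
toy_counts = Counter({"the": 50, "movie": 30, "plot": 3, "##ly": 2, "zebra": 1, "quark": 1})
threshold = smallest_threshold(toy_counts, vocab_size=8, target_fraction=0.5)
print(threshold)
```

A result of `None` signals that the corpus does not cover enough of the vocabulary, in which case unseen tokens would also have to be removed to hit the target.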

In [161]:
#TODO Your code is here
from copy import deepcopy
import json
from tokenizers import models

new_tokenizer = deepcopy(tokenizer)
# Serialize the backend model so that its vocabulary can be edited as a plain dict
model_state = json.loads(new_tokenizer.backend_tokenizer.model.__getstate__())
for word in top_tokens_to_delete:
    del model_state["vocab"][word]

print(len(model_state["vocab"]))
# Renumber the surviving tokens contiguously and remember old id -> new id
new_model_vocab = {}
mapping = {}
for new_tok_id, (tok, tok_id) in enumerate(model_state["vocab"].items()):
    new_model_vocab[tok] = new_tok_id
    mapping[tok_id] = new_tok_id
model_state["vocab"] = new_model_vocab

# Rebuild the backend model from the edited state
model_class = getattr(models, model_state.pop("type"))
new_tokenizer.backend_tokenizer.model = model_class(**model_state)


print(f"\nVocabulary sizes:")
print(f"Original: {len(tokenizer.vocab)}")
print(f"New: {len(new_tokenizer.vocab)}")
print(f"Tokens removed: {len(tokenizer.vocab) - len(new_tokenizer.vocab)}")
print(f"Tokens removed, %: {(len(tokenizer.vocab) - len(new_tokenizer.vocab)) * 100 / len(tokenizer.vocab)}")

14759

Vocabulary sizes:
Original: 30522
New: 14759
Tokens removed: 15763
Tokens removed, %: 51.644715287333725


In [162]:
vocab_inv = {v:k for k,v in new_tokenizer.vocab.items()}
vocab_inv[new_tokenizer(['celebrate'])['input_ids'][0][3]]

'##brate'
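
The check above shows that `celebrate` is now split into several pieces, the last of which is `##brate`. WordPiece splits a word greedily, always taking the longest prefix found in the vocabulary, so removing a whole-word token makes the tokenizer fall back to shorter pieces. A pure-Python sketch of that matching rule follows; the toy vocabulary is invented for illustration, and this is not the `tokenizers` implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Invented toy vocabulary
full_vocab = {"celebrate", "ce", "##le", "##brate", "[UNK]"}
pruned_vocab = full_vocab - {"celebrate"}

before = wordpiece_split("celebrate", full_vocab)
after = wordpiece_split("celebrate", pruned_vocab)
print(before, after)
```

Longer token sequences are the price of a smaller vocabulary: with a fixed `max_length`, more of each review may be truncated.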

### Task 2:

Initialize a new embedding layer for the BERT model that matches the new tokenizer.
Retrain the model on the classification task. Classification quality, measured by the F-measure, should drop by no more than 2% compared to the original model.

In [163]:
#TODO Your code is here


# 7. Reinitialize the model with the new vocabulary
new_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    ignore_mismatched_sizes=True  # Important when the embedding size changes!
)

# Collect the pretrained embedding row of every surviving token at its new id
new_embeds = [0] * len(new_tokenizer.vocab)
for old_id, new_id in mapping.items():
    new_embeds[new_id] = new_model.get_input_embeddings().weight.data[old_id]
new_embeds = torch.stack(new_embeds)

# Shrink the embedding layer to the new vocabulary size, then overwrite its weights
new_model.resize_token_embeddings(len(new_tokenizer.vocab))

new_model.get_input_embeddings().weight.data = new_embeds

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
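
The cell above copies each surviving token's pretrained embedding row to its new position. The same gather can be expressed as a single indexing operation; the sketch below shows the idea with NumPy on a toy matrix. The helper `remap_embedding_rows` and the toy data are hypothetical.

```python
import numpy as np


def remap_embedding_rows(old_embeddings, mapping):
    """Build the new embedding matrix from an old-id -> new-id mapping.

    Assumes the new ids are contiguous and start at 0.
    """
    new_size = len(mapping)
    old_ids = np.empty(new_size, dtype=np.int64)
    for old_id, new_id in mapping.items():
        old_ids[new_id] = old_id  # row new_id of the result comes from row old_id
    return old_embeddings[old_ids]


# Toy example: 5 tokens with 3-dimensional embeddings, tokens 1 and 3 removed
old_embeddings = np.arange(15, dtype=np.float32).reshape(5, 3)
mapping = {0: 0, 2: 1, 4: 2}
new_embeddings = remap_embedding_rows(old_embeddings, mapping)
print(new_embeddings)
```

Reusing the pretrained rows, rather than initializing the smaller embedding layer randomly, is what lets the retrained model start from the same representations as the baseline.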


In [164]:
# 8. Train with the new tokenizer
def new_tokenize_function(examples):
    return new_tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

new_train_dataset = train_data.map(new_tokenize_function, batched=True)
new_test_dataset = test_data.map(new_tokenize_function, batched=True)

new_trainer = Trainer(
    model=new_model,
    args=training_args,
    train_dataset=new_train_dataset,
    eval_dataset=new_test_dataset,
    compute_metrics=compute_metrics,
)

new_trainer.train()
new_accuracy = new_trainer.evaluate()["eval_accuracy"]




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.342475 | 0.858 |
| 2 | No log | 0.294698 | 0.875 |
| 3 | No log | 0.288918 | 0.897 |




In [166]:
# 9. Results
print("\nComparison results:")
print(f"Original accuracy: {original_accuracy:.4f}")
print(f"Accuracy with the new vocabulary: {new_accuracy:.4f}")
print(f"Difference: {original_accuracy - new_accuracy:.4f}")
print(f"Vocabulary reduction: {(len(tokenizer.vocab) - len(new_tokenizer.vocab))/len(tokenizer.vocab):.2%}")


Comparison results:
Original accuracy: 0.8950
Accuracy with the new vocabulary: 0.8970
Difference: -0.0020
Vocabulary reduction: 51.64%
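
The 51.64% figure above is the share of vocabulary entries removed, not the share of the whole model. A rough estimate of what it means in parameters, assuming the hidden size of 768 used by `bert-base-uncased` and counting only the word-embedding matrix (position embeddings, encoder layers, and the classifier are unchanged):

```python
HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
orig_vocab, new_vocab = 30522, 14759  # vocabulary sizes reported in Task 1


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


saved = embedding_params(orig_vocab) - embedding_params(new_vocab)
print(f"{embedding_params(orig_vocab):,} -> {embedding_params(new_vocab):,} embedding parameters")
print(f"saved: {saved:,} parameters")  # saved: 12,105,984 parameters
print(f"approx. {saved * 4 / 2**20:.1f} MiB in float32")
```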
