# Week 4: Transfer Learning, BERT (Seminar)

### Using pretrained transformers (for fun, profit and 1 point)

There are many toolkits that let you access pretrained transformer models (like we used pretrained embeddings earlier), but the most powerful and convenient by far is 🤗[`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pretrained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pretrained model, you can do that in one line of code using a `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
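
To make the three stages concrete, here is a toy sketch in plain Python. It is not how `transformers` implements pipelines: the `LEXICON` dictionary and all function names are made up for illustration, and a tiny word-score lookup stands in for the neural backbone.

```python
import math

# Hypothetical toy lexicon standing in for a trained backbone model.
LEXICON = {"useful": 2.0, "great": 2.0, "boring": -2.0, "awful": -2.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Pre-processing: lowercase, split on whitespace, strip punctuation."""
    return [tok.strip(".,!?") for tok in text.lower().split()]


def backbone(tokens):
    """'Model': sum the word scores and return logits for [NEGATIVE, POSITIVE]."""
    score = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    return [-score, score]


def postprocess(logits):
    """Post-processing: softmax over the logits, then pick the best label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(texts):
    """Chain the three stages for every input text."""
    return [postprocess(backbone(preprocess(text))) for text in texts]


print(toy_pipeline(["transformers library can be really useful!"]))
```

A real pipeline has the same shape, with a subword tokenizer and a transformer in place of the first two stages.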

Let's see it in action:

In [3]:
import transformers

In [4]:
sentiment_clf = transformers.pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentiment_clf(["transformers library can be really useful!", "YSDA midterm is soon"])

[{'label': 'POSITIVE', 'score': 0.9959487915039062},
 {'label': 'NEGATIVE', 'score': 0.986364483833313}]

In [5]:
transformers.pipelines.SUPPORTED_TASKS.keys()

dict_keys(['audio-classification', 'automatic-speech-recognition', 'text-to-audio', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-audio-classification', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-to-text', 'image-text-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification', 'mask-generation', 'image-to-image', 'keypoint-matching'])

But how do we find a model suited to a given task in such a large model space?

Option 1: Use search and filters on the [web hub](https://huggingface.co/models) (user-friendly)

Option 2: Use the `huggingface_hub` library to query the API from Python (handy if you want to automate the process)


In [6]:
import huggingface_hub

In [7]:
some_model = next(huggingface_hub.list_models())

some_model

ModelInfo(id='deepseek-ai/DeepSeek-OCR', author=None, sha=None, created_at=datetime.datetime(2025, 10, 17, 6, 22, 5, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=730692, downloads_all_time=None, gated=None, gguf=None, inference=None, inference_provider_mapping=None, likes=1853, library_name=None, tags=['safetensors', 'deepseek_vl_v2', 'deepseek', 'vision-language', 'ocr', 'custom_code', 'image-text-to-text', 'multilingual', 'arxiv:2510.18234', 'license:mit', 'region:us'], pipeline_tag='image-text-to-text', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=1853, siblings=None, spaces=None, safetensors=None, security_repo_status=None, xet_enabled=None)

In [8]:
# Named `model_filter` to avoid shadowing the built-in `filter`
model_filter = (
    "sentiment-analysis",
    "pytorch",
    "ru",
)

filtered_models = huggingface_hub.list_models(
    filter=model_filter,
    sort="downloads",
    limit=10,
)

print(f"Filtered by {model_filter}:")
for model_info in filtered_models:
    print(f"- https://huggingface.co/{model_info.id} ({model_info.downloads} downloads, {model_info.likes} likes)")

Filtered by ('sentiment-analysis', 'pytorch', 'ru'):
- https://huggingface.co/seara/rubert-tiny2-russian-sentiment (162964 downloads, 29 likes)
- https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1 (53720 downloads, 60 likes)
- https://huggingface.co/r1char9/rubert-base-cased-russian-sentiment (6771 downloads, 12 likes)
- https://huggingface.co/yangheng/deberta-v3-large-absa-v1.1 (535 downloads, 20 likes)
- https://huggingface.co/seara/rubert-tiny2-russian-emotion-detection-ru-go-emotions (443 downloads, 9 likes)
- https://huggingface.co/seara/rubert-base-cased-russian-emotion-detection-cedr (297 downloads, 3 likes)
- https://huggingface.co/seara/rubert-base-cased-russian-emotion-detection-ru-go-emotions (153 downloads, 4 likes)
- https://huggingface.co/seara/rubert-base-cased-russian-sentiment (110 downloads, 11 likes)
- https://huggingface.co/oxygeneDev/sentiment-multilingual (39 downloads, 0 likes)
- https://huggingface.co/seara/rubert-tiny2-russian-emotion-detection-cedr (31 

Imagine you have a long text to read and very little time. Luckily, one of the pipelines can help! But which one?

**Task 1 (0.5 points)**
- Find a suitable pipeline and model for the text below
- Apply the model to the long text to get a short one
- Pretty-print the result and give your opinion on whether the short text is good



In [11]:
from transformers import pipeline
long_text = """
The widespread adoption of remote work, accelerated by global events in the early 2020s, has triggered a significant and likely permanent shift in how we think about the workplace. This transition away from the traditional central office is having profound and multifaceted effects on urban economies, reshaping everything from commercial real estate to local small businesses.

One of the most immediate and visible impacts has been on the commercial real estate sector. With companies downsizing their physical footprints or adopting fully remote models, demand for office space has plummeted. This has led to rising vacancy rates, downward pressure on commercial rent prices, and a re-evaluation of the financial viability of large office buildings. City governments, which often rely heavily on property taxes from these high-value commercial properties, are now facing substantial budget shortfalls.

Furthermore, the daily rhythm of city centers has changed dramatically. The decline in the number of commuters has had a ripple effect on local businesses that once thrived on their patronage. Lunchtime cafes, after-work bars, dry cleaners, and public transit systems have all experienced a significant drop in revenue. This "doughnut effect" describes a phenomenon where the economic activity hollows out in the city center and increases in suburban residential areas as people work from home and spend their money locally.

However, it's not all negative. This shift also presents new opportunities. Some urban planners see a chance to repurpose vacant office buildings into much-needed residential housing, which could help address housing shortages and revitalize neighborhoods by creating 24/7 communities. Additionally, the ability to work remotely has spurred a reversal of rural depopulation in some regions, as professionals seek a better quality of life outside of major metropolitan areas, potentially distributing economic growth more evenly.

In conclusion, the remote work revolution is fundamentally restructuring urban economies. While it presents serious challenges to established systems like commercial real estate and downtown commerce, it also opens the door to innovative urban renewal and a more geographically dispersed economic landscape. The long-term effects will depend on how effectively cities and businesses can adapt to this new, more flexible paradigm.
"""
summarization_pipeline = pipeline("summarization")

short_text = summarization_pipeline(
    long_text,
    max_length=130,
    min_length=30,
    do_sample=False
)[0]['summary_text']

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [12]:
assert len(long_text) / len(short_text) > 5, "Too long, didn't read"

One of the self-supervised objectives used in BERT pretraining is Masked Language Modeling (MLM), so the model has some text prediction capabilities!



In [14]:
mlm_model = transformers.pipeline(
    task="fill-mask",
    model="bert-base-cased"
)

mlm_model("My name is [MASK] Shady!")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



[{'score': 0.0911569893360138,
  'token': 22191,
  'token_str': 'Slim',
  'sequence': 'My name is Slim Shady!'},
 {'score': 0.025677727535367012,
  'token': 2791,
  'token_str': 'Captain',
  'sequence': 'My name is Captain Shady!'},
 {'score': 0.023223936557769775,
  'token': 3056,
  'token_str': 'Miss',
  'sequence': 'My name is Miss Shady!'},
 {'score': 0.016291704028844833,
  'token': 4479,
  'token_str': 'Jimmy',
  'sequence': 'My name is Jimmy Shady!'},
 {'score': 0.013304632157087326,
  'token': 13960,
  'token_str': 'Mister',
  'sequence': 'My name is Mister Shady!'}]

To make the result more readable, we can take just the top-1 prediction:

In [15]:
mlm_model("My name is [MASK] Shady!")[0]["sequence"]

'My name is Slim Shady!'

**Task 2 (0.5 points)**
- Using BERT's ability to solve the MLM task, find answers to the following questions
- Perform some fact-checking, don't trust LLMs!

**Questions:**
- When was YSDA founded?
- Who was the first to invent the radio?
- What is the fifth Fibonacci number?
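
One possible way to approach this is to rephrase each question as a cloze sentence with a single `[MASK]`. The sketch below is only a starting point: the prompts and the helper names `top_fillers` and `fibonacci` are illustrative assumptions, and `fill_mask` is any callable that returns results in the same format as `mlm_model` above. Keep in mind that one `[MASK]` predicts exactly one token, so answers that span several WordPiece tokens cannot come out whole.

```python
# Illustrative cloze prompts (assumptions, feel free to rephrase them).
PROMPTS = [
    "YSDA was founded in [MASK].",
    "The radio was invented by [MASK].",
    "The fifth Fibonacci number is [MASK].",
]


def top_fillers(fill_mask, prompt, k=3):
    """Return the k best (token, score) pairs for a prompt with one [MASK]."""
    if "[MASK]" not in prompt:
        raise ValueError("prompt must contain a [MASK] token")
    results = fill_mask(prompt)
    return [(r["token_str"], r["score"]) for r in results[:k]]


def fibonacci(n):
    """n-th Fibonacci number under the convention F(1) = F(2) = 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a


# Reference value for fact-checking the third question.
print(fibonacci(5))  # 5
```

In the notebook this could be called as `top_fillers(mlm_model, PROMPTS[0])`. Note that under the other common convention, where the sequence starts from 0, the fifth number is 3, so the "right" answer depends on indexing.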

---

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a PyTorch `nn.Module` with pretrained weights

You can use such models as part of your regular PyTorch code: insert one as a layer in your own model, apply it to a batch of data, backpropagate, optimize, etc.

In [16]:
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

In [17]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    "I have no idea what pneumonoultramicroscopicsilicovolcanoconiosis is."
]

tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
print("Tokenized:")
print(tokens_info)

print("\nDetokenized:")
for i in range(3):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

Tokenized:
{'input_ids': tensor([[  101,  5355,  1010,  1045,  2572,  2115,  2269,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  2166,  2003,  2054,  6433,  2043,  2017,  1005,  2128,  5697,
          2437,  2060,  3488,  1012,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  1045,  2031,  2053,  2801,  2054,  1052,  2638,  2819, 17175,
         11314,  6444,  2594,  7352, 26461, 27572, 11261,  6767, 15472,  6761,
          8663, 10735,  2483,  2003,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1,

You can see that some special tokens appeared besides our original text. They usually carry additional information for the model, so the model treats them differently from ordinary tokens.

You can list all special tokens used by the tokenizer (you can also add your own special tokens, but make sure the model sees them during training):

In [18]:
tokenizer._special_tokens_map

{'bos_token': None,
 'eos_token': None,
 'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'additional_special_tokens': []}

In [19]:
tokenizer("First sentence", "Second sentence", return_token_type_ids=True)

{'input_ids': [101, 2034, 6251, 102, 2117, 6251, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
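
The layout of a sentence pair can be reproduced by hand. The sketch below assumes the ids seen in the output above (`101` for `[CLS]`, `102` for `[SEP]`); `build_pair` is a hypothetical helper, not part of the tokenizer API.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Lay out two segments as [CLS] A [SEP] B [SEP] with segment ids."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding, so every position is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids for "first sentence" and "second sentence" taken from the output above.
print(build_pair([2034, 6251], [2117, 6251]))
```

The `token_type_ids` are what let BERT tell the two segments apart.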

It's inefficient to put every possible word in the vocabulary, but we still want to handle any text sequence without putting `[UNK]` everywhere.

WordPiece tokenization is here to help!

In [20]:
reversed_vocab = {token_id: token for token, token_id in tokenizer.vocab.items()}

In [21]:
for token_id in tokens_info["input_ids"][2]:
    print(reversed_vocab[token_id.item()], end=' ')

[CLS] i have no idea what p ##ne ##um ##ono ##ult ##ram ##ic ##ros ##copic ##sil ##ico ##vo ##lc ##ano ##con ##ios ##is is . [SEP] 
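
The splitting above follows a greedy longest-match-first rule: take the longest vocabulary entry that matches the start of the word, then repeat on the remainder, marking continuation pieces with `##`. Here is a minimal sketch of that rule with a made-up three-entry vocabulary; the real tokenizer also handles normalization, punctuation and a maximum word length, which are omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```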

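Now you can feed the tokenized data to the model.

Depending on your task, you can use different parts of the output. For example, a `[CLS]`-based representation of the whole input is available under the `pooler_output` key: it is the final hidden state of the `[CLS]` token passed through an extra dense layer with tanh activation. Per-token hidden states are available under `last_hidden_state`.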

In [22]:
import torch

In [23]:
with torch.no_grad():
    out = model(**tokens_info)

print(out['pooler_output'])

tensor([[-0.8854, -0.4722, -0.9392,  ..., -0.8081, -0.6955,  0.8748],
        [-0.9297, -0.5161, -0.9334,  ..., -0.9017, -0.7492,  0.9201],
        [-0.6808, -0.1979, -0.7096,  ..., -0.6691, -0.4557,  0.7595]])


Transformers knowledge hub: https://huggingface.co/transformers/



---



### Visualizing BERT

Model interpretability is one of the key factors in understanding model behaviour.

Neural networks are harder to interpret than classic ML models, but it's not impossible!

Remember the attention mechanism? It's a human-understandable concept: look more closely at the tokens that matter most for the context of the current one.
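
As a reminder of what the visualizations display, here is a minimal sketch of scaled dot-product attention weights in plain Python. It is a simplification: real BERT first applies learned query and key projections and uses several heads per layer, while this toy uses the raw (made-up) token vectors as both queries and keys.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """weights[i][j] = softmax_j(q_i . k_j / sqrt(d)): how much token i looks at token j."""
    d = len(keys[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


# Three toy token vectors: the first two are similar, the third is different.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for row in attention_weights(vectors, vectors):
    print([round(w, 3) for w in row])
```

Each row sums to 1, and a line in the visualization corresponds to one entry of such a row.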

In [24]:
!pip install bertviz

In [25]:
import torch
from transformers import AutoTokenizer, AutoModel
from bertviz import model_view, head_view

input_text = "Every time I try to interpret BERT model behaviour, I find new interesting patterns"
# output_attentions=True makes the model return attention weights for every layer
model = AutoModel.from_pretrained("bert-base-cased", output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

inputs = tokenizer.encode(input_text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(inputs)
attention = outputs.attentions  # tuple with one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

model_view(attention, tokens)

In [26]:
head_view(attention, tokens)

Another BERT pretraining task is Next Sentence Prediction.

How do BERT's heads look at tokens in that case?

In [27]:
inputs = tokenizer.encode("I'm waiting for important call", "I can't go out right now", return_tensors="pt")
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

model_view(attention, tokens)

In [28]:
head_view(attention, tokens)

It looks interesting, doesn't it?

If you want to find out more about attention patterns, you can refer to a dedicated "field" of study - [BERTology](https://huggingface.co/docs/transformers/main/en/bertology).



---



### Tuning pretrained transformers (for your own task and 2 points)

An important benefit of big models is their ability to adapt to various tasks without spending a lot of time and resources on full training.

You may have heard about backbone models in other ML tasks, where they are tuned on task-specific data.

It's possible to tune the model's weights directly, but you can also freeze the model, use its outputs as features, and extract the necessary information with a much smaller neural network.

#### Introduction

Here's an example of a BERT base model tuned for the Named Entity Recognition (NER) task:

In [29]:
tokenizer = transformers.AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = transformers.AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [30]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

As you can see, there's an additional classifier on top of the original BERT. That layer predicts a NER class for each of BERT's token outputs.

BERT is well suited to tuning for different tasks, since it outputs per-token embeddings as well as an embedding of the whole input at the `[CLS]` token.

#### Data preparation

In [31]:
import datasets

In [32]:
dataset = datasets.load_dataset("lhoestq/conll2003")

In [33]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [34]:
dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

Since BERT tokenization differs from the dataset's tokenization, we need to reconcile the two.

**Task 3 (0.5 points)**
- Align dataset token labels to WordPiece tokens
- Handle special tokens as well

In [37]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification
import datasets

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(samples):
    tokenized_inputs = tokenizer(
        samples["tokens"],
        truncation=True,
        is_split_into_words=True
    )

    labels = []
    for i, original_labels in enumerate(samples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []

        previous_word_idx = None
        for word_idx in word_ids:
            # Special tokens get label -100 (ignored in loss)
            if word_idx is None:
                aligned_labels.append(-100)
            # First token of a word gets the original label
            elif word_idx != previous_word_idx:
                aligned_labels.append(original_labels[word_idx])
            # Subsequent tokens of the same word get -100 (or you could use the same label)
            else:
                aligned_labels.append(-100)

            previous_word_idx = word_idx

        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [38]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

In [39]:
tokenized_dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'input_ids': [101,
  7270,
  22961,
  1528,
  1840,
  1106,
  21423,
  1418,
  2495,
  12913,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]}
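
The `labels` printed above can be reproduced by hand. In the sketch below the `word_ids` list is inferred from the output rather than printed by the notebook: 9 words became 10 WordPiece tokens plus `[CLS]` and `[SEP]`, and the single `-100` inside the sentence suggests that the word `lamb` was split into two pieces. `align_labels` is a hypothetical helper mirroring the inner loop of `tokenize_and_align_labels`.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels onto subword tokens, labelling only the first piece."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token ([CLS], [SEP])
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(ignore_index)          # continuation piece
        previous = word_id
    return aligned


# Assumed word_ids for "EU rejects German call to boycott British lamb ."
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(align_labels(word_ids, ner_tags))
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
```

Positions marked `-100` are skipped by PyTorch's cross-entropy loss by default, so only one prediction per original word contributes to training and evaluation.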

In [40]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Now the dataset is ready to be used by BERT.

#### Model preparation

For our task we can use `AutoModelForTokenClassification`, which already provides the required architecture with a token classification head (the classifier layer itself and per-class outputs).

You can also handle these things yourself: create a PyTorch model class, initialize a BERT model and a Linear layer for classification, then override the forward method, and so on.

`AutoModelForTokenClassification` is chosen for the sake of simplicity, but an ML engineer should still be able to build this by hand.
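
For reference, a hand-written version might look roughly like the sketch below. It is an illustration under assumptions, not the actual `BertForTokenClassification` code: the class names and the `freeze_backbone` flag are made up, and a tiny `DummyBackbone` stands in for BERT so that the shapes can be checked without downloading weights. The only requirement on the backbone is that it returns an object with a `last_hidden_state` field, as `AutoModel` does.

```python
from types import SimpleNamespace

import torch
from torch import nn


class TokenClassifier(nn.Module):
    """A backbone encoder followed by dropout and a per-token linear classifier."""

    def __init__(self, backbone, hidden_size, num_labels, dropout=0.1, freeze_backbone=False):
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)
        # Positions labelled -100 (special tokens, word continuations) are skipped.
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        if freeze_backbone:
            # Train only the small head and keep the pretrained weights fixed.
            for param in self.backbone.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask=None, labels=None):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)
        loss = None
        if labels is not None:
            loss = self.loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        return loss, logits


class DummyBackbone(nn.Module):
    """Stand-in for BERT used only to check shapes; not a real encoder."""

    def __init__(self, vocab_size=100, hidden_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


toy_model = TokenClassifier(DummyBackbone(), hidden_size=8, num_labels=9, freeze_backbone=True)
input_ids = torch.tensor([[1, 5, 7, 2]])
labels = torch.tensor([[-100, 3, 0, -100]])
loss, logits = toy_model(input_ids, labels=labels)
print(logits.shape)  # torch.Size([1, 4, 9])
```

With a real encoder the same class could be built as `TokenClassifier(AutoModel.from_pretrained("bert-base-cased"), hidden_size=768, num_labels=9)` and trained in a plain PyTorch loop.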

In [41]:
from transformers import AutoModelForTokenClassification

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
label2id = {label: id for id, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=9,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Evaluation

Evaluation is crucial when writing papers or reporting your results. It can be tricky, and a home-grown implementation can be buggy, so it is usually preferable to compute metrics with an established library.

In [42]:
!pip install seqeval

Let's prepare a `compute_metrics` function for the training loop that follows:

In [43]:
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    y_true = []
    y_pred = []
    for i in range(len(predictions)):
        y_true_sample = []
        y_pred_sample = []
        for j in range(len(predictions[i])):
            if labels[i][j] == -100:
                continue

            y_true_sample.append(id2label[int(labels[i][j])])
            y_pred_sample.append(id2label[int(predictions[i][j])])

        y_true.append(y_true_sample)
        y_pred.append(y_pred_sample)

    return {
        "precision": precision_score(y_true, y_pred, mode="strict", scheme=IOB2),
        "recall": recall_score(y_true, y_pred, mode="strict", scheme=IOB2),
        "f1": f1_score(y_true, y_pred, mode="strict", scheme=IOB2),
    }
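
The metrics in `compute_metrics` are computed per entity, not per token: a predicted entity counts only if its type and both boundaries match exactly. The following is a simplified sketch of that idea, assuming well-formed IOB2 tags; it is not seqeval's implementation, and `extract_spans` and `entity_f1` are hypothetical helpers.

```python
def extract_spans(tags):
    """Collect (type, start, end) spans from an IOB2 tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [["B-ORG", "O", "B-MISC", "I-MISC", "O"]]
y_pred = [["B-ORG", "O", "B-MISC", "O", "O"]]  # second entity is cut short
print(entity_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

In this example 4 of 5 tokens are tagged correctly, yet only one of the two entities is recovered, which is why entity-level F1 is stricter than token accuracy.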

#### Training

**Task 4 (0.5 points)**
- Choose proper hyperparameters for tuning the model
- Set up the HF Trainer
- Check correctness using the training results


In [46]:
from transformers import TrainingArguments, Trainer, AutoModelForTokenClassification
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score
from seqeval.scheme import IOB2

# Model setup
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
label2id = {label: id for id, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=9,
    id2label=id2label,
    label2id=label2id,
)

# Training arguments with proper hyperparameters
training_args = TrainingArguments(
    output_dir="./bert-ner",
    eval_strategy="steps",
    eval_steps=50,
    logging_steps=50,
    logging_dir="./logs",
    report_to="none",
    learning_rate=2e-5,  # Common learning rate for BERT fine-tuning
    num_train_epochs=3,  # 3 epochs is standard for fine-tuning
    per_device_train_batch_size=16,  # Adjust based on GPU memory
    per_device_eval_batch_size=16,
    weight_decay=0.01,  # Decoupled weight decay applied by the AdamW optimizer
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # built in the Data preparation step
    eval_dataset=tokenized_dataset["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


Let's check metrics before training:

In [48]:
results = trainer.evaluate(tokenized_dataset["test"])
print(results)

{'eval_loss': 2.252594232559204, 'eval_model_preparation_time': 0.0031, 'eval_precision': 0.0122184954480115, 'eval_recall': 0.009029745042492918, 'eval_f1': 0.010384850335980453, 'eval_runtime': 10.4163, 'eval_samples_per_second': 331.499, 'eval_steps_per_second': 20.737}


In [49]:
trainer.train()

| Step | Training Loss | Validation Loss | Model Preparation Time | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 50 | 0.7359 | 0.344187 | 0.0031 | 0.493564 | 0.509761 | 0.501532 |
| 100 | 0.216 | 0.158478 | 0.0031 | 0.757306 | 0.758835 | 0.75807 |
| 150 | 0.1212 | 0.110409 | 0.0031 | 0.876891 | 0.858297 | 0.867494 |
| 200 | 0.1104 | 0.091726 | 0.0031 | 0.867893 | 0.881185 | 0.874489 |
| 250 | 0.0979 | 0.076731 | 0.0031 | 0.910338 | 0.893639 | 0.901911 |
| 300 | 0.081 | 0.071764 | 0.0031 | 0.904859 | 0.896331 | 0.900575 |
| 350 | 0.0662 | 0.068751 | 0.0031 | 0.898732 | 0.906597 | 0.902647 |
| 400 | 0.0652 | 0.063652 | 0.0031 | 0.914026 | 0.912487 | 0.913256 |
| 450 | 0.0707 | 0.057188 | 0.0031 | 0.93078 | 0.911982 | 0.921285 |
| 500 | 0.0649 | 0.051807 | 0.0031 | 0.932299 | 0.920061 | 0.926139 |


TrainOutput(global_step=2634, training_loss=0.05198077949141587, metrics={'train_runtime': 1143.3971, 'train_samples_per_second': 36.84, 'train_steps_per_second': 2.304, 'total_flos': 1050534559887048.0, 'train_loss': 0.05198077949141587, 'epoch': 3.0})

In [50]:
results = trainer.evaluate(tokenized_dataset["test"])
print(results)

{'eval_loss': 0.11831282079219818, 'eval_model_preparation_time': 0.0031, 'eval_precision': 0.9118733509234829, 'eval_recall': 0.9178470254957507, 'eval_f1': 0.9148504367775523, 'eval_runtime': 9.7309, 'eval_samples_per_second': 354.848, 'eval_steps_per_second': 22.197, 'epoch': 3.0}


Compare test metrics before and after training. Did we succeed?

**Task 5 (1 point)**
- Compare our model's result with `dslim/bert-base-NER`
- Try to improve our model's quality. Choose any option:
  - Play with training hyperparameters (batch_size, lr, epochs, etc.)
  - Apply some training techniques (warm-up, lr-scheduling, etc.)
  - Perform error analysis and find model's weak spots (this option doesn't require fixing them)
  - Your very own idea
- Write a small report (up to 5 steps, results and conclusions) on the work done in the "Tuning pretrained transformers" part
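
For the warm-up and lr-scheduling option, it helps to see what such a schedule looks like. The sketch below is a plain-Python illustration of linear warm-up followed by linear decay; the function name and the 10% warm-up share are assumptions chosen for the example. The step count is derived from the run above (14041 training examples, batch size 16, 3 epochs, assuming a single device) and matches the reported `global_step=2634`.

```python
import math


def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate that rises linearly to base_lr, then decays linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


steps_per_epoch = math.ceil(14041 / 16)  # the last batch of an epoch is smaller
total_steps = steps_per_epoch * 3        # 3 epochs
warmup_steps = int(0.1 * total_steps)    # assumed 10% warm-up

print(total_steps)  # 2634
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_decay(step, total_steps, warmup_steps))
```

In `TrainingArguments`, arguments such as `warmup_steps` and `lr_scheduler_type` control this behaviour, so there is no need to implement the schedule by hand.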