<a href="https://colab.research.google.com/github/Jiyang-Liu0/NLP/blob/main/hw4_bert_pos_skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune BERT-based models from Hugging Face on POS-tagging for English and Norwegian

This notebook will guide you through Part 2 of [CS 2731 Homework 4](https://michaelmilleryoder.github.io/cs2731_fall2024/hw4).

Please copy this notebook and name it `{pitt email id}_hw4_bert_pos.ipynb`.

Code for loading and preprocessing the data is provided. You will provide code for training and evaluation using Hugging Face Trainer or PyTorch.

Run all the cells starting from the top, filling in any sections that need to be filled in. Spots you need to fill in are specified.

You will want to duplicate cells in each section for each language (English or Norwegian) or create separate sections in the notebook for separate languages.

**Note**: Please run on a GPU by going to Runtime > Change Runtime Type > T4 GPU.

The Hugging Face tutorials below are informative. You can use code from them and adapt it to this use case.
* [Token classification (sequence labeling) with Hugging Face](https://huggingface.co/docs/transformers/en/tasks/token_classification)
* [Hugging Face `Trainer` class tutorial](https://huggingface.co/docs/transformers/en/training#train)

# Load required packages

In [1]:
!pip install datasets accelerate conllu

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

# Load data

Here you will be loading the training, dev, and test datasets of English and Norwegian text annotated with POS tags. The data are from the [Universal Dependencies](https://universaldependencies.org/) project.

The dataset subsets to use (fill in below for `subset`) are:
* English: `en_ewt`
* Norwegian: `no_bokmaal`

We will be using the universal part-of-speech tags in the `upos` column, not the tags in the `xpos` column.

Note: Norwegian has two written forms, [Bokmål and Nynorsk](https://en.wikipedia.org/wiki/Norwegian_language). This data is in the Bokmål written form.

Here are a few links to learn more about the data:
* [Universal Dependencies data format](https://universaldependencies.org/format.html)
* [Hugging Face `universal_dependencies` dataset page](https://huggingface.co/datasets/universal_dependencies)

In [26]:
from datasets import load_dataset

# FILL IN
# subset =  # string subset name: "en_ewt" for English, "no_bokmaal" for Norwegian

# subset = "en_ewt"
subset = "no_bokmaal"

data = load_dataset('universal_dependencies', subset, trust_remote_code=True)
data

DatasetDict({
    train: Dataset({
        features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
        num_rows: 15696
    })
    validation: Dataset({
        features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
        num_rows: 2409
    })
    test: Dataset({
        features: ['idx', 'text', 'tokens', 'lemmas', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc'],
        num_rows: 1939
    })
})

In [27]:
# Take a look at the part of speech tags

tags = data['train'].features['upos'].feature
tags

ClassLabel(names=['NOUN', 'PUNCT', 'ADP', 'NUM', 'SYM', 'SCONJ', 'ADJ', 'PART', 'DET', 'CCONJ', 'PROPN', 'PRON', 'X', '_', 'ADV', 'INTJ', 'VERB', 'AUX'], id=None)
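
As a rough illustration of what the integer tag IDs stand for, the plain-Python sketch below mirrors the ID-to-name lookup using the tag names printed above. `TAG_NAMES` and `ids_to_names` are names made up for this example, and the ID sequence is only illustrative; the notebook itself works with the `ClassLabel` object.

```python
# Tag names in ID order, copied from the ClassLabel output above.
TAG_NAMES = ['NOUN', 'PUNCT', 'ADP', 'NUM', 'SYM', 'SCONJ', 'ADJ', 'PART', 'DET',
             'CCONJ', 'PROPN', 'PRON', 'X', '_', 'ADV', 'INTJ', 'VERB', 'AUX']


def ids_to_names(tag_ids):
    """Map a list of integer tag IDs to their tag names (hypothetical helper)."""
    return [TAG_NAMES[i] for i in tag_ids]


# An illustrative sequence of tag IDs for a five-word sentence.
print(ids_to_names([0, 9, 0, 5, 0]))
# ['NOUN', 'CCONJ', 'NOUN', 'SCONJ', 'NOUN']
```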

In [28]:
# Create a column called `upos_str` with the names, not the IDs, of POS tags

def create_tag_names(batch):
  tag_name = {'upos_str': [tags.int2str(idx) for idx in batch['upos']]}
  return tag_name

data = data.map(create_tag_names)

# Tokenization
Fill in code in this section to prepare the input with subword tokenization for BERT. You can follow the process in the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).

This is also where you decide which BERT-based pre-trained model to fine-tune, since the tokenizer must match the model.
Feel free to search Hugging Face for BERT variants or to use the ones recommended in the Hugging Face documentation. For Norwegian, you'll want a pretrained BERT model that can handle Norwegian (in the Bokmål written form).

In [29]:
from transformers import AutoTokenizer

# pretrained_model = "bert-base-cased"

pretrained_model = "NbAiLab/nb-bert-base"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)


In [45]:
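# text = "This is a sample sentence for tokenization."
# Note: this sample sentence is Swedish rather than Norwegian Bokmål; it is only used to inspect the tokenizer output.
text = "Detta är ett exempel på mening för tokenisering."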

tokenized_output = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
print(tokenized_output)


{'input_ids': tensor([[  101, 35212, 10137, 10664, 34825, 10217, 87927, 10847, 18436, 62222,
         19232,   119,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


Subword tokenization will add special tokens such as `[CLS]` which we want the classifier to ignore.

It also splits some words into multiple tokens. We'll have to re-align those to assign just one part-of-speech tag to each word.

Fill in code here to do this alignment, as well as prepare a tokenized version of the dataset. You may adapt code from the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).
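
The alignment idea can be sketched without any library: given the word index of each subword token (`None` for special tokens), the first subword of a word keeps the word's tag and every other position gets `-100`, the label value that the loss skips by default. This is a minimal sketch, not the required solution; `align_labels` and the example split are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def align_labels(word_ids, word_labels):
    """Give each subword token a label: the word's tag for its first subword,
    IGNORE_INDEX for special tokens and for later subwords of the same word."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token such as [CLS]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a new word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Hypothetical tokenization: three words, the last one split into three subwords.
word_ids = [None, 0, 1, 2, 2, 2, None]
word_labels = [8, 0, 16]
print(align_labels(word_ids, word_labels))
# [-100, 8, 0, 16, -100, -100, -100]
```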

In [46]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

def tokenize_and_align_labels(examples):
    # The sentences are already split into words, so tokenize word by word.
    tokenized_inputs = tokenizer(
        examples["tokens"],
        padding=True,  # pads every sentence to the longest one in the batch
        truncation=True,
        is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["upos"]):
        # word_ids maps each subword token to the index of its source word
        # (None for special and padding tokens).
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special or padding token: ignored by the loss
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of a word carries the word's POS tag
                aligned_labels.append(label[word_idx])
            else:
                # Later subwords of the same word are ignored
                aligned_labels.append(-100)
            previous_word_idx = word_idx
        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = data.map(tokenize_and_align_labels, batched=True)




Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [47]:
print(tokenized_dataset["train"][0])

{'idx': '000001', 'text': 'Lam og piggvar på bryllupsmenyen', 'tokens': ['Lam', 'og', 'piggvar', 'på', 'bryllupsmenyen'], 'lemmas': ['lam', 'og', 'piggvar', 'på', 'bryllupsmeny'], 'upos': [0, 9, 0, 5, 0], 'xpos': [None, None, None, None, None], 'feats': ["{'Definite': 'Ind', 'Gender': 'Neut', 'Number': 'Sing'}", 'None', "{'Definite': 'Ind', 'Gender': 'Masc', 'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'Gender': 'Masc', 'Number': 'Sing'}"], 'head': ['0', '3', '1', '5', '1'], 'deprel': ['root', 'cc', 'conj', 'mark', 'xcomp'], 'deps': ['None', 'None', 'None', 'None', 'None'], 'misc': ['None', 'None', 'None', 'None', 'None'], 'upos_str': ['NOUN', 'CCONJ', 'NOUN', 'SCONJ', 'NOUN'], 'input_ids': [101, 44068, 10156, 24109, 21127, 16648, 10217, 33989, 27652, 11435, 13221, 11418, 18130, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
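
The long run of zeros at the end of `input_ids` above is padding. The sketch below shows, in plain Python, what padding a batch to its longest sequence involves. It assumes the pad token ID is 0 (consistent with the zeros in the output) and pads labels with `-100` so padded positions do not count toward the loss; `pad_batch` and the toy IDs are hypothetical.

```python
PAD_TOKEN_ID = 0     # assumed ID of the padding token, matching the zeros above
LABEL_PAD_ID = -100  # padded label positions are ignored by the loss


def pad_batch(batch_input_ids, batch_labels):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids, labels in zip(batch_input_ids, batch_labels):
        n_pad = max_len - len(ids)
        padded["input_ids"].append(ids + [PAD_TOKEN_ID] * n_pad)
        # 1 marks a real token, 0 marks padding
        padded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        padded["labels"].append(labels + [LABEL_PAD_ID] * n_pad)
    return padded


# Toy batch with made-up token IDs: one 5-token and one 3-token sequence.
batch = pad_batch(
    [[101, 7, 8, 9, 102], [101, 7, 102]],
    [[-100, 0, 9, 0, -100], [-100, 0, -100]],
)
print(batch["input_ids"][1])       # [101, 7, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(batch["labels"][1])          # [-100, 0, -100, -100, -100]
```

Padding per batch at training time, rather than once over the whole dataset, keeps sequences only as long as each batch needs.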

# Prepare evaluation

Evaluation code is provided here.

Source: [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification)

In [34]:
!pip install seqeval
!pip install evaluate

import evaluate
seqeval = evaluate.load('seqeval')



In [35]:
import numpy as np

label_list = data['train'].features['upos'].feature.names
labels = data['train'][0]['upos']
labels = [label_list[i] for i in labels]

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
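
To make the role of the `-100` filter concrete, here is a small plain-Python sketch of token-level accuracy that skips ignored positions. It is an illustration under simplified assumptions (predictions are already tag IDs rather than logits), not the `seqeval` implementation; `token_accuracy` and the toy data are hypothetical.

```python
def token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of scored positions where the predicted tag ID equals the gold tag ID.
    Positions whose gold label equals ignore_index are skipped."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predicted_ids, label_ids):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # special token, continuation subword, or padding
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


# Toy batch: four scored positions, three of them predicted correctly.
preds = [[3, 8, 0, 0, 1], [5, 16, 1, 1, 1]]
golds = [[-100, 8, 0, -100, -100], [-100, 16, 2, -100, -100]]
print(token_accuracy(preds, golds))  # 0.75
```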

# Train (fine-tune) the model

Fill in code here to load your pretrained model and do fine-tuning using the `Trainer` class or PyTorch.

In [36]:
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments
import evaluate
import numpy as np
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
label_list = data['train'].features['upos'].feature.names
num_labels = len(label_list)

model = AutoModelForTokenClassification.from_pretrained(pretrained_model, num_labels=num_labels)

seqeval = evaluate.load('seqeval')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir="./results_no",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to=[],
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


Some weights of BertForTokenClassification were not initialized from the model checkpoint at NbAiLab/nb-bert-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.2551 | 0.035012 | 0.988111 | 0.988197 | 0.988154 | 0.989772 |
| 2 | 0.0311 | 0.029955 | 0.989793 | 0.98985 | 0.989821 | 0.991174 |
| 3 | 0.0198 | 0.028809 | 0.990983 | 0.991242 | 0.991113 | 0.992329 |




TrainOutput(global_step=2943, training_loss=0.06811347803290183, metrics={'train_runtime': 1633.6371, 'train_samples_per_second': 28.824, 'train_steps_per_second': 1.802, 'total_flos': 3649463554980672.0, 'train_loss': 0.06811347803290183, 'epoch': 3.0})
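
The `global_step` value in this output can be checked with a little arithmetic: 15696 training examples in batches of 16 give 981 steps per epoch, and three epochs give 2943 steps. The sketch below assumes a single device, no gradient accumulation, and that a final partial batch counts as a step; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a full run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Norwegian training set size, batch size, and epoch count from the cells above.
print(total_training_steps(15696, 16, 3))  # 2943
```

Checkpoint directories are named after the step at which they were saved, so this number also indicates which checkpoint corresponds to the end of training.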

# Test performance

Fill in code here to evaluate your fine-tuned model's performance on the test set of the tokenized dataset.

You will be reporting accuracy in your report.

In [37]:
test_results = trainer.evaluate(tokenized_dataset["test"])

print("Test Accuracy:", test_results["eval_accuracy"])




Test Accuracy: 0.9886538076486685


# Run on an example sentence

Fill in code here to run your classifier on an example sentence of your choice for both English and Norwegian models. You will likely have to load these models from checkpoints created during training.

You will provide the predicted tags for example sentences in your report.
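
Checkpoints are saved in subdirectories named `checkpoint-<step>` inside the training output directory. As a minimal sketch, the hypothetical helper below picks the subdirectory with the highest step number, which can then be passed to `from_pretrained`. It uses only the standard library and is demonstrated on a temporary directory with made-up step numbers.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subdirectory with the highest step,
    or None if the directory contains no checkpoints."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on a temporary directory with made-up checkpoint steps.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 2943):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(latest_checkpoint(tmp)))  # checkpoint-2943
```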

In [50]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import numpy as np

model_path = "./results/checkpoint-2352"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

label_list = data['train'].features['upos'].feature.names
model.config.id2label = {i: label for i, label in enumerate(label_list)}

example_sentence_en = "An apple a day keeps the doctor away."
# example_sentence_no = "Et eple om dagen holder legen borte."

def predict_tags(sentence, tokenizer, model):
    # Tokenize the raw sentence; tags are predicted per subword token, not per word
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the highest-scoring tag ID for every token
    predictions = outputs.logits.argmax(dim=2)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    predicted_tags = [model.config.id2label[p.item()] for p in predictions[0]]

    # Drop special tokens such as [CLS] and [SEP]
    filtered_tokens_tags = [(token, tag) for token, tag in zip(tokens, predicted_tags) if token not in tokenizer.all_special_tokens]

    return filtered_tokens_tags

print("English Sentence Prediction:")
print(predict_tags(example_sentence_en, tokenizer, model))

# print("Norwegian Sentence Prediction:")
# print(predict_tags(example_sentence_no, tokenizer, model))


English Sentence Prediction:
[('An', 'DET'), ('apple', 'NOUN'), ('a', 'DET'), ('day', 'NOUN'), ('keeps', 'VERB'), ('the', 'DET'), ('doctor', 'NOUN'), ('away', 'ADV'), ('.', 'PUNCT')]
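
In this example every word happens to be a single token, but the prediction function returns one `(token, tag)` pair per subword token, so longer or rarer words may appear as several pieces. The sketch below merges such pieces back into words and keeps the tag of the first piece, matching how labels were aligned for training. It assumes the WordPiece convention of marking continuation pieces with a `##` prefix; `merge_wordpieces` and the example split are hypothetical.

```python
def merge_wordpieces(token_tag_pairs, prefix="##"):
    """Merge continuation pieces (marked with `prefix`) into the preceding word,
    keeping the tag predicted for the word's first piece."""
    words = []
    for token, tag in token_tag_pairs:
        if token.startswith(prefix) and words:
            word, first_tag = words[-1]
            words[-1] = (word + token[len(prefix):], first_tag)
        else:
            words.append((token, tag))
    return words


# Hypothetical subword-level output for a short sentence.
pairs = [("token", "NOUN"), ("##ization", "NOUN"), ("works", "VERB"), (".", "PUNCT")]
print(merge_wordpieces(pairs))
# [('tokenization', 'NOUN'), ('works', 'VERB'), ('.', 'PUNCT')]
```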
