# Fine-tune BERT-based models from Hugging Face on POS-tagging for English and Norwegian

This notebook will guide you through Part 2 of [CS 2731 Homework 4](https://michaelmilleryoder.github.io/cs2731_fall2024/hw4).

Please copy this notebook and name it `{pitt email id}_hw4_bert_pos.ipynb`.

Code for loading and preprocessing the data is provided. You will provide code for training and evaluation using Hugging Face Trainer or PyTorch.

Run all the cells starting from the top, filling in any sections that need to be filled in. Spots you need to fill in are specified.

You will want to duplicate cells in each section for each language (English or Norwegian) or create separate sections in the notebook for separate languages.

**Note**: Please run on GPU by going to Runtime > Change Runtime Type > T4 GPU

The Hugging Face tutorials below are informative. You can adapt code from them to this use case.
* [Token classification (sequence labeling) with Hugging Face](https://huggingface.co/docs/transformers/en/tasks/token_classification)
* [Hugging Face `Trainer` class tutorial](https://huggingface.co/docs/transformers/en/training#train)

# Load required packages

In [None]:
!pip install datasets accelerate conllu

# Load data

Here you will be loading the training, dev, and test datasets of English and Norwegian text annotated with POS tags. The data are from the [Universal Dependencies](https://universaldependencies.org/) project.

The dataset subsets to use (fill in `subset` below) are:
* English: `en_ewt`
* Norwegian: `no_bokmaal`

We will be using the universal part-of-speech tags in the `upos` column, not the tags in the `xpos` column.

Note: There are two written forms of Norwegian, [Bokmål and Nynorsk](https://en.wikipedia.org/wiki/Norwegian_language). This data is in the Bokmål written form.

Here are a few links to learn more about the data:
* [Universal Dependencies data format](https://universaldependencies.org/format.html)
* [Hugging Face `universal_dependencies` dataset page](https://huggingface.co/datasets/universal_dependencies)
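To make the difference between the `upos` and `xpos` columns concrete, here is a minimal sketch of reading a CoNLL-U snippet by hand. The sentence in `TOY_CONLLU` is a hand-written toy example, not taken from the dataset, and `read_forms_and_upos` is a hypothetical helper name. In the notebook itself `load_dataset` does this parsing for you.

```python
# CoNLL-U has 10 tab-separated fields per token line:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
TOY_CONLLU = "\n".join([
    "# text = The dog barks.",
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_",
])

def read_forms_and_upos(conllu_text):
    """Return (FORM, UPOS) pairs for the ordinary token lines of one sentence."""
    pairs = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        token_id = fields[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        pairs.append((fields[1], fields[3]))  # FORM is field 2, UPOS is field 4
    return pairs

toy_pairs = read_forms_and_upos(TOY_CONLLU)
print(toy_pairs)
```

The fifth field (`DT`, `NN`, `VBZ` in the toy lines) is the language-specific `xpos` tag, which this assignment does not use.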

In [None]:
from datasets import load_dataset

# FILL IN
subset =  # string subset name: "en_ewt" for English, "no_bokmaal" for Norwegian

data = load_dataset('universal_dependencies', subset, trust_remote_code=True)
data

In [None]:
# Take a look at the part of speech tags

tags = data['train'].features['upos'].feature
tags

In [None]:
# Create a column called `upos_str` with the names, not the IDs, of POS tags

def create_tag_names(example):
  # `map` is called without batched=True, so this receives one sentence at a time
  return {'upos_str': [tags.int2str(idx) for idx in example['upos']]}

data = data.map(create_tag_names)

# Tokenization

Fill in code in this section to prepare the input with subword tokenization for BERT. You can follow the process in the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).

Here is also where you will decide on which BERT-based pre-trained model you will fine-tune, since you will need to match its tokenization.
Feel free to search Hugging Face for BERT variants or to use recommended ones in Hugging Face documentation. For Norwegian, you'll want a pretrained BERT model that can handle Norwegian (in Bokmål written form).

In [None]:
from transformers import AutoTokenizer

# FILL IN with the name of a BERT-based pretrained model from Hugging Face
pretrained_model =
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

Subword tokenization will add special tokens such as `[CLS]` which we want the classifier to ignore.

It also splits some words into multiple tokens. We'll have to re-align those to assign just one part-of-speech tag to each word.

Fill in code here to do this alignment, as well as prepare a tokenized version of the dataset. You may adapt code from the [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification).
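The core of the alignment can be sketched in plain Python, without a tokenizer. This assumes you already have the list returned by a fast tokenizer's `word_ids()` for one sentence, where special tokens map to `None`. The function name `align_labels_with_word_ids` and the toy inputs are hypothetical. Labeling only the first subword of each word is one common convention, the one used in the Hugging Face guide, and not the only possible choice.

```python
IGNORE_INDEX = -100  # label value that the loss and the evaluation code skip

def align_labels_with_word_ids(word_ids, word_labels):
    """Expand word-level labels to token level.

    Special tokens (word id None) and non-first subwords get IGNORE_INDEX,
    so each word contributes exactly one labeled position.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], padding
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word_id = word_id
    return aligned

# Toy input: 3 words, the second split into two subwords
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [5, 7, 9]
aligned_example = align_labels_with_word_ids(example_word_ids, example_labels)
print(aligned_example)
```

In a full solution, a function like this would typically be applied to every sentence inside the function passed to `data.map(...)`, with the result stored under a `labels` key.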

In [None]:
# FILL IN

# Prepare evaluation

Evaluation code is provided here.

Source: [Hugging Face token classification guide](https://huggingface.co/docs/transformers/en/tasks/token_classification)

In [None]:
!pip install seqeval
!pip install evaluate

import evaluate
seqeval = evaluate.load('seqeval')

In [None]:
import numpy as np

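# Tag names indexed by label ID
label_list = data['train'].features['upos'].feature.names
# Example: tag names for the first training sentence
labels = data['train'][0]['upos']
labels = [label_list[i] for i in labels]

def compute_metrics(p):
    # p is a (logits, label IDs) pair; positions labeled -100 are skipped below
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)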

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Train (fine-tune) the model

Fill in code here to load your pretrained model and do fine-tuning using the `Trainer` class or PyTorch.
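When loading a token classification model, it helps to give it the mapping between label IDs and tag names so that predictions can be read as tags later. Below is a minimal sketch of building those mappings. The helper `build_label_maps` and the shortened `toy_label_names` list are hypothetical, and in the notebook the real names would come from the dataset's `upos` feature.

```python
# Hypothetical, shortened tag list for illustration only
toy_label_names = ["NOUN", "VERB", "ADJ"]

def build_label_maps(label_names):
    """Return (id2label, label2id) dictionaries for a list of tag names."""
    id2label = {idx: name for idx, name in enumerate(label_names)}
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id

id2label, label2id = build_label_maps(toy_label_names)
print(id2label)
print(label2id)

# One possible way to use them (not executed here, since it downloads weights):
# model = AutoModelForTokenClassification.from_pretrained(
#     pretrained_model,
#     num_labels=len(id2label),
#     id2label=id2label,
#     label2id=label2id,
# )
```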

In [None]:
# FILL IN

# Test performance

Fill in code here to evaluate your fine-tuned model's performance on the test set of the tokenized dataset.

You will be reporting accuracy in your report.
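As a sanity check on the accuracy you report, it can help to see what ignoring `-100` positions means in practice. The sketch below is a simplified, pure-Python stand-in and not the provided `seqeval` code. The function name `masked_accuracy` and the toy logits are hypothetical.

```python
IGNORE_INDEX = -100

def masked_accuracy(logits, labels):
    """Token-level accuracy over positions whose label is not IGNORE_INDEX.

    logits: list of sentences, each a list of per-token score lists
    labels: list of sentences, each a list of label IDs
    """
    correct = 0
    total = 0
    for sentence_logits, sentence_labels in zip(logits, labels):
        for token_logits, label in zip(sentence_logits, sentence_labels):
            if label == IGNORE_INDEX:
                continue  # special token or non-first subword
            predicted = max(range(len(token_logits)), key=token_logits.__getitem__)
            correct += int(predicted == label)
            total += 1
    return correct / total if total else 0.0

# One toy sentence with 4 token positions and 3 possible tags
toy_logits = [[[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.3, 0.3, 0.4]]]
toy_labels = [[-100, 0, 1, -100]]
toy_accuracy = masked_accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```

Only the two middle positions are scored here, so subword splits and special tokens do not inflate or deflate the word-level accuracy.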

In [None]:
# FILL IN

# Run on an example sentence

Fill in code here to run your classifier on an example sentence of your choice for both English and Norwegian models. You will likely have to load these models from checkpoints created during training.

You will provide the predicted tags for example sentences in your report.
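After running the model on a sentence, the predictions are per subword token, so they need to be mapped back to words before you can report them. The sketch below shows one way to do that, assuming you have the tokenizer's `word_ids()` output and the argmax label ID for each token. The helper name `word_level_tags` and all toy values are hypothetical and do not come from a real model run.

```python
def word_level_tags(words, word_ids, token_pred_ids, id2label):
    """Keep the prediction of the first subword of each word.

    Returns a list of (word, tag name) pairs, skipping special tokens.
    """
    tagged = []
    seen_word_ids = set()
    for word_id, pred_id in zip(word_ids, token_pred_ids):
        if word_id is None or word_id in seen_word_ids:
            continue  # special token, or a later subword of an already-tagged word
        seen_word_ids.add(word_id)
        tagged.append((words[word_id], id2label[pred_id]))
    return tagged

# Toy values standing in for real tokenizer and model output
toy_words = ["Dogs", "barked", "loudly"]
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_pred_ids = [0, 0, 1, 1, 2, 0]
toy_id2label = {0: "NOUN", 1: "VERB", 2: "ADV"}

toy_tagged = word_level_tags(toy_words, toy_word_ids, toy_pred_ids, toy_id2label)
print(toy_tagged)
```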

In [None]:
# FILL IN