## Summary

This notebook demonstrates the full pipeline for training and evaluating transformer-based NER models on Swedish medical text. The experiments provide insight into how different design choices affect token-level sequence labeling performance.


In [None]:
import numpy as np
from datasets import load_dataset, DatasetDict, concatenate_datasets
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import sys, torch, transformers, datasets
print("Python:", sys.version)
print("Executable:", sys.executable)
print("CUDA:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
print("Transformers:", transformers.__version__)
print("Datasets:", datasets.__version__)




Python: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Executable: c:\Users\oskar\Desktop\Selected Methods\Selected-Methods---Assignment3-main\Selected-Methods---Assignment3-main\venv\Scripts\python.exe
CUDA: True
GPU: NVIDIA GeForce RTX 4080 SUPER
Transformers: 4.57.3
Datasets: 4.4.1
<class 'transformers.training_args.TrainingArguments'>


## Dataset Loading

The sentiment classification dataset, `mteb/SwedishSentimentClassification`, is loaded from the Hugging Face Hub with the `datasets` library. It consists of Swedish text samples annotated with binary sentiment labels: 1 for positive sentiment and 0 for negative sentiment.

The original dataset includes predefined training, validation, and test splits, which are merged and re-split to follow the assignment instructions.


In [2]:
ds = load_dataset("mteb/SwedishSentimentClassification")



## Data Merging and Splitting

To ensure consistency with the experimental setup used in the NER task, the original training, validation, and test splits are merged into a single dataset. This combined dataset is then split as follows:

- 80% of the data is used for training and validation
- 20% is held out as a test set and used only for final evaluation
- The training–validation portion is further split into 90% training data and 10% validation data, i.e. roughly 72% and 8% of the full dataset

This ensures that the test set remains isolated throughout model development.
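The resulting split sizes can be checked by hand. The sketch below is a minimal illustration, assuming the held-out part of each split is rounded up to a whole sample (an assumption that agrees with the row counts this notebook reports for the merged dataset of 66185 samples). `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
def split_sizes(n_total, held_out_percent):
    """Return (n_kept, n_held_out), rounding the held-out part up (assumption)."""
    n_held_out = -(-n_total * held_out_percent // 100)  # integer ceiling division
    return n_total - n_held_out, n_held_out

n_total = 66185  # size of the merged dataset
n_trainval, n_test = split_sizes(n_total, 20)   # 80/20 split
n_train, n_val = split_sizes(n_trainval, 10)    # 90/10 split of the 80% part

print(n_train, n_val, n_test)  # 47653 5295 13237
for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_total:.1%} of the full dataset")
```

Relative to the full dataset, the two-stage split works out to approximately 72% training, 8% validation, and 20% test data.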


In [3]:
full_dataset = concatenate_datasets([
    ds["train"],
    ds["validation"],
    ds["test"]
])

len(full_dataset)


66185

In [4]:
split_80_20 = full_dataset.train_test_split(
    test_size=0.2,
    seed=42
)

trainval_ds = split_80_20["train"]
test_ds = split_80_20["test"]

split_90_10 = trainval_ds.train_test_split(
    test_size=0.1,
    seed=42
)

train_ds = split_90_10["train"]
val_ds = split_90_10["test"]

dataset = DatasetDict({
    "train": train_ds,
    "validation": val_ds,
    "test": test_ds
})

dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 47653
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5295
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 13237
    })
})

In [5]:
dataset["train"][0]

{'text': 'God mat. Trevligt ställe. Jag gillar stället jättemycket. Speciellt sommarhalvåret då det går att sitta ute.',
 'label': 1}

In [6]:
models = [
    "KB/bert-base-swedish-cased",
    "google-bert/bert-base-multilingual-cased"
]

## Text Preprocessing and Tokenization

Each text sample is tokenized with the subword tokenizer that belongs to the pretrained transformer model. Sequences longer than 64 tokens are truncated and shorter ones are padded, so that samples can be processed in batches.

Since sentiment classification is a text-level task, each input sequence is associated with a single sentiment label rather than token-level labels.
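To make the padding and truncation step concrete, here is a minimal pure-Python sketch of what happens to the token ids. The helper `pad_or_truncate`, the pad id `0`, and the example ids are illustrative assumptions; the real tokenizer additionally takes care of special tokens and uses its own padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop the overflow
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)           # 0 if the sequence was long enough
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

The attention mask lets the model ignore the padded positions, so padding changes the tensor shape without changing which tokens contribute to the prediction.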


In [7]:
def tokenize_function(examples, tokenizer):
    # Truncate long texts and pad short ones so every sequence is exactly 64 tokens
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=64
    )

In [8]:
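def compute_metrics(eval_pred):
    # eval_pred holds the raw model outputs (logits) and the reference labels
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = index of the highest logit

    # Weighted averaging: per-class scores are weighted by the number of true samples
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    acc = accuracy_score(labels, preds)

    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }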
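
For intuition about `average="weighted"`, the sketch below recomputes a weighted F1 score by hand on a toy example: the F1 score of each class is weighted by the number of true samples of that class. The function name `weighted_f1` and the toy labels are illustrative assumptions and not part of the notebook's pipeline.

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Average the per-class F1 scores, weighted by class support."""
    support = Counter(labels)  # number of true samples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(labels)

toy_labels = [0, 0, 0, 1, 1]
toy_preds = [0, 0, 1, 1, 1]
score = weighted_f1(toy_labels, toy_preds)
print(round(score, 4))  # 0.8
```

Here both classes reach an F1 of 0.8, so the weighted average is 0.8 as well; with imbalanced classes the larger class dominates the score.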


In [None]:
def train_text_classifier(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    encoded_ds = dataset.map(
        lambda x: tokenize_function(x, tokenizer),
        batched=True
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2
    )

    args = TrainingArguments(
        output_dir=f"./sentiment_{model_name.replace('/', '_')}",
        eval_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=64,
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        logging_steps=50,
        fp16=True
    )

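    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded_ds["train"],
        eval_dataset=encoded_ds["validation"],
        processing_class=tokenizer,  # `tokenizer=` is deprecated in recent Transformers releases
        compute_metrics=compute_metrics
    )

    trainer.train()
    # The best checkpoint (by validation F1) is reloaded before the final test evaluation
    test_metrics = trainer.evaluate(encoded_ds["test"])
    return test_metrics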




In [10]:
results = {}

for model in models:
    print(f"\nTraining sentiment model: {model}")
    metrics = train_text_classifier(model)
    results[model] = metrics


Training sentiment model: KB/bert-base-swedish-cased


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

KeyboardInterrupt: 

In [None]:
for model, metrics in results.items():
    print(f"\nModel: {model}")
    for k, v in metrics.items():
        print(f"{k}: {v:.4f}")