# SICSS 2025: Active Learning Workshop

> Diletta Goglia, Uppsala University InfoLab | <diletta.goglia@it.uu.se> | BlueSky: @dilettagoglia.bsky.social

Welcome to the AL workshop, part of the Computational Text Analysis day! In this notebook, you will learn how to automatically label a dataset, starting from a small set of human-labeled texts. In particular, you will:

* train and evaluate a RoBERTa model for multi-class text annotation (each sentence receives exactly one label)
* fine-tune the model
* predict (generate) annotations for unlabelled texts

## Corpus description

We will use the data collected by Manika Lamba and Hendrik Erz in their paper "[Thanking the World](https://www.sciencedirect.com/science/article/pii/S2543925124000287)". They collected ~1200 acknowledgment sections from theses, for a total of ~20k individual sentences. They manually annotated a random sample of 900 sentences according to the type of support they contain (academic, moral, financial, technical, religious, library, access to data, or other), and trained a RoBERTa-base transformer model to annotate the remaining data (yes, AL!). In this notebook, we will go through all the steps required for this final phase.

If you need help, please ask during the workshop or contact me via email :) have fun with AL!

# The Preliminaries

## Defining the Concept

The first step of the work consisted of manually annotating the training dataset according to the **support labels**. This step is not part of the workshop, but we will use its outputs:

* the support labels: to train the classifier to assign them
* the file with human annotations (*AL_gold_data.tsv*)

In [None]:
SUPP_LABELS = [
"Academic",
"Moral",
"Tech",
"Data",
"Library",
"Finance",
"Religious",
"Unknown"
]
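
The classifier does not see label names. It works with integer ids, and the position of a name in `SUPP_LABELS` serves as its id throughout this notebook. A minimal sketch of the two lookup tables (the names `label2id` and `id2label` are illustrative and are not used elsewhere in the notebook):

```python
SUPP_LABELS = [
    "Academic", "Moral", "Tech", "Data",
    "Library", "Finance", "Religious", "Unknown",
]

# Name -> integer id, following the order of the list
label2id = {name: idx for idx, name in enumerate(SUPP_LABELS)}
# Integer id -> name, the inverse lookup
id2label = {idx: name for name, idx in label2id.items()}

print(label2id["Tech"])  # 2
print(id2label[5])       # Finance
```

Because the id is just the list position, reordering `SUPP_LABELS` after training would silently change the meaning of every prediction.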

## Installing the necessary packages

Make sure to install packages according to how you have set up Python. If you use plain `pip`, here is how you can install them:

```bash
python -m pip install tqdm          # Used for progress bars
python -m pip install transformers  # To use the models
python -m pip install torch         # PyTorch for model handling
python -m pip install evaluate      # For validation metrics
python -m pip install matplotlib    # For plotting
python -m pip install seaborn       # Again, for plotting
```

## Import modules

In [None]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm 
from transformers import RobertaTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer 
import torch 
from torch import nn
import evaluate
import random
import matplotlib.pyplot as plt
import seaborn as sns

## Parameters setting

This phase is the initial setup step where you define key configuration values that control how your model will behave. This phase does not involve training or loading data yet! It’s just about defining your environment and behavior.

Many ML operations (e.g., shuffling and sampling) involve randomness. Setting a random seed makes these operations repeatable, so the code produces the same results on every run (up to small numerical differences in some GPU operations), which is essential for reproducibility.
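
To see what a seed does, the following sketch (standard library only; the toy list and the helper `shuffled_with_seed` are made up for illustration) shuffles the same list twice with the same seed and obtains the same order:

```python
import random

items = list(range(10))

def shuffled_with_seed(values, seed):
    """Return a shuffled copy of `values`, using a generator seeded with `seed`."""
    rng = random.Random(seed)
    copy = list(values)
    rng.shuffle(copy)
    return copy

first = shuffled_with_seed(items, 1989)
second = shuffled_with_seed(items, 1989)
print(first == second)  # True
```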

In [None]:
# Set seeds for reproducible results
seed = 1989
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.mps.manual_seed(seed)

In [None]:
# When running on CPU, this somehow makes sure training times do not degrade
# see: https://discuss.pytorch.org/t/training-time-gets-slower-and-slower-on-cpu/145483/3
torch.set_flush_denormal(True)

# Select the device
# If you have a MacBook with an Apple Silicon chip, you have "mps" available. On
# Windows or Linux, if you have an NVIDIA GPU, you have CUDA available.
# Otherwise, use the CPU.

device = torch.device("cpu") # Fallback: CPU

if torch.backends.mps.is_available():
  device = torch.device("mps")
elif torch.cuda.is_available():
  device = torch.device("cuda")

# Annotated Data Loading

Now we load the human-annotated dataset and we create both the training and the validation datasets for our model. Then we put the data into a format that our model can understand.

In [None]:
sentences: list[str] = list()
labels: list[int] = list()

In [None]:
def read_samples():
  """Reads in the gold data and yields tuples (sentence, label)"""
  with open("AL_gold_data.tsv", "r", encoding="utf-8") as fp:
    next(fp) # Skip header
    for line in fp:
      cols = line.strip().split("\t")
      sentence = cols[0]
      # Columns 1-8 hold one integer per support label; argmax picks the index
      # of the largest value (the first one on ties), i.e. a single label id
      label = int(np.argmax(np.asarray([int(x) for x in cols[1:9]])))
      yield (sentence, label)
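
To make the conversion concrete, here is a dependency-free sketch of the same step on two invented rows. It assumes the layout implied by the code above (a sentence followed by eight integer columns, in the order of `SUPP_LABELS`); the helper `parse_gold_line` and both example sentences are hypothetical:

```python
SUPP_LABELS = [
    "Academic", "Moral", "Tech", "Data",
    "Library", "Finance", "Religious", "Unknown",
]

def parse_gold_line(line):
    """Split one tab-separated line into (sentence, label id)."""
    cols = line.rstrip("\n").split("\t")
    flags = [int(x) for x in cols[1:9]]
    # Index of the largest value; on ties, max() returns the first such index
    label = max(range(len(flags)), key=flags.__getitem__)
    return cols[0], label

# Hypothetical rows, invented for illustration
single = "I thank my supervisor for her guidance.\t1\t0\t0\t0\t0\t0\t0\t0"
double = "My family and God gave me strength.\t0\t1\t0\t0\t0\t0\t1\t0"

print(SUPP_LABELS[parse_gold_line(single)[1]])  # Academic
print(SUPP_LABELS[parse_gold_line(double)[1]])  # Moral
```

The second row shows the consequence of this design: when a sentence carries two flags, only the first one survives, so the task becomes single-label classification.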

In [None]:
for sentence, label in read_samples():
    sentences.append(sentence)
    labels.append(label)

In [None]:
# We create a random train/valid split. The generator is seeded explicitly,
# because np.random.seed() does not affect np.random.default_rng().
rand = np.random.default_rng(seed)
train_idx: list[int] = rand.choice(len(sentences), size=round(len(sentences) * 0.8), replace=False).tolist()
valid_idx: list[int] = sorted(set(range(len(sentences))).difference(train_idx))
print(f"Datasets prepared! We are training with {len(train_idx)} training and {len(valid_idx)} validation samples.")
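
Whatever sampler is used, the two index sets should be disjoint and together cover the whole dataset. A sketch of that check, using the standard library instead of NumPy and assuming 900 gold sentences as stated in the corpus description (the actual file may differ):

```python
import random

n_samples = 900  # assumption: the 900 annotated sentences from the corpus description
rng = random.Random(1989)

# Draw 80% of the indices without replacement; the rest is the validation set
train = rng.sample(range(n_samples), k=round(n_samples * 0.8))
valid = sorted(set(range(n_samples)) - set(train))

assert set(train).isdisjoint(valid)          # no sentence is in both sets
assert len(train) + len(valid) == n_samples  # every sentence is in one of them
print(len(train), len(valid))  # 720 180
```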

In [None]:
# This class helps us organize inputs and labels into a format that PyTorch models understand.

class CustomDataset(torch.utils.data.Dataset):
  """Basically copied verbatim from https://huggingface.co/transformers/v3.5.1/custom_datasets.html"""
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: val[idx] for key, val in self.encodings.items()}
    item['labels'] = self.labels[idx]
    return item

  def __len__(self):
    return len(self.labels)

In [None]:
# NOTE: This will download the roberta-base tokenizer model to your device.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

In [None]:
# This function converts a list of raw text sentences and their labels into a PyTorch dataset

def sentences_to_data_loader(sentences: list[str], labels: list[int]):
  """Takes a list of sentences and a list of integer labels and constructs a dataset from them."""
  tok = tokenizer(sentences, padding="max_length", truncation=True, return_tensors='pt', return_attention_mask=True)
  return CustomDataset(tok, torch.tensor(labels))
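
The `padding` and `truncation` arguments make every sequence the same length, and the attention mask marks which positions hold real tokens. The toy function below illustrates that idea only. It is not the tokenizer's implementation, it ignores special tokens, and the token ids, `max_length=6`, and `pad_id=1` are made-up values:

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]                   # truncation: drop the overflow
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([0, 713, 2], max_length=6, pad_id=1)
long_ids, long_mask = pad_or_truncate(list(range(10, 20)), max_length=6, pad_id=1)

print(short_ids, short_mask)  # [0, 713, 2, 1, 1, 1] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [10, 11, 12, 13, 14, 15] [1, 1, 1, 1, 1, 1]
```

With `padding="max_length"` and no explicit `max_length`, sequences are padded to the model's maximum input length, which is simple but makes short sentences more expensive to process than they need to be.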


In [None]:
# Now create two datasets with this information:
support_train = sentences_to_data_loader([sentences[i] for i in train_idx], labels=[labels[i] for i in train_idx])
support_valid = sentences_to_data_loader([sentences[i] for i in valid_idx], labels=[labels[i] for i in valid_idx])

# Model Training Evaluation

We set up the metrics used to validate the performance of our model.

In [None]:
# How to determine the best model (ideally f1, otherwise loss works)
metric = 'f1'
is_greater_better = True

f1_metric = evaluate.load('f1')
acc_metric = evaluate.load('accuracy')

In [None]:
def compute_metrics_support(eval_pred):
    predictions, labels = eval_pred
    # The predictions are raw logits; the predicted class is the index of the
    # largest logit (a softmax would not change which index is largest)
    predictions = np.argmax(predictions, axis=-1)

    # Calculates one F1 per label that occurs in the references or predictions,
    # so we should have an array with up to 8 elements
    f1 = f1_metric.compute(predictions=predictions, references=labels, average=None)['f1']
    acc = acc_metric.compute(predictions=predictions, references=labels)['accuracy']

    # NOTE: We define the F1 here as the average score of all categories
    avg_f1 = np.mean(f1)

    return { 'f1': avg_f1, 'accuracy': acc }
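
Averaging the per-label F1 scores with equal weight is usually called macro F1. The following dependency-free sketch works through the arithmetic on a made-up three-class example; the helper `per_class_f1` is hypothetical and only mirrors what the metric computes:

```python
def per_class_f1(references, predictions, n_classes):
    """Return one F1 score per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(n_classes):
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == c and p == c)
        fp = sum(1 for r, p in pairs if r != c and p == c)
        fn = sum(1 for r, p in pairs if r == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Made-up gold labels and predictions
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

f1_scores = per_class_f1(references, predictions, n_classes=3)
macro_f1 = sum(f1_scores) / len(f1_scores)
accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)

print([round(s, 3) for s in f1_scores])        # [0.5, 0.8, 0.667]
print(round(macro_f1, 3), round(accuracy, 3))  # 0.656 0.667
```

Because every class counts equally, a rare label that the model handles badly lowers macro F1 much more than it lowers accuracy.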

# Fine-tuning

We can finally train the RoBERTa model! The following block sets up all the training parameters, downloads a pretrained language model (RoBERTa), wraps everything into a `Trainer` object, and then trains the model on the labeled dataset. Because `load_best_model_at_end` is enabled, the model saved at the end is the checkpoint with the best validation F1.

In [None]:
args = TrainingArguments(
    output_dir="model",
    eval_strategy = "epoch", # Print results after each epoch
    save_strategy = "epoch", # If loading best model, save + eval need to match
    per_device_train_batch_size=8, # Default is 8
    per_device_eval_batch_size=8,
    num_train_epochs=15.0, # default 3
    learning_rate = 5e-05, # default: 5e-05
    adam_epsilon = 1e-8, # Taken from Rubing's script
    load_best_model_at_end = True, # Default: False
    metric_for_best_model = metric,
    greater_is_better = is_greater_better,
    # use_mps_device=True #  <-- UNCOMMENT this line if you are using a MacOS machine
  )

# NOTE: This will download the RoBERTa Base model to your machine
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(SUPP_LABELS), # How many labels should the model learn to assign?
    problem_type="single_label_classification"
  )

trainer = Trainer(
  model=model,
  args=args,
  train_dataset=support_train,
  eval_dataset=support_valid,
  compute_metrics=compute_metrics_support
)

print("Training support category model!")
trainer.train()
trainer.save_model("finetuned_model")
print("Model trained!")
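
To get a feeling for how long training will run, the batch size and the number of epochs can be turned into a number of optimizer steps. A back-of-the-envelope sketch, assuming 720 training sentences (80% of the 900 gold sentences), a single device, and no gradient accumulation:

```python
import math

n_train = 720    # assumption: 80% of 900 gold sentences
batch_size = 8   # per_device_train_batch_size
n_epochs = 15    # num_train_epochs

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs

print(steps_per_epoch, total_steps)  # 90 1350
```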


# Predictions

We are ready to annotate the rest of the texts. We first load the fine-tuned model from the folder where we saved it after training.
This restores all the learned weights so we can use the model for making predictions.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("finetuned_model")
model.to(device)

## Annotating

We load the corpus that we want to annotate and predict a label for each sentence using our fine-tuned model. We write the results to a file (*AL_predictions.tsv*).

In [None]:
corpus = pd.read_csv("AL_corpus.tsv", sep="\t")
print(f"Corpus size: {len(corpus)} sentences.")

In [None]:
corpus

In [None]:
with open("AL_predictions.tsv", "w", encoding="utf-8") as fp:
    fp.write("year\tsentence\tsupport_label\n")
    for row in tqdm(corpus.itertuples(), total=len(corpus), desc="Predicting", dynamic_ncols=True):
        tok = tokenizer(row.sentence, padding="max_length", truncation=True, return_tensors='pt')
        tok = tok.to(device)

        # Inference only: no gradients are needed, which saves time and memory
        with torch.no_grad():
            output = model(**tok)
        predictions = output.logits.squeeze(0).cpu().numpy()
        supp_label = int(np.argmax(predictions)) # Index of the largest logit

        fp.write(f"{row.year}\t{row.sentence}\t{SUPP_LABELS[supp_label]}\n")
        fp.flush() # Make sure we can watch as the file fills

    print("Prediction done! You can find the predictions in the 'AL_predictions.tsv' file.")
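
The loop keeps only the index of the largest logit. If you also want a rough confidence for each prediction, the logits can be turned into probabilities with a softmax. A dependency-free sketch with made-up logits (the values are invented, and the helper `softmax` is written out here for illustration):

```python
import math

SUPP_LABELS = [
    "Academic", "Moral", "Tech", "Data",
    "Library", "Finance", "Religious", "Unknown",
]

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.1, 0.3, -1.0, -0.5, -2.0, 0.1, -1.2, 0.0]  # made-up model output
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(SUPP_LABELS[best], round(probs[best], 3))
```

The softmax never changes which label wins, but low winning probabilities can point to sentences worth checking by hand.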

## Visualizing the result

In [None]:
# read AL_predictions.tsv as pandas DataFrame
predictions_df = pd.read_csv("AL_predictions.tsv", sep="\t", encoding="utf-8")
print(predictions_df.head(10))


In [None]:
# we now plot the distribution of support labels
predictions_df['support_label'].value_counts().plot(kind='bar')
plt.title("Distribution of support labels")
plt.xlabel("Label")
plt.ylabel("Count")
plt.tight_layout()

**What can you observe?** Are some categories much more common than others? Does anything look surprising?

In [None]:
counts = predictions_df.groupby(["year", "support_label"]).size().reset_index(name="count")

# Line plot of support label frequency over time
plt.figure(figsize=(10, 6))
sns.lineplot(data=counts, x="year", y="count", hue="support_label", marker="o", palette="Set2")

plt.title("Support label frequency over time")
plt.xlabel("Year")
plt.ylabel("Number of sentences")
plt.grid(True, linestyle="--", alpha=0.5)
plt.xticks(sorted(predictions_df["year"].unique()))
plt.tight_layout()
plt.legend(title="Label")

What can you observe from this temporal perspective?
Do certain categories appear more in earlier or later years? Are there any noticeable shifts, spikes, or disappearances over time?
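
Raw counts also grow whenever a year simply contains more sentences, so shares per year can be easier to compare. A dependency-free sketch of that normalization on made-up `(year, label)` pairs, which stand in for the rows of *AL_predictions.tsv*:

```python
from collections import Counter

# Made-up (year, label) pairs, invented for illustration
rows = [
    (2010, "Academic"), (2010, "Moral"),
    (2011, "Academic"), (2011, "Academic"), (2011, "Moral"), (2011, "Finance"),
]

pair_counts = Counter(rows)                         # sentences per (year, label)
year_totals = Counter(year for year, _ in rows)     # sentences per year
shares = {
    (year, label): count / year_totals[year]
    for (year, label), count in pair_counts.items()
}

print(shares[(2010, "Academic")], shares[(2011, "Academic")])  # 0.5 0.5
```

In this toy example the count of "Academic" doubles from 2010 to 2011, while its share of each year's sentences stays the same.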

## Extra: Handling authentication with the Hugging Face Hub 

Hugging Face is a company and open-source community that provides tools, models, and libraries for working with machine learning, especially NLP and LLMs.

Some models on the Hugging Face Hub (for example gated or private ones) require an access token; the `roberta-base` model used in this notebook can be downloaded without one. The token securely links your Hugging Face account to your code.

**How to Create a Hugging Face Access Token:**

* Create a [Hugging Face account](https://huggingface.co) (if you don't already have one).
* After logging in, go to your Access Tokens page and click on "new token".
* Choose a name (e.g., sicss-token), select the role, and click "create".
* Copy the token and paste it in the cell below.
* **Never share your token publicly!**

In [None]:
from huggingface_hub import login
my_token = "PASTE YOUR TOKEN HERE"
login(token=my_token)
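
One way to keep the token out of the notebook, so it cannot be shared by accident, is to read it from an environment variable. A minimal sketch: the helper `read_token` is hypothetical, and `HF_TOKEN` is used as the variable name because `huggingface_hub` looks for it too, although any name works here:

```python
import os

def read_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None

my_token = read_token()
if my_token is None:
    print("No token found; set the HF_TOKEN environment variable first.")

# With a token available, the login call stays the same as in the cell above:
# login(token=my_token)
```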

# Conclusion

Thank you for following along with this workshop!

If you have further questions, do not hesitate to contact me: <diletta.goglia@it.uu.se> | @dilettagoglia on social media.

Special thanks to Hendrik for his help and inspiration for this notebook!