# BERT baseline for POLAR

## Introduction

This starter notebook walks through a baseline for each of the three subtasks.

## Subtask 1 - Polarization detection

This is a binary classification task: decide whether a post contains polarized content (Polarized or Not Polarized).

In [1]:
# !unzip dev_phase.zip

## Imports

In [2]:
import numpy as np
import pandas as pd
import torch

from sklearn.metrics import f1_score, precision_score, recall_score

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from torch.utils.data import Dataset

In [3]:
import wandb

# Disable wandb logging for this script
wandb.init(mode="disabled")

## Data Import

The training data consists of short texts with binary labels.

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- polarization: 1 if the text is polarized, 0 if it is not

All three subtask folders contain the same texts; each folder carries only the labels for its own subtask.

In [4]:
# Load the training and validation data for subtask 1

from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('./dev_phase/subtask1/train/eng.csv')

# Split into train (80%) and val (20%), stratify if possible
train, val = train_test_split(
    data,
    test_size=0.2,
    random_state=42,
    stratify=data["polarization"] if "polarization" in data.columns else None
)

train.head()

Unnamed: 0,id,text,polarization
1350,eng_f308ca27dccc22549e50f1042ceb1df8,And where did I say h8 or xenophobia?,0
2605,eng_b1d3f6b9a86d738b0dbe5a5d891f79ef,WOW bad will the RedWAVE bloodbath be for the ...,1
1754,eng_5836d1c4db446e1a501fe30fb1b2615c,"Breitbart is racist trash, for revealing Racis...",1
634,eng_789191ae3e9c3a9c6de6565559820379,"Israeli Bedouins, lacking government protectio...",0
2700,eng_737e402f83d6f36f2023195aecac92df,"As Joe Biden leaves the White House today, I r...",0
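
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified 80/20 split. It is not scikit-learn's implementation; the helper `stratified_split` and the toy label counts are made up for illustration. The point is only that each class is split in the same proportion, so the validation set keeps the class balance of the full data.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels (hypothetical): 70 non-polarized (0) and 30 polarized (1) posts.
toy_labels = [0] * 70 + [1] * 30
train_idx, val_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print("train:", dict(train_counts), "val:", dict(val_counts))
```

Both splits keep the 70/30 ratio of the toy data, which is what the `stratify=data["polarization"]` argument asks `train_test_split` to do.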


## Dataset
- Create a PyTorch dataset class for handling the data
- Wrap the raw texts and labels in a format that Hugging Face's Trainer can use for training and evaluation

In [5]:
# Dataset wrapper: tokenizes one text per item and returns tensors the Trainer can batch
class PolarizationDataset(torch.utils.data.Dataset):
  def __init__(self,texts,labels,tokenizer,max_length =128):
    self.texts=texts
    self.labels=labels
    self.tokenizer= tokenizer
    self.max_length = max_length # Store max_length

  def __len__(self):
    return len(self.texts)

  def __getitem__(self,idx):
    text=self.texts[idx]
    label=self.labels[idx]
    encoding=self.tokenizer(text,truncation=True,padding=False,max_length=self.max_length,return_tensors='pt')

    # Ensure consistent tensor conversion for all items
    item = {key: encoding[key].squeeze() for key in encoding.keys()}
    item['labels'] = torch.tensor(label, dtype=torch.long)
    return item
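
The dataset above tokenizes with `padding=False`, so items have different lengths and padding is left to the data collator at batch time. As a rough illustration of that idea (not the actual `DataCollatorWithPadding` code), the sketch below pads a batch to its longest sequence. The helper `pad_batch`, the token IDs, and `pad_id=0` are assumptions made for the example; the real tokenizer defines its own padding token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token ID sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 2023, 2003, 1037, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed `max_length` keeps batches of short posts small, which usually saves compute.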

Now, we'll tokenize the text data and create the datasets using the `microsoft/deberta-v3-base` tokenizer.

In [6]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')

# Create datasets
train_dataset = PolarizationDataset(train['text'].tolist(), train['polarization'].tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val['polarization'].tolist(), tokenizer)



Next, we'll load the pre-trained `microsoft/deberta-v3-base` model for sequence classification. Since this is a binary classification task (Polarized/Not Polarized), we set `num_labels=2`.

In [7]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-base', num_labels=2)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we'll define the training arguments and the evaluation metric. We'll use macro F1 score for evaluation.

In [8]:
# Define metrics function
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
        output_dir=f"./",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=8,
        eval_strategy="epoch",
        save_strategy="no",
        logging_steps=100,
        disable_tqdm=False
    )
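
For readers unfamiliar with macro F1, the following pure-Python sketch spells out what `f1_score(..., average='macro')` computes: one F1 score per class, then an unweighted mean. The helper names and the toy labels are made up for illustration, and this is a simplified re-derivation, not scikit-learn's code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: four non-polarized posts and two polarized posts.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
print(score)  # class 0 scores 0.75, class 1 scores 0.5, so the mean is 0.625
```

Because every class counts equally, a model that does well on the majority class but poorly on the minority class is penalized more than it would be under accuracy.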


Finally, we'll initialize the `Trainer` and start training.

In [9]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    data_collator=DataCollatorWithPadding(tokenizer) # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set: {eval_results['eval_f1_macro']}")

Epoch,Training Loss,Validation Loss,F1 Macro
1,No log,0.4716,0.777037
2,No log,0.42274,0.795289
3,0.462100,0.428515,0.801731


Macro F1 score on validation set: 0.801730707135938


In [10]:
from tqdm import tqdm

In [11]:
test_dataset = pd.read_csv('./dev_phase/subtask1/dev/eng.csv')
labels = []
probs_list = []
for text in tqdm(test_dataset['text']):
    # Run the model
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=1).cpu().numpy()[0]
        pred_label = logits.argmax(dim=1).cpu().numpy()[0]
        labels.append(pred_label)
        probs_list.append(probabilities)

  0%|          | 0/160 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 160/160 [00:04<00:00, 38.98it/s]
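
The inference loop turns logits into probabilities with softmax and takes the predicted label with argmax. The small sketch below reproduces those two steps in plain Python on made-up logits, to show that softmax yields a distribution summing to 1 and that the argmax of the logits and of the probabilities agree. The `softmax` helper and the numbers are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the two classes (index 0: not polarized, index 1: polarized).
logits = [1.2, -0.4]
probs = softmax(logits)

# Softmax is monotonic, so the largest logit is also the largest probability.
pred_label = max(range(len(logits)), key=lambda i: logits[i])
print(probs, pred_label)
```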


In [12]:
# print the results row by row in csv format
for i in range(len(labels)):
    print(f"{test_dataset['id'][i]},{labels[i]}")


eng_f66ca14d60851371f9720aaf4ccd9b58,0
eng_3a489aa7fed9726aa8d3d4fe74c57efb,0
eng_95770ff547ea5e48b0be00f385986483,0
eng_2048ae6f9aa261c48e6d777bcc5b38bf,0
eng_07781aa88e61e7c0a996abd1e5ea3a20,0
eng_153d96f9dc27f0602c927223404d94b5,1
eng_4ab5a4cc5c87d0af9cf4b80c301647bf,0
eng_e75a95ba52930d6d72d503ab9469eb29,0
eng_eb8fab668668f9959cafdecbfc0f081a,0
eng_702724dc168d600e788d775c8e651f36,0
eng_0efa1a3567443075db38c7ce2dcca571,0
eng_d08d4243fd2786795df39c1a65dacac7,0
eng_79fa99ba6989fb61ec127c6c99fc2343,1
eng_30981038b71c210e97731d90e86038c5,0
eng_b75e663a6fdc280171b6385b99306a3c,0
eng_de2baffcfc59b672905e6f2694f672f6,0
eng_d887794cce49564890a2552cdfb745d2,0
eng_097b78cf6e209e6778e1fbda10d28b8d,0
eng_6d29b7c72a091789d06d92157688e07f,0
eng_00c797e70f1a2f50f1c59655f08581e1,1
eng_f1e66eab0b9b2c4a83103eb65a67046e,0
eng_4bf47d33b804477375ade2a151cb2614,0
eng_6b4616355cbed93c122ca3e2369b3a0d,0
eng_f011b534fb34efc8c574d2e16d3a95f5,0
eng_8376c26da2f537abeada29ed927d3e01,0
eng_af8597fa9be96fdfabb05

## Subtask 2 - Polarization Type Classification
Multi-label classification to identify the target of polarization. Each post can carry any number of the following categories: Gender/Sexual, Political, Religious, Racial/Ethnic, or Other.
For this task we load the data for subtask 2.
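
Unlike Subtask 1, where each post has exactly one class, a multi-label target is a vector with one 0/1 entry per category. The sketch below shows that encoding in plain Python. The column names are the ones used in the training CSV for this subtask; the helper `to_multi_hot` is hypothetical and only serves the illustration.

```python
LABEL_COLUMNS = ["gender/sexual", "political", "religious", "racial/ethnic", "other"]


def to_multi_hot(active_labels):
    """Encode a set of category names as a float vector, one slot per category."""
    active = set(active_labels)
    unknown = active - set(LABEL_COLUMNS)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    # Floats, because the multi-label loss compares these targets with sigmoid outputs.
    return [1.0 if name in active else 0.0 for name in LABEL_COLUMNS]


both = to_multi_hot(["political", "racial/ethnic"])
none = to_multi_hot([])
print(both)  # [0.0, 1.0, 0.0, 1.0, 0.0]
print(none)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

A post with no active category is simply the all-zero vector, so non-polarized posts need no separate class.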

In [13]:
# Load the data
data = pd.read_csv('./dev_phase/subtask2/train/eng.csv')

# Split into train (80%) and val (20%); this split is not stratified
train, val = train_test_split(
    data,
    test_size=0.2,
    random_state=42,
)

In [14]:
# Dataset wrapper for the multi-label subtasks (redefines the Subtask 1 class)
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length # Store max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Ensure consistent tensor conversion for all items
        item = {key: encoding[key].squeeze() for key in encoding.keys()}
        # Multi-label targets must be float (not long) for the binary cross-entropy loss
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item


In [15]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')

# Create train and validation datasets for multi-label classification
label_cols = ['gender/sexual', 'political', 'religious', 'racial/ethnic', 'other']
train_dataset = PolarizationDataset(train['text'].tolist(), train[label_cols].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[label_cols].values.tolist(), tokenizer)
dev_dataset = val_dataset  # built from the same validation split; not used below




In [16]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-base', num_labels=5, problem_type="multi_label_classification") # 5 labels

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
    output_dir=f"./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)
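
The multi-label metric applies a sigmoid to each logit, thresholds at 0.5, and averages one F1 score per label. The pure-Python sketch below walks through those steps on made-up logits. It is a simplified re-derivation rather than scikit-learn's code, and all helper names and numbers are illustrative. It also shows one property worth keeping in mind when reading macro F1 on sparse labels: a label that is never predicted contributes an F1 of 0 to the average.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarize(logit_rows, threshold=0.5):
    """Turn per-label logits into independent 0/1 decisions."""
    return [[int(sigmoid(x) > threshold) for x in row] for row in logit_rows]


def label_f1(y_true, y_pred, j):
    """F1 for label column j across all samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_multilabel(y_true, y_pred):
    n_labels = len(y_true[0])
    return sum(label_f1(y_true, y_pred, j) for j in range(n_labels)) / n_labels


# Made-up logits for three posts and two labels; the second label is never predicted.
logits = [[2.0, -3.0], [1.5, -2.0], [-1.0, -0.5]]
y_true = [[1, 0], [1, 1], [0, 0]]
y_pred = binarize(logits)
score = macro_f1_multilabel(y_true, y_pred)
print(y_pred, score)  # label 0 scores 1.0, label 1 scores 0.0, so the mean is 0.5
```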

In [18]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 2: {eval_results['eval_f1_macro']}")

A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.


Epoch,Training Loss,Validation Loss,F1 Macro
1,0.2296,0.220426,0.132258
2,0.1851,0.20162,0.138462
3,0.1545,0.217973,0.192754


Macro F1 score on validation set for Subtask 2: 0.19275362318840578


In [19]:
from tqdm import tqdm
test_dataset = pd.read_csv('./dev_phase/subtask2/dev/eng.csv')
labels = []
probs_list = []
for text in tqdm(test_dataset['text']):
    # Run the model
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.sigmoid(logits).cpu().numpy()[0]
        pred_label = (probabilities > 0.5).astype(int)
        labels.append(pred_label)
        probs_list.append(probabilities)

  0%|          | 0/160 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 160/160 [00:04<00:00, 38.86it/s]


In [20]:
# print the results row by row in csv format
for i in range(len(labels)):
    print(f"{test_dataset['id'][i]}," + ",".join(str(x) for x in labels[i]))


eng_f66ca14d60851371f9720aaf4ccd9b58,0,0,0,0,0
eng_3a489aa7fed9726aa8d3d4fe74c57efb,0,0,0,0,0
eng_95770ff547ea5e48b0be00f385986483,0,0,0,0,0
eng_2048ae6f9aa261c48e6d777bcc5b38bf,0,0,0,0,0
eng_07781aa88e61e7c0a996abd1e5ea3a20,0,0,0,0,0
eng_153d96f9dc27f0602c927223404d94b5,0,0,0,0,0
eng_4ab5a4cc5c87d0af9cf4b80c301647bf,0,0,0,0,0
eng_e75a95ba52930d6d72d503ab9469eb29,0,0,0,0,0
eng_eb8fab668668f9959cafdecbfc0f081a,0,0,0,0,0
eng_702724dc168d600e788d775c8e651f36,0,0,0,0,0
eng_0efa1a3567443075db38c7ce2dcca571,0,0,0,0,0
eng_d08d4243fd2786795df39c1a65dacac7,0,0,0,0,0
eng_79fa99ba6989fb61ec127c6c99fc2343,0,1,0,0,0
eng_30981038b71c210e97731d90e86038c5,0,0,0,0,0
eng_b75e663a6fdc280171b6385b99306a3c,0,0,0,0,0
eng_de2baffcfc59b672905e6f2694f672f6,0,0,0,0,0
eng_d887794cce49564890a2552cdfb745d2,0,0,0,0,0
eng_097b78cf6e209e6778e1fbda10d28b8d,0,0,0,0,0
eng_6d29b7c72a091789d06d92157688e07f,0,0,0,0,0
eng_00c797e70f1a2f50f1c59655f08581e1,0,1,0,1,0
eng_f1e66eab0b9b2c4a83103eb65a67046e,0,0,0,0,0
eng_4bf47d33b

## Subtask 3 - Manifestation Identification
Multi-label classification of how polarization is expressed. Each post can carry any number of the following labels: Vilification, Extreme Language, Stereotype, Invalidation, Lack of Empathy, and Dehumanization.



In [21]:
# Load the data
data = pd.read_csv('./dev_phase/subtask3/train/eng.csv')

# Split into train (80%) and val (20%); this split is not stratified
train, val = train_test_split(
    data,
    test_size=0.2,
    random_state=42,
)

In [22]:
# Dataset wrapper for the multi-label subtasks (same definition as in Subtask 2)
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length # Store max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Ensure consistent tensor conversion for all items
        item = {key: encoding[key].squeeze() for key in encoding.keys()}
        # Multi-label targets must be float (not long) for the binary cross-entropy loss
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item

In [23]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')

# Create train and validation datasets for multi-label classification
label_cols = ['vilification', 'extreme_language', 'stereotype', 'invalidation', 'lack_of_empathy', 'dehumanization']
train_dataset = PolarizationDataset(train['text'].tolist(), train[label_cols].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[label_cols].values.tolist(), tokenizer)



In [24]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-v3-base', num_labels=6, problem_type="multi_label_classification") # use 6 labels

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Define training arguments
training_args = TrainingArguments(
    output_dir=f"./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)

# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

In [26]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 3: {eval_results['eval_f1_macro']}")

A ConfigError was raised whilst setting the number of model parameters in Weights & Biases config.


Epoch,Training Loss,Validation Loss,F1 Macro
1,0.3769,0.380339,0.064461
2,0.3357,0.328302,0.219581
3,0.3067,0.345058,0.311165


Macro F1 score on validation set for Subtask 3: 0.3111648990997745


In [27]:
from tqdm import tqdm

In [28]:
test_dataset = pd.read_csv('./dev_phase/subtask3/dev/eng.csv')
labels = []
probs_list = []
for text in tqdm(test_dataset['text']):
    # Run the model
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probabilities = torch.sigmoid(logits).cpu().numpy()[0]
        pred_label = (probabilities > 0.5).astype(int)
        labels.append(pred_label)
        probs_list.append(probabilities)

  0%|          | 0/160 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 160/160 [00:04<00:00, 38.67it/s]


In [30]:
# print the results row by row in csv format
for i in range(len(labels)):
    print(f"{test_dataset['id'][i]}," + ",".join(str(x) for x in labels[i]))


eng_f66ca14d60851371f9720aaf4ccd9b58,0,0,0,0,0,0
eng_3a489aa7fed9726aa8d3d4fe74c57efb,0,0,0,0,0,0
eng_95770ff547ea5e48b0be00f385986483,0,0,0,0,0,0
eng_2048ae6f9aa261c48e6d777bcc5b38bf,0,0,0,0,0,0
eng_07781aa88e61e7c0a996abd1e5ea3a20,0,0,0,0,0,0
eng_153d96f9dc27f0602c927223404d94b5,0,0,0,0,0,0
eng_4ab5a4cc5c87d0af9cf4b80c301647bf,0,0,0,0,0,0
eng_e75a95ba52930d6d72d503ab9469eb29,0,0,0,0,0,0
eng_eb8fab668668f9959cafdecbfc0f081a,0,0,0,0,0,0
eng_702724dc168d600e788d775c8e651f36,0,0,0,0,0,0
eng_0efa1a3567443075db38c7ce2dcca571,0,0,0,0,0,0
eng_d08d4243fd2786795df39c1a65dacac7,0,0,0,0,0,0
eng_79fa99ba6989fb61ec127c6c99fc2343,1,0,0,0,0,0
eng_30981038b71c210e97731d90e86038c5,0,0,0,0,0,0
eng_b75e663a6fdc280171b6385b99306a3c,0,0,0,0,0,0
eng_de2baffcfc59b672905e6f2694f672f6,0,0,0,0,0,0
eng_d887794cce49564890a2552cdfb745d2,0,0,0,0,0,0
eng_097b78cf6e209e6778e1fbda10d28b8d,0,0,0,0,0,0
eng_6d29b7c72a091789d06d92157688e07f,0,0,0,0,0,0
eng_00c797e70f1a2f50f1c59655f08581e1,1,1,0,0,0,0
eng_f1e66eab0b9b2c4a
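
Printing rows works for inspection, but predictions usually need to end up in a file. The sketch below writes the same `id,label,...` rows with Python's `csv` module. The label column names are the ones from the Subtask 3 training data; the header row, the helper `predictions_to_csv`, and the two sample IDs are assumptions for illustration, so check the task's submission instructions for the exact expected format.

```python
import csv
import io

LABEL_COLUMNS = ["vilification", "extreme_language", "stereotype",
                 "invalidation", "lack_of_empathy", "dehumanization"]


def predictions_to_csv(ids, predictions, label_columns, out):
    """Write one row per sample: the id followed by one 0/1 value per label."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", *label_columns])  # header row is an assumption
    for sample_id, row in zip(ids, predictions):
        writer.writerow([sample_id, *(int(v) for v in row)])


# Hypothetical IDs and predictions; in the notebook these would come from
# test_dataset['id'] and the labels list built in the inference loop.
buffer = io.StringIO()
predictions_to_csv(
    ["eng_example_1", "eng_example_2"],
    [[0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]],
    LABEL_COLUMNS,
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out`.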