# BERT Baseline for POLAR

## Introduction

This part of the starter notebook walks through a baseline for each of the three subtasks.

## Subtask 1 - Polarization detection

This is a binary classification task: determine whether a post contains polarized content (Polarized or Not Polarized).

In [1]:
# Requires gdown (install it with `pip install gdown` if it is missing)
# Replace the ID below with your actual file ID from the Drive link
# (The ID is the long string of random characters in the URL)
file_id = '1Cvdkk_AZQzM5rJYhV4Nq-8bXvJrF8t4z'
url = f'https://drive.google.com/uc?id={file_id}'
output = 'dev_phase.zip'

!gdown {url} -O {output}

!unzip {output}

# Delete __MACOSX directory (if exists) and the dev_phase.zip file (cleanup)
import os
import shutil

if os.path.exists("__MACOSX"):
    shutil.rmtree("__MACOSX")

if os.path.exists("dev_phase.zip"):
    os.remove("dev_phase.zip")

Downloading...
From: https://drive.google.com/uc?id=1Cvdkk_AZQzM5rJYhV4Nq-8bXvJrF8t4z
To: /content/dev_phase.zip
100% 10.1M/10.1M [00:00<00:00, 20.8MB/s]
Archive:  dev_phase.zip
   creating: dev_phase/
  inflating: __MACOSX/._dev_phase    
   creating: dev_phase/subtask2/
  inflating: __MACOSX/dev_phase/._subtask2  
   creating: dev_phase/subtask3/
  inflating: __MACOSX/dev_phase/._subtask3  
  inflating: dev_phase/.DS_Store     
  inflating: __MACOSX/dev_phase/._.DS_Store  
   creating: dev_phase/subtask1/
  inflating: __MACOSX/dev_phase/._subtask1  
  inflating: dev_phase/subtask2/.DS_Store  
  inflating: __MACOSX/dev_phase/subtask2/._.DS_Store  
   creating: dev_phase/subtask2/train/
  inflating: __MACOSX/dev_phase/subtask2/._train  
   creating: dev_phase/subtask2/dev/
  inflating: __MACOSX/dev_phase/subtask2/._dev  
  inflating: dev_phase/subtask3/.DS_Store  
  inflating: __MACOSX/dev_phase/subtask3/._.DS_Store  
   creating: dev_phase/subtask3/train/
  inflating: __MACOSX/dev_pha

## Imports

In [None]:
import pandas as pd

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

import torch

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from torch.utils.data import Dataset
from tqdm.auto import tqdm

In [2]:
import wandb

# Disable wandb logging for this script
wandb.init(mode="disabled")

In [3]:
import os, random, numpy as np
import torch

seed = 42

# python
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)

# numpy
np.random.seed(seed)

# pytorch
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# deterministic behavior
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

## Data Import

The training data consists of short texts with binary labels.

The data is structured as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- polarization: 1 if the text is polarized, 0 if it is not

All three subtask folders contain the same texts; each folder carries only the labels for its own subtask.
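
As a quick illustration of this layout, the sketch below parses a few rows and counts the labels. The rows are hypothetical stand-ins, not entries from the dataset, and the standard-library `csv` module is used only so the snippet runs without the downloaded files.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the id/text/polarization layout (not real dataset entries)
sample_csv = """id,text,polarization
ex_001,The council meets on Tuesday,0
ex_002,Those people ruin everything they touch,1
ex_003,New library hours announced,0
"""

# Parse the CSV text into a list of dicts keyed by column name
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many samples carry each label (0 = not polarized, 1 = polarized)
label_counts = Counter(int(row["polarization"]) for row in rows)

print(label_counts[0], label_counts[1])  # prints: 2 1
```

Checking the label counts this way is a cheap sanity test before training, since a skewed label distribution affects how the split and the metrics should be read.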

In [23]:
# Load the training and dev data for subtask 1 (the dev file is kept in `test`)

train = pd.read_csv('./dev_phase/subtask1/train/eng.csv')
test = pd.read_csv('./dev_phase/subtask1/dev/eng.csv')

train.head()

Unnamed: 0,id,text,polarization
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0


## Dataset
- Create a PyTorch class for handling the data
- Wrap the raw texts and labels in a format that Hugging Face's Trainer can use for training and evaluation
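
The dataset class in this section returns unpadded encodings (`padding=False`), and `DataCollatorWithPadding`, passed to the `Trainer` later, pads each batch to its longest sequence. The sketch below shows that idea in plain Python. The `pad_batch` helper and the token IDs are made up for illustration; this is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-id lists to the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    # Right-pad every sequence with pad_id up to the longest length
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token-id sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])

print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to a fixed `max_length` keeps batches of short posts small, which usually saves compute.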

In [24]:
# Dataset wrapper: tokenizes one text per item and leaves padding to the data collator
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length  # texts longer than this are truncated

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        item['labels'] = torch.tensor(label, dtype=torch.long)
        return item

Now, we'll create the datasets using the `bert-base-uncased` tokenizer. A stratified 80/20 split of the training file provides the validation set.

In [25]:
from sklearn.model_selection import train_test_split
# Load the tokenizer
model_names = ['bert-base-uncased', "UBC-NLP/MARBERTv2"]
model_name = model_names[0]
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts_train, texts_val, labels_train, labels_val = train_test_split(
    train['text'].tolist(),
    train['polarization'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=train['polarization'].tolist(),  # keep the class ratio equal in both splits
)

train_dataset = PolarizationDataset(texts_train, labels_train, tokenizer)
val_dataset = PolarizationDataset(texts_val, labels_val, tokenizer)
# test_dataset = PolarizationDataset(test['text'].tolist(), test['polarization'].tolist(), tokenizer)

Next, we'll load the pre-trained `bert-base-uncased` model for sequence classification. Since this is a binary classification task (Polarized/Not Polarized), we set `num_labels=2`.

In [26]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we'll define the training arguments and the evaluation metric. We'll use macro F1 score for evaluation.

In [None]:
# Define metrics function
# Primary Metric F1 MACRO SCORE
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        'f1_macro': f1_score(p.label_ids, preds, average='macro'),
        'accuracy': accuracy_score(p.label_ids, preds),
        'precision': precision_score(p.label_ids, preds, average='binary'),
        'recall': recall_score(p.label_ids, preds, average='binary'),
        'f1_binary': f1_score(p.label_ids, preds, average='binary'),
        'f1_micro': f1_score(p.label_ids, preds, average='micro')
    }

# Define training arguments
training_args = TrainingArguments(
        output_dir=f"./",
        num_train_epochs=3,
        learning_rate=2e-5,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=8,
        eval_strategy="epoch",
        save_strategy="no",
        logging_steps=100,
        disable_tqdm=False
    )


Finally, we'll initialize the `Trainer` and start training.

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    data_collator=DataCollatorWithPadding(tokenizer) # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Validation Results:\nAccuracy: {eval_results['eval_accuracy']:.4f}\nPrecision: {eval_results['eval_precision']:.4f}\nRecall: {eval_results['eval_recall']:.4f}\nF1 (binary): {eval_results['eval_f1_binary']:.4f}\nF1 (macro): {eval_results['eval_f1_macro']:.4f}\nF1 (micro): {eval_results['eval_f1_micro']:.4f}")

Epoch,Training Loss,Validation Loss,F1 Macro
1,No log,0.492411,0.718954
2,No log,0.468928,0.744665
3,0.439700,0.468923,0.738514


Macro F1 score on validation set: 0.7385141903793659
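
Macro F1, the primary metric here, is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The sketch below computes it by hand on made-up labels. The helper names are hypothetical, and in the notebook itself `sklearn.metrics.f1_score(..., average='macro')` does this work.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the class never appears in labels or predictions
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels and predictions for six samples
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f1_for_class(y_true, y_pred, 0))  # 0.75
print(f1_for_class(y_true, y_pred, 1))  # 0.5
print(macro_f1(y_true, y_pred))         # 0.625
```

In this toy case accuracy is 4/6, yet macro F1 is pulled down to 0.625 by the weaker score on class 1, which is why it is a stricter measure on imbalanced data.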


In [29]:
# Predict on the dev set, one text at a time
test_dataset = pd.read_csv('./dev_phase/subtask1/dev/eng.csv')
labels = []
probs_list = []
model.eval()  # make sure dropout is disabled for inference
for text in tqdm(test_dataset['text']):
    print(text)
    # Tokenize a single text and move the tensors to the model's device
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        # Softmax turns the two logits into class probabilities
        probabilities = torch.softmax(logits, dim=1).cpu().numpy()[0]
        pred_label = logits.argmax(dim=1).cpu().numpy()[0]
        labels.append(pred_label)
        probs_list.append(probabilities)

  0%|          | 0/133 [00:00<?, ?it/s]

God is with Ukraine and Zelensky
4 Dems, 2 Republicans Luzerne County Council seatsDallas
Abuse Survivor Recounts Her Struggles at YWCA Event
After Rwanda, another deportation camp disaster
Another plea in Trump election interference probe
any number of southern red states tbh
Breitbart is the new age grocery aisle tabloid
Congressman talks border security, debit limit
Cooke Co. GOP incumbents announce reelection runs
Could one overwhelm the iron dome?
Dan PatrickDade Phelan feud in Texas GOP threatens
Democrats meet in Plymouth for moderated gubernatorial
Election fraud did not go away, it is here. Every American must vote to overcome the election fraud. 2022 and 2024.
Faculty , educate students on the IsraelHamas war
GOP defense budgets abortion, diversity limits draw Biden
GOP lawmakers ramp up pushback on Snake River dam
Guilford County GOP Congratulates New School Board
Hamas has nobody else to rescue them. Its over! Hamas says Gaza ceasefire deal closer than ever before Qatar, al

In [30]:
# print the results row by row in csv format
for i in range(len(labels)):
    print(f"{test_dataset['id'][i]},{labels[i]}, {probs_list[i]}")


eng_f66ca14d60851371f9720aaf4ccd9b58,0, [0.6541927 0.3458073]
eng_3a489aa7fed9726aa8d3d4fe74c57efb,0, [0.9624602  0.03753977]
eng_95770ff547ea5e48b0be00f385986483,0, [0.9794422  0.02055784]
eng_2048ae6f9aa261c48e6d777bcc5b38bf,0, [0.9325723  0.06742771]
eng_07781aa88e61e7c0a996abd1e5ea3a20,0, [0.9651683  0.03483172]
eng_153d96f9dc27f0602c927223404d94b5,0, [0.9345434  0.06545668]
eng_4ab5a4cc5c87d0af9cf4b80c301647bf,1, [0.27487347 0.7251265 ]
eng_e75a95ba52930d6d72d503ab9469eb29,0, [0.9790981  0.02090194]
eng_eb8fab668668f9959cafdecbfc0f081a,0, [0.9725691  0.02743092]
eng_702724dc168d600e788d775c8e651f36,0, [0.68961114 0.31038886]
eng_0efa1a3567443075db38c7ce2dcca571,0, [0.97684264 0.02315729]
eng_d08d4243fd2786795df39c1a65dacac7,0, [0.98095924 0.01904074]
eng_79fa99ba6989fb61ec127c6c99fc2343,1, [0.24789402 0.75210595]
eng_30981038b71c210e97731d90e86038c5,0, [0.95542306 0.04457694]
eng_b75e663a6fdc280171b6385b99306a3c,0, [0.97291416 0.0270858 ]
eng_de2baffcfc59b672905e6f2694f672f6,0, [0
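
Each row printed above holds the sample id, the predicted label, and the two class probabilities, where index 0 is "not polarized" and index 1 is "polarized". The sketch below shows how a pair of logits becomes such a row, using plain Python and made-up logit values rather than real model outputs.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, -1.0]  # made-up logits for (not polarized, polarized)
probs = softmax(logits)

# The predicted label is the index of the largest probability
pred_label = max(range(len(probs)), key=probs.__getitem__)

print(pred_label, [round(p, 4) for p in probs])  # 0 [0.8808, 0.1192]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits (as the inference cell does) gives the same label as taking the argmax of the probabilities.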

## Subtask 2 - Polarization Type Classification
This is a multi-label classification task: identify the target of polarization as one or more of the following categories: Gender/Sexual, Political, Religious, Racial/Ethnic, or Other.
For this task we load the data for subtask 2.
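
In a multi-label setup each post gets one 0/1 entry per category instead of a single class id. The sketch below builds such a vector. The `to_multi_hot` helper is hypothetical; the label order follows the column order used for the datasets in this subtask, and the values are floats because the multi-label loss in Transformers (binary cross-entropy with logits) expects float targets.

```python
# Column order used when building the Subtask 2 datasets
LABELS = ["gender/sexual", "political", "religious", "racial/ethnic", "other"]


def to_multi_hot(active, labels=LABELS):
    """Return one float per label: 1.0 if the label applies, else 0.0."""
    return [1.0 if name in active else 0.0 for name in labels]


# A post that is both political and religious
print(to_multi_hot({"political", "religious"}))  # [0.0, 1.0, 1.0, 0.0, 0.0]

# A post with no polarization target has an all-zero vector
print(to_multi_hot(set()))  # [0.0, 0.0, 0.0, 0.0, 0.0]
```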

In [15]:
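train = pd.read_csv('./dev_phase/subtask2/train/eng.csv')
# NOTE: `val` is read from the same file as `train`, so the validation score
# is measured on training data; hold out part of `train` for a real estimate.
val = pd.read_csv('./dev_phase/subtask2/train/eng.csv')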
train.head()

Unnamed: 0,id,text,political,racial/ethnic,religious,gender/sexual,other
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0,0,0,0,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0,0,0,0,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0,0,0,0,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0,0,0,0,0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0,0,0,0,0


In [16]:
# Same dataset wrapper as in Subtask 1, except that each label is a float vector with one entry per category
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length # Store max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        # Multi-label targets are float vectors (not long), as the binary cross-entropy loss expects
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item


In [17]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create the train and validation datasets for multi-label training (dev_dataset repeats val_dataset)
train_dataset = PolarizationDataset(train['text'].tolist(), train[['gender/sexual','political','religious','racial/ethnic','other']].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[['gender/sexual','political','religious','racial/ethnic','other']].values.tolist(), tokenizer)
dev_dataset = PolarizationDataset(val['text'].tolist(), val[['gender/sexual','political','religious','racial/ethnic','other']].values.tolist(), tokenizer)


In [18]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5, problem_type="multi_label_classification") # 5 labels

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
    output_dir=f"./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 2: {eval_results['eval_f1_macro']}")
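
The multi-label metric function applies a sigmoid to each logit and marks a label as present when its probability exceeds 0.5. The sketch below reproduces that rule in plain Python on made-up logits. The `logits_to_multi_hot` helper and its `threshold` parameter are illustrative names, not part of the notebook.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logits_to_multi_hot(logits, threshold=0.5):
    """Mark each label as present (1) when its probability exceeds the threshold."""
    return [1 if sigmoid(x) > threshold else 0 for x in logits]


logits = [2.0, -1.5, 0.0, 0.3, -3.0]  # made-up logits, one per label

print(logits_to_multi_hot(logits))                 # [1, 0, 0, 1, 0]
print(logits_to_multi_hot(logits, threshold=0.4))  # [1, 0, 1, 1, 0]
```

Unlike softmax, each label is decided independently, so a post can receive several labels or none. The 0.5 cut-off is a default; a lower threshold trades precision for recall and could be tuned per label on held-out data.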

## Subtask 3 - Manifestation Identification
This is a multi-label classification task: classify how polarization is expressed. A post can carry several of the following labels: Vilification, Extreme Language, Stereotype, Invalidation, Lack of Empathy, and Dehumanization.



In [None]:
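# Paths follow the ./dev_phase/ layout used for the other subtasks
train = pd.read_csv('./dev_phase/subtask3/train/eng.csv')
# NOTE: `val` is read from the same file as `train`, so the validation score
# is measured on training data; hold out part of `train` for a real estimate.
val = pd.read_csv('./dev_phase/subtask3/train/eng.csv')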

train.head()

In [None]:
# Same dataset wrapper as in Subtask 2: each label is a float vector with one entry per manifestation
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length # Store max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, truncation=True, padding=False, max_length=self.max_length, return_tensors='pt')

        # Drop the batch dimension added by return_tensors='pt'
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        # Multi-label targets are float vectors (not long), as the binary cross-entropy loss expects
        item['labels'] = torch.tensor(label, dtype=torch.float)
        return item

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create the train and validation datasets for multi-label training
train_dataset = PolarizationDataset(train['text'].tolist(), train[['vilification','extreme_language','stereotype','invalidation','lack_of_empathy','dehumanization']].values.tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val[['vilification','extreme_language','stereotype','invalidation','lack_of_empathy','dehumanization']].values.tolist(), tokenizer)

In [None]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6, problem_type="multi_label_classification") # use 6 labels

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir=f"./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False
)

# Define metrics function for multi-label classification
def compute_metrics_multilabel(p):
    # Sigmoid the predictions to get probabilities
    probs = torch.sigmoid(torch.from_numpy(p.predictions))
    # Convert probabilities to predicted labels (0 or 1)
    preds = (probs > 0.5).int().numpy()
    # Compute macro F1 score
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics_multilabel,  # Use the new metrics function
    data_collator=DataCollatorWithPadding(tokenizer)
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set for Subtask 3: {eval_results['eval_f1_macro']}")
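
Once predictions are thresholded to 0/1 vectors, it helps to map them back to label names when inspecting results. The sketch below does that for the six Subtask 3 columns, in the order used to build the datasets. The `decode_labels` helper and the example vector are hypothetical.

```python
# Column order used when building the Subtask 3 datasets
MANIFESTATIONS = [
    "vilification",
    "extreme_language",
    "stereotype",
    "invalidation",
    "lack_of_empathy",
    "dehumanization",
]


def decode_labels(pred_row, names=MANIFESTATIONS):
    """Return the names of the labels flagged with 1 in a multi-hot prediction."""
    if len(pred_row) != len(names):
        raise ValueError("prediction length does not match the number of labels")
    return [name for name, flag in zip(names, pred_row) if flag]


# A made-up prediction vector for one post
print(decode_labels([1, 0, 0, 1, 0, 0]))  # ['vilification', 'invalidation']
```

Keeping the name list next to the column selection avoids silent mix-ups, since the model only sees positions and a reordered column list would shift every label.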