<a href="https://colab.research.google.com/github/CDL-RecSys/oeaw-ai-winter-school-2023/blob/main/Populism_Detection_%C3%96AW_AI_Winter_School_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Populism detection example

This is a quick guide on how to fine-tune a pretrained BERT-based transformer model on a specific task. In this case, we use a dataset that was pre-labelled with the populism dictionary by [Rooduijn and Pauwels](https://doi.org/10.1080/01402382.2011.616665) (2011). The [OLID](https://scholar.harvard.edu/malmasi/olid) dataset used here is usually a benchmark for offensive language detection, but was adapted for our use case. We will use a pretrained model from [huggingface.co](https://huggingface.co), add a linear layer for classification, and fine-tune it on our dataset for populism detection.
After training, you will be able to input your own populist sentences to test the model and observe the output.
For this, your group should quickly study the definition of populist key messages in the linked [Google Doc](https://docs.google.com/document/d/1BzGD3E_VVBpWkgqlG5G81NgKsb1oTkLjUeFq2rz_3J4/edit?usp=sharing) and produce a set of populist texts according to it.

The training part of this exercise is an adaptation of this helpful [tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/) on fine-tuning BERT models by Chris McCormick.

In [None]:
!pip install mathjax

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mathjax
  Downloading mathjax-0.1.2.tar.gz (1.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: mathjax
  Building wheel for mathjax (setup.py) ... done
  Created wheel for mathjax: filename=mathjax-0.1.2-py3-none-any.whl size=1938 sha256=45544f04b70637fd2b20cf2eb504c0f093870f7c1148dce578e3108b4f27c7a7
  Stored in directory: /root/.cache/pip/wheels/0d/e2/7e/898637e59aa1a25108f8827990690cd63cf4fa3ae0d27ba8db
Successfully built mathjax
Installing collected packages: mathjax
Successfully installed mathjax-0.1.2


## Setup

First we install the required packages and select the GPU as our device, so that we can use the computational resources of Google Colab.
This example uses PyTorch as the framework for model training.

In [None]:
#first we have to install the transformers library (PyTorch is preinstalled on Colab)
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.26.0


In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Choose a pretrained model

Here we will use the [BERT base model](https://huggingface.co/bert-base-uncased) (uncased), which was pretrained on a large corpus of English texts from various sources (see the model page for more information on how it was pretrained).

In [None]:
from transformers import BertForSequenceClassification
#load the bert model from huggingface
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

#freeze the pretrained BERT parameters, so that only the classification layer is trained
for param in model.bert.parameters():
  param.requires_grad = False

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Data Preparation

Before fine-tuning our model, we need to bring our data into the format the model expects. For that we use the model's tokenizer to split each text into single tokens and convert them to IDs, so that every text is represented as a sequence of IDs corresponding to the model's vocabulary.
Then the input data is converted into tensors and split into a training and a validation set.
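
To make the idea of IDs, padding and attention masks concrete, here is a minimal sketch in plain Python. It is not the BERT tokenizer: the vocabulary `vocab` and the helper `encode` are made up for illustration, the text is split on whitespace instead of into WordPiece tokens, and special tokens are left out for brevity.

```python
# Hypothetical toy vocabulary (the real bert-base-uncased vocabulary
# has 30,522 entries and is applied to WordPiece tokens).
vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "people": 3, "elite": 4, "against": 5}

def encode(text, max_length):
    """Map a text to a fixed-length list of IDs plus an attention mask."""
    # Simplification: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Unknown words fall back to the [UNK] ID; long texts are truncated.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens][:max_length]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(ids)
    padding = [0] * (max_length - len(ids))
    return ids + padding, attention_mask + padding

ids, mask = encode("The people against the corrupt elite", max_length=8)
print(ids)
print(mask)
```

Every text ends up with the same length, which is what allows us to stack all texts into a single tensor.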

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer
from torch.utils.data import TensorDataset, random_split

#import the dataset
url = 'https://owncloud.tuwien.ac.at/index.php/s/p0pa3c1Iu6EFilC/download'
train = pd.read_csv(url)
train = train[:6000]

#encode our binary populism labels generated by the dictionary
encoder = LabelEncoder()
train['Label'] = encoder.fit_transform(train['Label'])

#load the tokenizer that comes with the model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def transform_data(df):
    input_ids = []
    attention_masks = []
    labels = []
    
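    #BERT accepts at most 512 tokens per sequence
    max_length = 512

    for _, row in df.iterrows():
        #use the tokenizer to split each text into single tokens and
        #truncate long texts, leaving room for the two special tokens
        tokenized_text = tokenizer.tokenize(row["Text"])[:max_length - 2]

        #add the [CLS] and [SEP] special tokens that BERT expects
        tokenized_text = [tokenizer.cls_token] + tokenized_text + [tokenizer.sep_token]

        #convert tokens into indices corresponding to the vocabulary
        indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

        #create the attention mask (1 = real token, 0 = padding)
        attention_mask = [1] * len(indexed_tokens)

        #pad sequences with zeros to the maximum length
        padding = [0] * (max_length - len(indexed_tokens))
        indexed_tokens += padding
        attention_mask += padding

        input_ids.append(indexed_tokens)
        attention_masks.append(attention_mask)
        labels.append(row["Label"])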
    
    return input_ids, attention_masks, labels

train_ids, train_attention_masks, train_labels = transform_data(train)

#convert the data to pytorch tensors and send them to the GPU
train_input = torch.tensor(train_ids, dtype = torch.long).to(device)
train_masks = torch.tensor(train_attention_masks, dtype = torch.long).to(device)
train_labels = torch.tensor(train_labels, dtype = torch.long).to(device)

#split the data into 90% training and 10% validation data
tensor_set = TensorDataset(train_input, train_masks, train_labels)
train_size = int(0.9 * len(tensor_set))
val_size = len(tensor_set) - train_size
train_set, val_set =  random_split(tensor_set, [train_size, val_size])

print(f'The training set contains {train_size} samples')
print(f'The validation set contains {val_size} samples')

The training set contains 5400 samples
The validation set contains 600 samples


## Fine Tuning

Now we initialize a new linear layer for our model that is specific to our task of binary classification. Then we set the training hyperparameters according to the recommendations from the original [BERT paper](https://arxiv.org/pdf/1810.04805.pdf).

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
#AdamW is deprecated in transformers, so we use the PyTorch implementation
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
#define the new layer (768 is the hidden size of BERT base)
model.classifier = torch.nn.Linear(768, 2)

#move the model to the device (GPU in our case)
model = model.to(device)

#set batch size
batch_size = 32

#prepare pytorch data loaders with randomly sampled training data
train_loader = DataLoader(train_set, sampler = RandomSampler(train_set), batch_size = batch_size)
val_loader = DataLoader(val_set, sampler = SequentialSampler(val_set), batch_size = batch_size)

#set training hyperparameters
#AdamW optimizer is used for optimization; weight_decay = 0.0 matches the
#default of the transformers implementation
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8, weight_decay = 0.0)

#number of epochs used for training, which determines the training steps
epochs = 3
total_steps = len(train_loader) * epochs

#create the scheduler for the learning rate
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
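
The scheduler above lowers the learning rate linearly from its initial value to zero over all training steps. The following plain-Python sketch reproduces that rule so you can see the numbers. The function name `linear_schedule_lr` is made up for illustration; the step count assumes the 5400 training samples, batch size 32 and 3 epochs used in this notebook.

```python
import math

def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: rise linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from base_lr to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = math.ceil(5400 / 32)   # 169 batches per epoch
total_steps = batches_per_epoch * 3        # 3 epochs

# Learning rate at the start of each epoch and after the final step.
for step in (0, batches_per_epoch, 2 * batches_per_epoch, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

With zero warmup steps, as in our setup, training starts at the full learning rate of 2e-5 and every call to `scheduler.step()` reduces it slightly.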




Now the model is trained on our loaded data and evaluated on the validation set after every epoch. The performance criterion here is accuracy:

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$
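
As a small worked example of this formula, here is a sketch in plain Python with made-up labels and predictions. It also shows a subtlety of the training loop in this notebook, which averages the accuracy of each batch: if the last batch is smaller than the others, this average can differ slightly from the accuracy over all samples.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up data for illustration: 1 = populist, 0 = non-populist.
labels      = [1, 0, 0, 1, 0, 1]
predictions = [1, 0, 1, 1, 0, 0]

# Accuracy over all samples: 4 correct predictions out of 6.
overall = accuracy(predictions, labels)
print(overall)

# Accuracy per batch with a batch size of 4: the last batch has only 2 samples.
batch_size = 4
batch_accs = [
    accuracy(predictions[i:i + batch_size], labels[i:i + batch_size])
    for i in range(0, len(labels), batch_size)
]
mean_of_batches = sum(batch_accs) / len(batch_accs)
print(batch_accs, mean_of_batches)
```

The difference is usually small, but it is worth keeping in mind when comparing the reported validation accuracy with other results.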

In [None]:
import numpy as np
import time
import datetime

#helper function for the calculation of accuracy
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

#helper function for the tracking of training time
def format_time(elapsed):
    #round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    #format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
import random

#set a seed to reproduce the example
seed_val = 1337

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

#Training example from the mentioned tutorial

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_loader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time (formatted as h:mm:ss).
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_loader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns a different number of values depending on what arguments
        # are given and what flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.
        loss, logits = model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels, return_dict = False)

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_loader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in val_loader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels, return_dict = False)


        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(val_loader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(val_loader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    169.    Elapsed: 0:00:42.
  Batch    80  of    169.    Elapsed: 0:01:23.
  Batch   120  of    169.    Elapsed: 0:02:04.
  Batch   160  of    169.    Elapsed: 0:02:47.

  Average training loss: 0.35
  Training epoch took: 0:02:57

Running Validation...
  Accuracy: 0.97
  Validation Loss: 0.22
  Validation took: 0:00:19

Training...
  Batch    40  of    169.    Elapsed: 0:00:43.
  Batch    80  of    169.    Elapsed: 0:01:26.
  Batch   120  of    169.    Elapsed: 0:02:08.


KeyboardInterrupt: ignored
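
The training loop above clips the gradient norm to 1.0 before every optimizer step. The sketch below illustrates in plain Python what this operation does. The helper `clip_grad_norm` is written for illustration and works on nested lists instead of tensors, but it follows the same rule as `torch.nn.utils.clip_grad_norm_`: if the overall L2 norm of all gradients exceeds the limit, all gradients are scaled down by the same factor.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their joint L2 norm is at most max_norm."""
    # Joint norm over the gradients of all parameters.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Made-up gradients of two parameters with a joint norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)

# Small gradients are left unchanged.
small, _ = clip_grad_norm([[0.1, 0.2]], max_norm=1.0)
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update stays the same and only its size is limited.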

## Test your newly trained model

Now it is your turn to create some test data to try out the model.
Work together as a team to create 5 unique texts that you would consider populist according to the given definition. It suffices if one of the populist motifs (Anti-Elitism, People-Centrism, People-Sovereignty) can be found in your text. Each text should be between 10 and 30 words long.
Once you have come up with your sentences, enter them below one after another and they will be added to your custom test set, which will be used to evaluate the model. Note that this only showcases the functionality; in practice, a test set of 5 samples will not give you meaningful results.

(If you wish to input more than five sentences, just increase the range of the for-loop.)

In [None]:
test_texts = []
input_labels = []

for i in range(5):
  user_input = input('Please enter your populist text: ')
  test_texts.append(user_input)
  input_labels.append(1)

print(test_texts)

In [None]:
#create a data frame of your test data and prepare it for the model evaluation
test = pd.DataFrame(list(zip(test_texts, input_labels)), columns = ['Text', 'Label'])
#the labels are already integers (1 = populist), so they must not be re-encoded:
#re-fitting the encoder on a single class would map every label to 0
test_ids, test_attention_masks, test_labels = transform_data(test)

#convert the data to pytorch tensors and send them to the GPU
test_input = torch.tensor(test_ids, dtype = torch.long).to(device)
test_masks = torch.tensor(test_attention_masks, dtype = torch.long).to(device)
test_labels = torch.tensor(test_labels, dtype = torch.long).to(device)

#create the data loader for the test data
test_set = TensorDataset(test_input, test_masks, test_labels)
test_loader = DataLoader(test_set, sampler = SequentialSampler(test_set), batch_size = batch_size)

Now we evaluate our model by predicting a class label for every test sample and calculating our prediction accuracy.
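
The model outputs two raw scores ("logits") per text, one for each class, and the predicted label is the index of the larger one. If you also want to know how confident the model is, you can turn the logits into probabilities with the softmax function. Below is a minimal sketch in plain Python; the logits are made up and the helper names `softmax` and `predict` are chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the predicted class index and its probability."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]

# Made-up logits for one text: index 0 = non-populist, index 1 = populist.
example_logits = [-1.2, 0.8]
label, confidence = predict(example_logits)
print(label, round(confidence, 3))
```

Softmax does not change which class has the highest score, so the predicted label is the same as with `np.argmax` on the logits.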

In [None]:
#prediction on the test set
#set the model to evaluation mode
model.eval()

#store predictions and ground truth to calculate evaluation metrics
predictions , true_labels = [], []

#loop for the prediction on our test set
for batch in test_loader:
  #send batch to the device and unpack its content
  batch = tuple(t.to(device) for t in batch)
  b_input_ids, b_input_mask, b_labels = batch
  
  #calculate our prediction logits
  with torch.no_grad():
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  #send predictions back to the CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  preds = np.argmax(logits, axis=1).flatten()
  #add predictions to our lists
  for i in preds:
    predictions.append(i)
  for i in label_ids:
    true_labels.append(i)

Now let's take a look at our predictions and see which texts have been classified as populist.

In [None]:
pred_text = []
#change labels to text
for i in range(len(predictions)):
  if predictions[i] == 0:
    pred_text.append('non populist')
  else:
    pred_text.append('populist')

for i in range(len(test)):
  print(f"The model has predicted the text '{test_texts[i]}' as {pred_text[i]}")

Finally, we calculate our test accuracy.

In [None]:
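from sklearn.metrics import accuracy_score
#use the labels collected on the CPU (test_labels is a tensor on the GPU);
#accuracy_score expects the true labels first and the predictions second
print(f'The test accuracy of our classifier is: {accuracy_score(true_labels, predictions)}')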