# Day 3: Fine-Tuning a Model

## Classification Models

When training a classification model with a Transformer-based architecture, there are a number of steps to follow (and steps within steps). The first step is to train/test/validate and fine-tune the model using your labeled data; once you are happy with the performance, you train a final model; and, finally, you use that model to classify as much data as you want (sort of). Let's start...


### 1. Fine-Tuning a Model

We first want to train/test/validate and fine-tune a model using our labeled data. We will divide our labeled data into a training sample (used to train our model) and a testing sample (used to test the performance of the model). Additionally, I will show you how to use cross-validation, to increase confidence in your performance metrics. In this step we can also adjust different hyper-parameters that can ultimately help with performance.

In [None]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

# We will need to import a whole bunch of libraries:

# First, our Roberta Tokenizer and our Roberta Classifier:
# from transformers import RobertaTokenizer, RobertaForSequenceClassification
# If you wanted to use another model, such as XLM-RoBERTa, the code would look something like this:
# from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
# More generally, however, you can use the AutoTokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Some other libraries to adjust hyper-parameters:
from transformers import get_linear_schedule_with_warmup

# Torch: A Tensor library like NumPy, with strong GPU support
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler

# For performance statistics I like to use sklearn:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, classification_report, precision_score, recall_score

# And the rest to help:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import datetime
import random

def good_update_interval(total_iters, num_desired_updates):
    exact_interval = total_iters / num_desired_updates
    order_of_mag = len(str(total_iters)) - 1
    round_mag = order_of_mag - 1
    update_interval = int(round(exact_interval, -round_mag))
    if update_interval == 0:
        update_interval = 1
    return update_interval

def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))


Our labeled data consists of news articles coded into five categories: politics, sport, tech, entertainment, and business. For the purposes of this example, I have recoded all political articles as 1, taken a random sample of the rest, and recoded them as 0.

In [None]:
# Load your labeled data
data = pd.read_csv("https://raw.githubusercontent.com/svallejovera/iesp-uerj/main/politics_sample.csv")

# Check your data
print(data.head(5))

# Set the main variable to be used for classification:
var = 'Label_Politics' ## <- CHANGE THIS BY LOOKING AT YOUR LABELED DATASET!!!
data[var].value_counts()

                                                Text  Label Label_Cat  \
0  Budget to set scene for election\n \n Gordon B...      0  Politics   
1  Army chiefs in regiments decision\n \n Militar...      0  Politics   
2  Howard denies split over ID cards\n \n Michael...      0  Politics   
3  Observers to monitor UK election\n \n Minister...      0  Politics   
4  Kilroy names election seat target\n \n Ex-chat...      0  Politics   

   Label_Politics  
0               1  
1               1  
2               1  
3               1  
4               1  


Unnamed: 0_level_0,count
Label_Politics,Unnamed: 1_level_1
0,420
1,417


In [None]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

# Re index and change type/name:
data = data.sample(frac=1).reset_index(drop=True)
data["label"] = data[var].astype('category')
data["label"] = data["label"].cat.codes

# Check that it worked
data["label"].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
0,420
1,417


In [None]:
###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################

###########################################################################
### If using a different model, here is where you would change it...... ###
###########################################################################

# Tokenize all of the sentences and map the tokens to their word IDs.
# Record the length of each sequence (in terms of BERT tokens).

# Choose tokenizer: "roberta-base" (could be 'roberta-large')
tokenizer = AutoTokenizer.from_pretrained("roberta-base", do_lower_case=True)

# If you were using some other model, like XLM-RoBERTa, then the code would look
# something like this (but then everything else is pretty much the same):
# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish", do_lower_case=True)

input_ids = []
lengths = []
for x, row in data.iterrows():
    encoded_sent = tokenizer.encode(
                        row['Text'], ## <- CHANGE THIS TO NAME OF YOUR TEXT COLUMN
                        add_special_tokens = True,
                   )
    input_ids.append(encoded_sent)
    lengths.append(len(encoded_sent))

print('{:>10,} comments'.format(len(input_ids)))
print('   Min length: {:,} tokens'.format(min(lengths)))
print('   Max length: {:,} tokens'.format(max(lengths)))
print('Median length: {:,} tokens'.format(np.median(lengths)))


Token indices sequence length is longer than the specified maximum sequence length for this model (711 > 512). Running this sequence through the model will result in indexing errors


       837 comments
   Min length: 112 tokens
   Max length: 5,390 tokens
Median length: 487.0 tokens


In [None]:
#################################################################################
### You can change the max length of the input to match your computing power  ###
#################################################################################

###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################


# We will truncate the text input since RoBERTa (like BERT) can only handle 512 tokens at a time
# Also, the more tokens you have, the more memory your computer requires

max_len = 120 # Only for the sake of this example

num_truncated = np.sum(np.greater(lengths, max_len))
num_sentences = len(lengths)
prcnt = float(num_truncated) / float(num_sentences)
print('{:,} of {:,} sentences ({:.1%}) in the training set are longer than {:} tokens.'.format(num_truncated, num_sentences, prcnt, max_len))

836 of 837 sentences (99.9%) in the training set are longer than 120 tokens.
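
To make the effect of `max_len` concrete, here is a minimal pure-Python sketch of what truncation and padding do to a list of token IDs, and how the attention mask separates real tokens from padding. The helper `pad_or_truncate` and the toy IDs are hypothetical, and `pad_id=1` is an assumption that matches RoBERTa's padding token but may differ for other models. The real tokenizer also keeps the special start/end tokens when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                       # drop everything past max_len
    n_pad = max_len - len(kept)                      # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short text gets padded; a long text gets cut.
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_len = 120` and a median length of 487 tokens, most articles lose the bulk of their text, which is why this value is only for the sake of the example.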


In [None]:
###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################

# Decide on your max length
max_len = 120 ## <- CHANGE THIS (if you want)!!!

# Create tokenized data
labels = []
input_ids = []
attn_masks = []

for x, row in data.iterrows():
    encoded_dict = tokenizer.encode_plus(row['Text'], ## <- CHANGE THIS TO NAME OF YOUR TEXT COLUMN
                                              max_length=max_len,
                                              padding='max_length',
                                              truncation=True,
                                              return_tensors='pt')
    input_ids.append(encoded_dict['input_ids'])
    attn_masks.append(encoded_dict['attention_mask'])
    labels.append(row['label'])

# Convert into tensor matrix.
input_ids = torch.cat(input_ids, dim=0)
attn_masks = torch.cat(attn_masks, dim=0)

# Labels list to tensor.
labels = torch.tensor(labels)

# Create TensorDataset.
dataset = TensorDataset(input_ids, attn_masks, labels)

In [None]:
#################################################################################
### You can play with the parameters if you want...                           ###
#################################################################################

#########
# Specify key model parameters here:
model_name = "roberta-base" # <- The model you will choose. It has to match the tokenizer
lr = 2e-5 # <- Learning rate... usually between 2e-5 and 2e-6
epochs = 2 # <- Usually 2 to 5; more epochs increase the risk of overfitting
batch_size = 8 # <- Best if a power of 2 (8, 16, 32...). Larger batches train faster but need more GPU memory
#########
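
The learning rate set here is not held constant: the training loop further down passes it through `get_linear_schedule_with_warmup`. As a rough illustration of what that schedule does to `lr`, here is a small pure-Python sketch. The function `linear_warmup_multiplier` is a hypothetical stand-in written from my understanding of the schedule, not the library's code, and `total_steps = 110` is an assumed value.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 110  # hypothetical total number of training batches
for step in (0, 5, 10, 60, 110):
    factor = linear_warmup_multiplier(step, num_warmup_steps=10, num_training_steps=total_steps)
    print(step, base_lr * factor)
```

The practical consequence is that `lr` is the peak learning rate, reached only at the end of warmup.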

In [None]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

seed_val = 6
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
torch.cuda.empty_cache() #Clear GPU cache if necessary

training_stats = [] # Store training and validation loss, validation accuracy, and timings.
fold_stats = []

total_t0 = time.time() # Measure the total training time

To determine which hyperparameters to use and then to evaluate the performance of all Transformer models, we use cross-validation (CV). First, the data is divided into $k$ fixed subsets or chunks. In each fold (iteration of CV), one subset is held out from the training process and used to validate the model, while the remaining $k-1$ subsets are used to train it. This process is repeated $k$ times, holding out a different subset in each fold, and the performance metrics are averaged over all folds to approximate the true out-of-sample performance of the model. If no 'best' model is selected from the $k$ runs, there is no need to hold out a further chunk from the data for a true out-of-sample evaluation of performance. The intuition behind CV is to be sure that performance metrics are not affected by 'easy' or 'hard' breaks of the data (e.g., an easy test set might overestimate the accuracy of the model, while a hard test set might underestimate it).

For our example, we will run two folds because of time...
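
Before running the real thing, it may help to see the splitting logic on its own. The sketch below is a pure-Python approximation of what a shuffled k-fold split does; `kfold_indices` is a hypothetical helper, not the `KFold` implementation used in the next cell, and the exact indices it produces will differ from sklearn's.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_ids, test_ids) pairs for k shuffled, non-overlapping folds."""
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_ids = ids[start:start + size]            # held-out chunk for this fold
        train_ids = ids[:start] + ids[start + size:]  # everything else
        yield train_ids, test_ids
        start += size

folds = list(kfold_indices(n_samples=837, k=2))
for fold, (train_ids, test_ids) in enumerate(folds, start=1):
    print(f"fold {fold}: {len(train_ids)} train / {len(test_ids)} test")
```

Every document lands in exactly one test set, so across the $k$ folds each labeled example is used for evaluation once and for training $k-1$ times.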

In [None]:
###########################################################################
### YOU NEED TO CHANGE THE NUMBER OF K runs and of N LABELS!!!!!!!!!!!! ###
###########################################################################

# ======================================== #
#              CV Training                 #
# ======================================== #

# repeat 2 times

k_folds = 2 # <- CHANGE THE NUMBER OF FOLDS!!!
kfold = KFold(n_splits=k_folds, shuffle=True)

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H%M')

for fold, (train_ids, test_ids) in enumerate(kfold.split(dataset)):

    # Print
    print(f'FOLD {fold+1}')
    print('--------------------------------')

    # Sample elements randomly from a given list of ids, no replacement.
    train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids)
    test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids)

    # Define data loaders for training and testing data in this fold
    train_dataloader = torch.utils.data.DataLoader(
                      dataset,
                      batch_size=batch_size, sampler=train_subsampler)
    test_dataloader = torch.utils.data.DataLoader(
                      dataset,
                      batch_size=batch_size, sampler=test_subsampler)

    # Initiate model parameters for each fold
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # <- CHANGE THE NUMBER OF LABELS!!!

####################################################################
### NO NEED TO CHANGE ANYTHING ELSE UNTIL YOU GET THE HANG OF IT ###
####################################################################

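    # Use the GPU when one is available; fall back to the CPU (much slower) otherwise.
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    desc = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr = lr, eps = 1e-6)
    # The scheduler steps once per training batch, so count the batches in this fold's training loader.
    total_steps = len(train_dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 10, num_training_steps = total_steps)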

    # Run the training loop for defined number of epochs
    for epoch_i in range(0, epochs):
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
        print('Training...')
        t0 = time.time()
        total_train_loss = 0 # Reset the total loss for this epoch.
        model.train() # Put the model into training mode.
        update_interval = good_update_interval( # Pick an interval on which to print progress updates.
                    total_iters = len(train_dataloader),
                    num_desired_updates = 10
                )

        predictions_t, true_labels_t = [], []
        for step, batch in enumerate(train_dataloader):
            if (step % update_interval) == 0 and not step == 0:
                elapsed = format_time(time.time() - t0)
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed), end='\r')
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            # Always clear any previously calculated gradients before performing a backward pass.
            model.zero_grad()
            # Perform a single forward pass; it returns the loss and the "logits".
            # (Calling the model twice would redo the work and, with dropout active,
            # give logits that differ from the ones behind the loss.)
            outputs = model(b_input_ids,
                            attention_mask=b_input_mask,
                            labels=b_labels)
            loss, logits = outputs[0], outputs[1]

            # Accumulate the training loss over all of the batches so that we can calculate the average loss at the end.
            total_train_loss += loss.item()
            # Perform a backward pass to calculate the gradients.
            loss.backward()
            # Clip the norm of the gradients to 1.0. This is to help prevent the "exploding gradients" problem.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # Update parameters and take a step using the computed gradient.
            optimizer.step()
            # Update the learning rate.
            scheduler.step()

            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            # Store predictions and true labels
            predictions_t.append(logits)
            true_labels_t.append(label_ids)

        # Combine the results across all batches.
        flat_predictions_t = np.concatenate(predictions_t, axis=0)
        flat_true_labels_t = np.concatenate(true_labels_t, axis=0)
        # For each sample, pick the label (0, 1) with the highest score.
        predicted_labels_t = np.argmax(flat_predictions_t, axis=1).flatten()
        acc_t = accuracy_score(predicted_labels_t, flat_true_labels_t)

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)

        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)

        print("")
        print("  Average training loss: {0:.3f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))
        print("  Training accuracy: {:.3f}".format(acc_t))

        if acc_t > 0.9 and epoch_i >= 3:
            break

        # TEST
        # After the completion of each training epoch, measure our performance on our test set.

        print("")
        print("Running test...")
        t0 = time.time()
        model.eval() # Put the model in evaluation mode--the dropout layers behave differently during evaluation.
        total_eval_loss = 0
        predictions, true_labels = [], []
        # Evaluate data for one epoch
        for batch in test_dataloader:
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            with torch.no_grad():
                # Forward pass: calculate the loss and the logit predictions in one call.
                outputs = model(b_input_ids,
                                attention_mask=b_input_mask,
                                labels=b_labels)
                loss, logits = outputs[0], outputs[1]
            # Accumulate the test loss.
            total_eval_loss += loss.item()
            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            # Store predictions and true labels
            predictions.append(logits)
            true_labels.append(label_ids)

        # Combine the results across all batches.
        flat_predictions = np.concatenate(predictions, axis=0)
        flat_true_labels = np.concatenate(true_labels, axis=0)
        # For each sample, pick the label (0, 1) with the highest score.
        predicted_labels = np.argmax(flat_predictions, axis=1).flatten()
        # Calculate the test accuracy.
        val_accuracy = (predicted_labels == flat_true_labels).mean()

        # Calculate the average loss over all of the batches.
        avg_val_loss = total_eval_loss / len(test_dataloader)

        # sklearn metrics expect (y_true, y_pred); swapping them exchanges precision and recall.
        ov_acc = [accuracy_score(flat_true_labels, predicted_labels),
                  recall_score(flat_true_labels, predicted_labels, average="macro"),
                  precision_score(flat_true_labels, predicted_labels, average="macro"),
                  f1_score(flat_true_labels, predicted_labels, average="macro")]
        f1 = list(f1_score(flat_true_labels,predicted_labels,average=None))
        f1_temp = f1_score(flat_true_labels,predicted_labels,average="weighted")
        matrix = confusion_matrix(flat_true_labels,predicted_labels)
        acc = list(matrix.diagonal()/matrix.sum(axis=1))
        cr = pd.DataFrame(classification_report(pd.Series(flat_true_labels),pd.Series(predicted_labels), output_dict=True)).transpose().iloc[0:3, 0:2]
        prec =list(cr.iloc[:,0])
        rec = list(cr.iloc[:,1])

        # Report the final accuracy for this test run.
        print('RoBERTa Prediction accuracy: {:.3f}'.format(val_accuracy))
        print('RoBERTa F1 accuracy: {:.3f}'.format(f1_temp))

        # It is good practice to look at your confusion matrix
        print(matrix)

        # Measure how long the test run took.
        test_time = format_time(time.time() - t0)
        print("  Test Loss: {0:.3f}".format(avg_val_loss))
        print("  Test took: {:}".format(test_time))

    fold_stats.append(
        {
            'fold': fold+1,
            'Training Loss': avg_train_loss,
            'Test Loss': avg_val_loss,
            'Test Accur.': ov_acc[0],
            'f1': [f1, ov_acc[3]],
            'prec': [prec, ov_acc[2]],
            'rec': [rec, ov_acc[1]]
        }
    )


FOLD 1
--------------------------------


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 0.476
  Training epoch took: 0:00:14
  Training accuracy: 0.794

Running test...
RoBERTa Prediction accuracy: 0.971
RoBERTa F1 accuracy: 0.971
[[188  11]
 [  1 219]]
  Test Loss: 0.077
  Test took: 0:00:05

Training...

  Average training loss: 0.055
  Training epoch took: 0:00:13
  Training accuracy: 0.988

Running test...


RoBERTa Prediction accuracy: 0.988
RoBERTa F1 accuracy: 0.988
[[195   4]
 [  1 219]]
  Test Loss: 0.060
  Test took: 0:00:06
FOLD 2
--------------------------------

Training...

  Average training loss: 0.500
  Training epoch took: 0:00:13
  Training accuracy: 0.749

Running test...
RoBERTa Prediction accuracy: 0.990
RoBERTa F1 accuracy: 0.990
[[219   2]
 [  2 195]]
  Test Loss: 0.034
  Test took: 0:00:06

Training...

  Average training loss: 0.046
  Training epoch took: 0:00:13
  Training accuracy: 0.986

Running test...
RoBERTa Prediction accuracy: 0.988
RoBERTa F1 accuracy: 0.988
[[220   1]
 [  4 193]]
  Test Loss: 0.051
  Test took: 0:00:06




* **Validation/Test Loss ≫ Training Loss → Overfitting**
  The model has memorized the training data but fails to generalize.

* **Validation/Test Loss > Training Loss → Some Overfitting**
  This is expected; usually indicates the model is learning well.

* **Validation/Test Loss < Training Loss → Some Underfitting**
  The model isn’t fitting the training data enough, often a sign that it’s too simple.

* **Validation/Test Loss ≪ Training Loss → Underfitting**
  The model is too weak to capture the patterns in the data.

**Goal**: Minimize **validation/test loss**.
A *little* overfitting is usually fine: it means the model is fitting the training data well while still generalizing.
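
As a quick check of these rules against our own run, the sketch below computes the gap between test and training loss for each epoch, using the numbers printed in the training log above. The helper `loss_gap` is hypothetical. One caveat when reading the result: the training loss is averaged over the whole epoch (including the first batches, when the classifier head was still random) with dropout active, while the test loss is measured once at the end of the epoch, so a test loss well below the training loss in epoch 1 does not by itself signal underfitting.

```python
# (train_loss, test_loss) per epoch, copied from the training log above.
history = {
    "fold 1": [(0.476, 0.077), (0.055, 0.060)],
    "fold 2": [(0.500, 0.034), (0.046, 0.051)],
}

def loss_gap(train_loss, test_loss):
    """Positive gap: test loss above training loss. Negative gap: below."""
    return round(test_loss - train_loss, 3)

gaps = {
    fold: [loss_gap(train, test) for train, test in per_epoch]
    for fold, per_epoch in history.items()
}
for fold, values in gaps.items():
    print(fold, values)
```

By the second epoch the test loss sits just above the training loss in both folds, which is the "some overfitting, learning well" case from the list.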


We can now get some performance statistics to see how we did. You can check the meaning of each statistic, how to interpret it, and when to use it, in the scholarly journal [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall), or check out [this paper](https://d1wqtxts1xzle7.cloudfront.net/37219940/5215ijdkp01-libre.pdf?1428316763=&response-content-disposition=inline%3B+filename%3DA_REVIEW_ON_EVALUATION_METRICS_FOR_DATA.pdf&Expires=1709137264&Signature=f3EFHnlTZXa38ug6~VBumSZrfe9ECAyMUh04CNTzYnXEsVaJS3T12eNPbu7iNP~z3DSTTJ2NAV845v50XBe8Sjm7AylacfjGxcQ8YqaDsMulhkCV8c-JtTrWaLILlSUzbQp9M5Md3ubChx5Y9xkBp~s~XlecEEu9B5QEOjyr2aiZRA6gz98crSv0VKKV2ow986UxoSaWZgaYPmTsTrWU2EN3-0S1~OyO9tf2eFqbb3jUwOl15vX1rzzoG9lcpqbURB0eGMqPlXoWPHYBAlGmvUJOGxfkz15VpCxYtg-RoL5IYJONHlkV8GDWXntOm4WdY-ZIcgF3f3c7XhpDzgzvGw__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA).
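
To see where these numbers come from, here is a small pure-Python sketch that derives accuracy, precision, recall, and F1 by hand from the confusion matrix printed for the last epoch of fold 1. The helper `per_class_metrics` is hypothetical; it assumes the sklearn layout, where rows are true labels and columns are predicted labels.

```python
# Fold 1, final epoch. Rows are true labels, columns are predicted labels.
matrix = [[195, 4],
          [1, 219]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                  # true cls, predicted as something else
    fp = sum(row[cls] for row in matrix) - tp   # predicted cls, actually something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))
print(f"accuracy: {accuracy:.4f}")
for cls in range(len(matrix)):
    p, r, f = per_class_metrics(matrix, cls)
    print(f"class {cls}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Precision asks "of everything I labeled as this class, how much was right?", while recall asks "of everything that truly is this class, how much did I find?".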

In [None]:
fold_stats

[{'fold': 1,
  'Training Loss': 0.0548539227061413,
  'Test Loss': 0.05998427368028371,
  'Test Accur.': 0.9880668257756563,
  'f1': [[np.float64(0.9873417721518988), np.float64(0.9887133182844243)],
   0.9880275452181615],
  'prec': [[0.9948979591836735, 0.9820627802690582, 0.9880668257756563],
   0.9876770214709913],
  'rec': [[0.9798994974874372, 0.9954545454545455, 0.9880668257756563],
   0.9884803697263659]},
 {'fold': 2,
  'Training Loss': 0.045914599116241454,
  'Test Loss': 0.05121823352814283,
  'Test Accur.': 0.9880382775119617,
  'f1': [[np.float64(0.9887640449438202), np.float64(0.9872122762148338)],
   0.987988160579327],
  'prec': [[0.9821428571428571, 0.9948453608247423, 0.9880382775119617],
   0.9875852722971266],
  'rec': [[0.995475113122172, 0.9796954314720813, 0.9880382775119617],
   0.9884941089837997]}]

In [None]:
politics_stats = []
politics_stats.append(
    {
        'Model': model_name,
        'lr': lr,
        'epochs': epochs,
        'batch_size': batch_size,

        '0_f1': np.mean([x['f1'][0][0] for x in fold_stats ]),
        '1_f1': np.mean([x['f1'][0][1] for x in fold_stats ]),

        'overall_mean_f1': np.mean([x['f1'][1] for x in fold_stats ]),
        'overall_mean_f1_sd': np.std([x['f1'][1] for x in fold_stats ]),
    }
)

In [None]:
print(fold_stats)

[{'fold': 1, 'Training Loss': 0.0548539227061413, 'Test Loss': 0.05998427368028371, 'Test Accur.': 0.9880668257756563, 'f1': [[np.float64(0.9873417721518988), np.float64(0.9887133182844243)], 0.9880275452181615], 'prec': [[0.9948979591836735, 0.9820627802690582, 0.9880668257756563], 0.9876770214709913], 'rec': [[0.9798994974874372, 0.9954545454545455, 0.9880668257756563], 0.9884803697263659]}, {'fold': 2, 'Training Loss': 0.045914599116241454, 'Test Loss': 0.05121823352814283, 'Test Accur.': 0.9880382775119617, 'f1': [[np.float64(0.9887640449438202), np.float64(0.9872122762148338)], 0.987988160579327], 'prec': [[0.9821428571428571, 0.9948453608247423, 0.9880382775119617], 0.9875852722971266], 'rec': [[0.995475113122172, 0.9796954314720813, 0.9880382775119617], 0.9884941089837997]}]


In [None]:
print(politics_stats)

[{'Model': 'roberta-base', 'lr': 2e-05, 'epochs': 2, 'batch_size': 8, '0_f1': np.float64(0.9880529085478595), '1_f1': np.float64(0.987962797249629), 'overall_mean_f1': np.float64(0.9880078528987443), 'overall_mean_f1_sd': np.float64(1.9692319417230486e-05)}]
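
A note on the spread reported here: `np.std` divides by the number of folds by default (the population standard deviation). With only two folds that choice matters, as this small sketch using Python's `statistics` module shows. The F1 values are copied from `fold_stats`; the variable names are mine.

```python
import statistics

# Macro F1 for each fold, copied from fold_stats above.
fold_f1 = [0.9880275452181615, 0.987988160579327]

mean_f1 = statistics.fmean(fold_f1)
sd_population = statistics.pstdev(fold_f1)  # divides by k, like np.std's default
sd_sample = statistics.stdev(fold_f1)       # divides by k - 1

print(f"mean F1: {mean_f1:.6f}")
print(f"population SD: {sd_population:.2e}")
print(f"sample SD: {sd_sample:.2e}")
```

Either way the spread is tiny here, but with so few folds it is a rough indication at best; more folds give a more trustworthy estimate.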


### 2. Final Model

AMAZEBALLZ!! OK, now that we know that our model is the shit, we can use ALL the data to train a final model that we can then use to classify whatever (English news) corpus we want.

In [None]:
###########################################################################
### YOU NEED TO CHANGE THE NUMBER OF N LABELS!!!!!!!!!!!!               ###
###########################################################################


# ======================================== #
#               Training                   #
# ======================================== #

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H%M')

# Define a single data loader over ALL the labeled data (no held-out fold this time)
train_dataloader = torch.utils.data.DataLoader(
                       dataset,
                       batch_size=batch_size)

# Initiate a fresh model for the final training run
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # <- CHANGE THE NUMBER OF LABELS!!!
# Use the GPU when one is available; fall back to the CPU (much slower) otherwise.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
desc = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr = lr, eps = 1e-8)
total_steps = len(train_dataloader) * epochs  # the scheduler steps once per batch
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 3, num_training_steps = total_steps)

# Run the training loop for defined number of epochs
for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    t0 = time.time()
    total_train_loss = 0 # Reset the total loss for this epoch.
    model.train() # Put the model into training mode.
    update_interval = good_update_interval( # Pick an interval on which to print progress updates.
                    total_iters = len(train_dataloader),
                    num_desired_updates = 10)

    predictions_t, true_labels_t = [], []
    for step, batch in enumerate(train_dataloader):
        if (step % update_interval) == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed), end='\r')
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # Always clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        # Perform a single forward pass; it returns the loss and the "logits"
        outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss, logits = outputs[0], outputs[1]

        # Accumulate the training loss over all of the batches so that we can calculate the average loss at the end.
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0. This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step using the computed gradient.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()

        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Store predictions and true labels
        predictions_t.append(logits)
        true_labels_t.append(label_ids)

    # Combine the results across all batches.
    flat_predictions_t = np.concatenate(predictions_t, axis=0)
    flat_true_labels_t = np.concatenate(true_labels_t, axis=0)
    # For each sample, pick the label (0, 1) with the highest score.
    predicted_labels_t = np.argmax(flat_predictions_t, axis=1).flatten()
    acc_t = accuracy_score(predicted_labels_t, flat_true_labels_t)

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.3f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    print("  Training accuracy: {:.3f}".format(acc_t))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 0.220
  Training epoch took: 0:00:28
  Training accuracy: 0.915

Training...

  Average training loss: 0.031
  Training epoch took: 0:00:28
  Training accuracy: 0.993


In [None]:
# Save it in whatever FOLDER you want... (it will be heavy though)
# model.save_pretrained('/my_models/political_news/')

### 3. Using the Model

To use the model, we just need to load it (we already loaded it, but I leave the code for when this is not the case), and CLASSIFY!!

In [None]:
# If this was a new session, then you would need to load the tokenizer again...
# tokenizer = AutoTokenizer.from_pretrained("roberta-base", do_lower_case=True)

# If you saved your model somewhere, this is how you would retrieve it...
# my_model = AutoModelForSequenceClassification.from_pretrained('/my_models/political_news/', num_labels=2) ## <- CHANGE THE NUMBER OF LABELS
device = torch.device("cpu")
my_model = model.to(device)

# Now we put our model in evaluation mode:
my_model.eval()

label_dict = {0: "Not Politics", 1: "Politics"}

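def predict(sentence):
    # Truncate to the model's 512-token limit so long texts do not crash the model.
    sentence = tokenizer.encode_plus(sentence, truncation=True, max_length=512, return_tensors='pt')
    # No gradients are needed for inference; skipping them saves memory and time.
    with torch.no_grad():
        outputs = my_model(sentence["input_ids"], attention_mask=sentence["attention_mask"])
    outputs = outputs[0].detach().numpy()
    predicted_label = np.argmax(outputs)
    label = label_dict[predicted_label]
    return predicted_label, label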

predict("Voters 'reject EU by two to one' British voters would reject the European constitution by two to one, according to a poll posing the question the government will put to the country.")


(np.int64(1), 'Politics')

In [None]:
predict("Footy Headlines can leak the Real Madrid 24-25 home kit, as worn by Bellingham (mocked-up picture). It has a no-nonsense design in white and black, plus a subtle Houndstooth pattern.")


(np.int64(0), 'Not Politics')
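
The `predict` function only returns the label with the highest score. If you also want a rough sense of how confident the model is, you can turn the raw scores (logits) into probabilities with a softmax. Here is a minimal, dependency-free sketch; the logit values are made up for illustration and are not outputs of the model above.

```python
import math

def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

label_dict = {0: "Not Politics", 1: "Politics"}

# Hypothetical logits for one sentence (illustrative values only)
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
predicted_label = probs.index(max(probs))

print(label_dict[predicted_label], [round(p, 3) for p in probs])
```

With the real model, the equivalent would be to apply `torch.softmax(..., dim=-1)` to `outputs[0]` inside `predict`.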

And here is some code to loop everything neatly so you can pass your target corpus through your fine-tuned model:

In [None]:
from tqdm import tqdm

# Load data
df = pd.read_parquet("hf://datasets/Themira/en_si_news_classification_with_label_name/data/train_en-00000-of-00001.parquet")
df_sample = df.iloc[:500]
print(df_sample.head(10))

                                            sentence          label
0  Russian Skating Star Is 'Lighthearted' At Prac...         Sports
1  Pete Buttigieg Rejects Notion That Black Voter...      Political
2  Jay Z Is Making A Movie And Docuseries Based O...  Entertainment
3  See All The Looks From The 2018 Golden Globes ...  Entertainment
4  Jimmy Fallon Calls Out Mystery Object Coming F...  Entertainment
5  Samantha Bee Gets Candid About Dealing With Tw...  Entertainment
6  Exes Bella Hadid And The Weeknd Spotted Kissin...  Entertainment
7  LeBron James Says Orlando Shooting Puts Import...         Sports
8  House Panel Calls New Postmaster General To Ex...      Political
9  The Humble Honeybee Honeybees are incomparable...        Science


In [None]:
# Run corpus through model:
predicted_sentences = []

for i in tqdm(df_sample['sentence']):
    pred_temp = predict(i)
    predicted_sentences.append(pred_temp)

df_sample['predicted_sentences'] = predicted_sentences
print(df_sample.head(10))
# df_sample.to_csv(r'pred_df_sample.csv')

100%|██████████| 500/500 [01:22<00:00,  6.06it/s]

                                            sentence          label  \
0  Russian Skating Star Is 'Lighthearted' At Prac...         Sports   
1  Pete Buttigieg Rejects Notion That Black Voter...      Political   
2  Jay Z Is Making A Movie And Docuseries Based O...  Entertainment   
3  See All The Looks From The 2018 Golden Globes ...  Entertainment   
4  Jimmy Fallon Calls Out Mystery Object Coming F...  Entertainment   
5  Samantha Bee Gets Candid About Dealing With Tw...  Entertainment   
6  Exes Bella Hadid And The Weeknd Spotted Kissin...  Entertainment   
7  LeBron James Says Orlando Shooting Puts Import...         Sports   
8  House Panel Calls New Postmaster General To Ex...      Political   
9  The Humble Honeybee Honeybees are incomparable...        Science   

  predicted_sentences  
0   (0, Not Politics)  
1   (0, Not Politics)  
2   (0, Not Politics)  
3   (0, Not Politics)  
4   (0, Not Politics)  
5   (0, Not Politics)  
6   (0, Not Politics)  
7   (0, Not Politics)  
8 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sample['predicted_sentences'] = predicted_sentences
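
The warning above shows up because `df_sample` was created as a slice of `df` (`df.iloc[:500]`), and pandas cannot tell whether we meant to modify the slice or the original. The column is still added, so the results are fine, but one common way to avoid the warning is to make an explicit copy. A small sketch with toy data (the DataFrame below is a hypothetical stand-in, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the corpus (hypothetical data)
df = pd.DataFrame({
    "sentence": ["a", "b", "c", "d"],
    "label": ["Sports", "Political", "Science", "Political"],
})

# .copy() makes df_sample an independent DataFrame rather than a slice of df
df_sample = df.iloc[:2].copy()

# Adding a column now clearly modifies df_sample only
df_sample["predicted_sentences"] = [(0, "Not Politics"), (1, "Politics")]

print(df_sample)
print(list(df.columns))  # the original df keeps its two columns
```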


Isn't that neat!! Now you can get a larger corpus, run it through the prediction model, and get your results.
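
For a larger corpus, classifying one sentence at a time gets slow. A common speed-up is to send the sentences through the model in batches. The sketch below only shows the chunking part, in plain Python; `chunk_list` and the toy sentences are hypothetical names of my own, not part of the code above.

```python
def chunk_list(items, batch_size):
    """Yield consecutive chunks of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical mini-corpus standing in for df_sample['sentence'].tolist()
sentences = [f"sentence {i}" for i in range(10)]

for batch in chunk_list(sentences, batch_size=4):
    # In practice, each batch would go through the tokenizer (with padding and
    # truncation enabled) and then through the model in a single forward pass.
    print(len(batch), batch[0])
```

Each chunk could then be tokenized in one call with `padding=True` and `truncation=True`, so the model sees a whole batch at once instead of a single sentence.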

Is this the last step? **No**. You still need to do one more validation: out-of-sample validation. You want to make sure that the sample you used as a training set was representative of your corpus, and that your model can properly classify text beyond the training set. To do this, you draw a random sample from your corpus (*unseen* by the algorithm), label it, classify it (using your model), and check performance. Here, you might also want to look at the confusion matrix to better understand how the model is behaving (e.g., what type of error is more common, what categories it is over/under-predicting, etc.).
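
Drawing that random sample could look something like the sketch below. The corpus and the `hand_label` column are hypothetical; the point is that `sample` with a fixed `random_state` gives a reproducible random draw, whereas `df.iloc[:500]` simply takes the first 500 rows.

```python
import pandas as pd

# Hypothetical unlabeled corpus (toy data standing in for your real corpus)
corpus = pd.DataFrame({"sentence": [f"headline {i}" for i in range(1000)]})

# Draw 100 rows at random; random_state makes the draw reproducible
validation_sample = corpus.sample(n=100, random_state=42)

# Add an empty column to fill in by hand, then export for labeling
validation_sample = validation_sample.assign(hand_label="")
print(validation_sample.head())
# validation_sample.to_csv("validation_sample_to_label.csv", index=False)
```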

In [None]:
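from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# true labels: 1 if the article is labeled "Political", 0 otherwise
y_true_num = np.where(df_sample['label']=="Political", 1, 0)

# predicted labels: first element of each (label_id, label_name) tuple
y_pred = df_sample['predicted_sentences'].apply(lambda x: x[0])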

# confusion matrix
cm = confusion_matrix(y_true_num, y_pred)
print(cm)

# measures
accuracy = accuracy_score(y_true_num, y_pred)
precision = precision_score(y_true_num, y_pred)
recall = recall_score(y_true_num, y_pred)

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)

[[393   3]
 [ 99   5]]
Accuracy:  0.796
Precision: 0.625
Recall:    0.04807692307692308
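
To see where these numbers come from, we can recompute them by hand from the confusion matrix. In sklearn's layout, rows are the true labels and columns are the predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. The F1 score at the end is an extra I added for illustration; it is not computed in the cell above.

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions
cm = [[393, 3],
      [99, 5]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all cases classified correctly
precision = tp / (tp + fp)                   # of the predicted "Politics", how many were right
recall = tp / (tp + fn)                      # of the true "Politics", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Accuracy looks decent only because most headlines are not political. The recall tells the real story: the model found just 5 of the 104 political headlines in this sample.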
