<a href="https://colab.research.google.com/github/Gracej12/PPOL6801_econnews/blob/main/Classifying_econ_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 9: Transformers

# 2. Classification Models

When looking to train a classification models using a Transformers-based architecture, there are number of steps to follow (and steps within steps). The first step is to train/test/validate and fine-tune the model using your labeled data; once you are happy with the performance, we are ready to train a final model; and, finally, we use that model to classify as much data as we want (sort of). Let's start...


## 2.1. Fine-Tuning a Model

We first want to train/test/validate and fine-tune a model using our labeled data. We will divide our labeled data into a training sample (used to train our model) and a testing sample (used to test the performance of the model). Additionally, I will show you how to use cross-validation, to increase confidence in your performance metrics. In this step we can also adjust different hyper-parameters that can ultimately help with performance.

In [1]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

# We will need to import a whole bunch of libraries:

# First, our Roberta Tokenizer and our Roberta Classifier:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
# If you wanted to use another model, such as XLM-RoBERTa, the code would look something like this:
# from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification

# Some other libraries to adjust hyper-parameters:
from transformers import get_linear_schedule_with_warmup, AdamW

# Torch: A Tensor library like NumPy, with strong GPU support
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler, SequentialSampler

# For performance statistics I like to use sklearn:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, classification_report, precision_score, recall_score

# And the rest to help:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import datetime
import random

def good_update_interval(total_iters, num_desired_updates):
    exact_interval = total_iters / num_desired_updates
    order_of_mag = len(str(total_iters)) - 1
    round_mag = order_of_mag - 1
    update_interval = int(round(exact_interval, -round_mag))
    if update_interval == 0:
        update_interval = 1
    return update_interval

def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))


Our labeled data is of news articles coded into five categories of documents. The five categories are politics, sport, tech, entertainment and business. For the purposes of this example I have recoded all political articles as 1 and taken a random sample of the rest, an recoded them as 0.

In [2]:
####################################################################
### YOU NEED TO IMPORT HERE YOUR LABELED DATA. CHANGE THIS!!!    ###
####################################################################

# Load your labeled data
data = pd.read_csv("https://raw.githubusercontent.com/Gracej12/PPOL6801_econnews/main/data/labeled_articles.csv")

# Check your data
len(data)

# Set the main variable to be used for classification:
var = 'label' ## <- CHANGE THIS BY LOOKING AT YOUR LABELED DATASET!!!
data[var].value_counts()

label
inflation           81
national debt       81
GDP                 78
housing costs       65
stock market        55
employment wages    54
Name: count, dtype: int64

In [3]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

# Re index and change type/name:
data = data.sample(frac=1).reset_index(drop=True)
data["label"] = data[var].astype('category')
data["label"] = data["label"].cat.codes

# Check that it worked
data["label"].value_counts()

label
3    81
4    81
0    78
2    65
5    55
1    54
Name: count, dtype: int64

In [4]:
###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################

###########################################################################
### If using a different model, here is where you would change it...... ###
###########################################################################

# Tokenize all of the sentences and map the tokens to their word IDs.
# Record the length of each sequence (in terms of BERT tokens).

# Choose tokenizer: "roberta-base" (could be 'roberta-large')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", do_lower_case=True)

# If you were using some other model, like XLM-RoBERTa, then the code would look
# something like this (but then everything else is pretty much the same):
# tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll02-spanish", do_lower_case=True)

input_ids = []
lengths = []
for x, row in data.iterrows():
    encoded_sent = tokenizer.encode(
                        row['title_extracted_content'], ## <- CHANGE THIS TO NAME OF YOUR TEXT COLUMN
                        add_special_tokens = True,
                   )
    input_ids.append(encoded_sent)
    lengths.append(len(encoded_sent))

print('{:>10,} comments'.format(len(input_ids)))
print('   Min length: {:,} tokens'.format(min(lengths)))
print('   Max length: {:,} tokens'.format(max(lengths)))
print('Median length: {:,} tokens'.format(np.median(lengths)))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (993 > 512). Running this sequence through the model will result in indexing errors


       414 comments
   Min length: 11 tokens
   Max length: 12,092 tokens
Median length: 608.5 tokens


In [5]:
#################################################################################
### You can change the max length of the input to satisfy you computing power ###
#################################################################################

###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################


# We will trunctate the text input since BERT can only handle 512 tokens at a time
# Also, the more tokens you have, the mode memory your computer requires

max_len = 200 # Only for the sake of this example

num_truncated = np.sum(np.greater(lengths, max_len))
num_sentences = len(lengths)
prcnt = float(num_truncated) / float(num_sentences)
print('{:,} of {:,} sentences ({:.1%}) in the training set are longer than {:} tokens.'.format(num_truncated, num_sentences, prcnt, max_len))

328 of 414 sentences (79.2%) in the training set are longer than 200 tokens.


In [6]:
###########################################################################
### YOU NEED TO CHANGE THIS TO THE NAME OF YOUR COLUMN WITH THE TEXT!!! ###
###########################################################################

# Create tokenized data
labels = []
input_ids = []
attn_masks = []

for x, row in data.iterrows():
    encoded_dict = tokenizer.encode_plus(row['title_extracted_content'], ## <- CHANGE THIS TO NAME OF YOUR TEXT COLUMN
                                              max_length=max_len,
                                              padding='max_length',
                                              truncation=True,
                                              return_tensors='pt')
    input_ids.append(encoded_dict['input_ids'])
    attn_masks.append(encoded_dict['attention_mask'])
    labels.append(row['label'])

# Convert into tensor matrix.
input_ids = torch.cat(input_ids, dim=0)
attn_masks = torch.cat(attn_masks, dim=0)

# Labels list to tensor.
labels = torch.tensor(labels)

# Create TensorDataset.
dataset = TensorDataset(input_ids, attn_masks, labels)

In [7]:
#################################################################################
### You can play with the parameters if you want...                           ###
#################################################################################

#########
# Specify key model parameters here:
model_name = "roberta-base" # <- The model you will choose. It has to match the tokenizer
lr = 2e-5 # <- Learning rate... usually between 2e-5 and 2e-6
epochs = 3 # <- No more than 5 or you will start overfitting
batch_size = 4 # <- Best if multiple of 2^x... The more the better but also the more GPU
#########

In [8]:
####################################################################
### NO NEED TO CHANGE ANYTHING HERE UNTIL YOU GET THE HANG OF IT ###
####################################################################

seed_val = 6
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
torch.cuda.empty_cache() #Clear GPU cache if necessary

training_stats = [] # Store training and validation loss, validation accuracy, and timings.
fold_stats = []

total_t0 = time.time() # Measure the total training time

To determine which hyperparameters to use and then to evaluate the performance of all Transformer models, we use cross-validation (CV). First, the data is divided into $k$ fixed subsets or chunks. In each fold (iteration of CV), one subset is held out from the training process and used to validate the model, while the rest of the chunks, the $k-1$ subsets, are then used to train the model. This process is repeated $k$ times using a different set of test data in each of the folds, and the performance metrics are averaged over all trials to get the approximate true out-of-sample accuracy of the model. If no `best' model is selected from the $k$ runs, there is no need to hold out a further chunk from the data for a true out-of-sample evaluation of performance. The intuition behind CV is to be sure that performance metrics are not affected by 'easy' or 'hard' breaks of the data (e.g., an easy test set might overestimate the accuracy of the model, while a hard test set might underestimate it).

For our example, we will run 2 times each fold because of time...

In [9]:
###########################################################################
### YOU NEED TO CHANGE THE NUMBER OF K runs and of N LABELS!!!!!!!!!!!! ###
###########################################################################

# ======================================== #
#              CV Training                 #
# ======================================== #

# repeat 2 times

k_folds = 3 # <- CHANGE THE NUMBER OF FOLDS!!!
kfold = KFold(n_splits=k_folds, shuffle=True)

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H%M')

for fold, (train_ids, test_ids) in enumerate(kfold.split(dataset)):

    # Print
    print(f'FOLD {fold+1}')
    print('--------------------------------')

    # Sample elements randomly from a given list of ids, no replacement.
    train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids)
    test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids)

    # Define data loaders for training and testing data in this fold
    train_dataloader = torch.utils.data.DataLoader(
                      dataset,
                      batch_size=batch_size, sampler=train_subsampler)
    test_dataloader = torch.utils.data.DataLoader(
                      dataset,
                      batch_size=batch_size, sampler=test_subsampler)

    # Initiate model parameters for each fold
    model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=6) # <- CHANGE THE NUMBER OF LABELS!!!

####################################################################
### NO NEED TO CHANGE ANYTHING ELSE UNTIL YOU GET THE HANG OF IT ###
####################################################################

    device = torch.device('cuda:0')
    desc = model.to(device)
    optimizer = AdamW(model.parameters(), lr = lr, eps = 1e-6)
    total_steps = (int(len(dataset)/batch_size)+1) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 10, num_training_steps = total_steps)

    # Run the training loop for defined number of epochs
    for epoch_i in range(0, epochs):
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
        print('Training...')
        t0 = time.time()
        total_train_loss = 0 # Reset the total loss for this epoch.
        model.train() # Put the model into training mode.
        update_interval = good_update_interval( # Pick an interval on which to print progress updates.
                    total_iters = len(train_dataloader),
                    num_desired_updates = 10
                )

        predictions_t, true_labels_t = [], []
        for step, batch in enumerate(train_dataloader):
            if (step % update_interval) == 0 and not step == 0:
                elapsed = format_time(time.time() - t0)
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed), end='\r')
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)
            # Always clear any previously calculated gradients before performing a backward pass.
            model.zero_grad()
            # Perform a forward pass --returns the loss and the "logits"
            loss = model(b_input_ids,
                               attention_mask=b_input_mask,
                               labels=b_labels)[0]
            logits = model(b_input_ids,
                                attention_mask=b_input_mask,
                                labels=b_labels)[1]

            # Accumulate the training loss over all of the batches so that we can calculate the average loss at the end.
            total_train_loss += loss.item()
            # Perform a backward pass to calculate the gradients.
            loss.backward()
            # Clip the norm of the gradients to 1.0. This is to help prevent the "exploding gradients" problem.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # Update parameters and take a step using the computed gradient.
            optimizer.step()
            # Update the learning rate.
            scheduler.step()

            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            # Store predictions and true labels
            predictions_t.append(logits)
            true_labels_t.append(label_ids)

        # Combine the results across all batches.
        flat_predictions_t = np.concatenate(predictions_t, axis=0)
        flat_true_labels_t = np.concatenate(true_labels_t, axis=0)
        # For each sample, pick the label (0, 1) with the highest score.
        predicted_labels_t = np.argmax(flat_predictions_t, axis=1).flatten()
        acc_t = accuracy_score(predicted_labels_t, flat_true_labels_t)

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)

        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)

        print("")
        print("  Average training loss: {0:.3f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))
        print("  Training accuracy: {:.3f}".format(acc_t))

        if acc_t > 0.9 and epoch_i >= 3:
            break

    # TEST
    # After the completion of each training epoch, measure our performance on our test set.

    print("")
    print("Running test...")
    t0 = time.time()
    model.eval() # Put the model in evaluation mode--the dropout layers behave differently during evaluation.
    total_eval_loss = 0
    predictions, true_labels = [], []
    # Evaluate data for one epoch
    for batch in test_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            loss = model(b_input_ids,
                               attention_mask=b_input_mask,
                               labels=b_labels)[0]
            logits = model(b_input_ids,
                                attention_mask=b_input_mask,
                                labels=b_labels)[1]
        # Accumulate the test loss.
        total_eval_loss += loss.item()
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Store predictions and true labels
        predictions.append(logits)
        true_labels.append(label_ids)

    # Combine the results across all batches.
    flat_predictions = np.concatenate(predictions, axis=0)
    flat_true_labels = np.concatenate(true_labels, axis=0)
    # For each sample, pick the label (0, 1) with the highest score.
    predicted_labels = np.argmax(flat_predictions, axis=1).flatten()
    # Calculate the test accuracy.
    val_accuracy = (predicted_labels == flat_true_labels).mean()

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(test_dataloader)

    ov_acc = [accuracy_score(predicted_labels, flat_true_labels), recall_score(predicted_labels, flat_true_labels, average="macro"), precision_score(predicted_labels, flat_true_labels, average="macro"),f1_score(predicted_labels, flat_true_labels, average="macro")]
    f1 = list(f1_score(flat_true_labels,predicted_labels,average=None))
    matrix = confusion_matrix(flat_true_labels,predicted_labels)
    acc = list(matrix.diagonal()/matrix.sum(axis=1))
    cr = pd.DataFrame(classification_report(pd.Series(flat_true_labels),pd.Series(predicted_labels), output_dict=True)).transpose().iloc[0:3, 0:2]
    prec =list(cr.iloc[:,0])
    rec = list(cr.iloc[:,1])

    # Report the final accuracy for this test run.
    print("  0: {0:.3f}".format(acc[0]))
    print("  1: {0:.3f}".format(acc[1]))
    print('RoBERTa Prediction accuracy: {:.3f}'.format(val_accuracy))

    # Measure how long the test run took.
    test_time = format_time(time.time() - t0)
    print("  Test Loss: {0:.3f}".format(avg_val_loss))
    print("  Test took: {:}".format(test_time))

    fold_stats.append(
        {
            'fold': fold+1,
            'Training Loss': avg_train_loss,
            'Test Loss': avg_val_loss,
            'Test Accur.': ov_acc[0],
            '0 Accur.': acc[0],
            '1 Accur.': acc[1],
            'f1': [f1, ov_acc[3]],
            'prec': [prec, ov_acc[2]],
            'rec': [rec, ov_acc[1]]
        }
    )

FOLD 1
--------------------------------


model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.660
  Training epoch took: 0:00:17
  Training accuracy: 0.341

Training...

  Average training loss: 0.570
  Training epoch took: 0:00:16
  Training accuracy: 0.891

Training...

  Average training loss: 0.178
  Training epoch took: 0:00:17
  Training accuracy: 0.964

Running test...
  0: 0.962
  1: 1.000
RoBERTa Prediction accuracy: 0.935
  Test Loss: 0.233
  Test took: 0:00:04
FOLD 2
--------------------------------


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.462
  Training epoch took: 0:00:17
  Training accuracy: 0.446

Training...

  Average training loss: 0.296
  Training epoch took: 0:00:19
  Training accuracy: 0.942

Training...

  Average training loss: 0.087
  Training epoch took: 0:00:18
  Training accuracy: 0.986

Running test...
  0: 1.000
  1: 0.944
RoBERTa Prediction accuracy: 0.957
  Test Loss: 0.208
  Test took: 0:00:04
FOLD 3
--------------------------------


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.609
  Training epoch took: 0:00:18
  Training accuracy: 0.341

Training...

  Average training loss: 0.470
  Training epoch took: 0:00:18
  Training accuracy: 0.920

Training...

  Average training loss: 0.123
  Training epoch took: 0:00:18
  Training accuracy: 0.971

Running test...
  0: 0.952
  1: 1.000
RoBERTa Prediction accuracy: 0.949
  Test Loss: 0.235
  Test took: 0:00:04


We can now get some performance stat to see how we did. You can check the meaning of each stat, how to interpret them, and when to use them, in scholarly journal [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall), or check out [this paper](https://d1wqtxts1xzle7.cloudfront.net/37219940/5215ijdkp01-libre.pdf?1428316763=&response-content-disposition=inline%3B+filename%3DA_REVIEW_ON_EVALUATION_METRICS_FOR_DATA.pdf&Expires=1709137264&Signature=f3EFHnlTZXa38ug6~VBumSZrfe9ECAyMUh04CNTzYnXEsVaJS3T12eNPbu7iNP~z3DSTTJ2NAV845v50XBe8Sjm7AylacfjGxcQ8YqaDsMulhkCV8c-JtTrWaLILlSUzbQp9M5Md3ubChx5Y9xkBp~s~XlecEEu9B5QEOjyr2aiZRA6gz98crSv0VKKV2ow986UxoSaWZgaYPmTsTrWU2EN3-0S1~OyO9tf2eFqbb3jUwOl15vX1rzzoG9lcpqbURB0eGMqPlXoWPHYBAlGmvUJOGxfkz15VpCxYtg-RoL5IYJONHlkV8GDWXntOm4WdY-ZIcgF3f3c7XhpDzgzvGw__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA).

Let's see how we did:

In [17]:
print(fold_stats)

[{'fold': 1, 'Training Loss': 0.17780055947925733, 'Test Loss': 0.23261169429336276, 'Test Accur.': 0.9347826086956522, '0 Accur.': 0.9615384615384616, '1 Accur.': 1.0, 'f1': [[0.9433962264150944, 0.9655172413793104, 1.0, 0.8695652173913043, 0.967741935483871, 0.8333333333333334], 0.9299256590004855], 'prec': [[0.9259259259259259, 0.9333333333333333, 1.0], 0.9324786324786324], 'rec': [[0.9615384615384616, 1.0, 1.0], 0.9304122574955908]}, {'fold': 2, 'Training Loss': 0.0868152509699913, 'Test Loss': 0.20842399623777186, 'Test Accur.': 0.9565217391304348, '0 Accur.': 1.0, '1 Accur.': 0.9444444444444444, 'f1': [[0.9841269841269841, 0.9714285714285714, 0.9375, 0.9310344827586207, 0.9411764705882353, 0.972972972972973], 0.9563732469792307], 'prec': [[0.96875, 1.0, 0.9375], 0.9533912247092827], 'rec': [[1.0, 0.9444444444444444, 0.9375], 0.9600602343059239]}, {'fold': 3, 'Training Loss': 0.12294541967465826, 'Test Loss': 0.2347283914419157, 'Test Accur.': 0.9492753623188406, '0 Accur.': 0.952

In [23]:
econ_news_stats = []
econ_news_stats.append(
    {
        'Model': model_name,
        'lr': lr,
        'epochs': epochs,
        'batch_size': batch_size,

        'overall_mean': np.mean([x['Test Accur.'] for x in fold_stats ]),
        'overall_mean_sd': np.std([x['Test Accur.'] for x in fold_stats ]),
        'overall_mean_f1': np.mean([x['f1'][1] for x in fold_stats ]),
        'overall_mean_f1_sd': np.std([x['f1'][1] for x in fold_stats ]),
        'overall_recall': np.mean([x['rec'][1] for x in fold_stats ]),
        'overall_recall_sd': np.std([x['rec'][1] for x in fold_stats ]),
        'overall_prec': np.mean([x['prec'][1] for x in fold_stats ]),
        'overall_prec_sd': np.std([x['prec'][1] for x in fold_stats ]),

        'f1_0': [f['f1'][0][0] for f in fold_stats],
        'f1_1': [f['f1'][0][1] for f in fold_stats],
        'f1_2': [f['f1'][0][2] for f in fold_stats],
        'f1_3': [f['f1'][0][3] for f in fold_stats],
        'f1_4': [f['f1'][0][4] for f in fold_stats],
        'f1_5': [f['f1'][0][5] for f in fold_stats],
    }
)

In [24]:
print(econ_news_stats)

[{'Model': 'roberta-base', 'lr': 2e-05, 'epochs': 3, 'batch_size': 4, 'overall_mean': 0.9468599033816426, 'overall_mean_sd': 0.009037819774816274, 'overall_mean_f1': 0.9460119582892826, 'overall_mean_f1_sd': 0.011531132850998554, 'overall_recall': 0.94781591909226, 'overall_recall_sd': 0.012641592029523998, 'overall_prec': 0.9456002902063094, 'overall_prec_sd': 0.009332616140009985, 'f1_0': [0.9433962264150944, 0.9841269841269841, 0.9302325581395349], 'f1_1': [0.9655172413793104, 0.9714285714285714, 1.0], 'f1_2': [1.0, 0.9375, 0.9583333333333334], 'f1_3': [0.8695652173913043, 0.9310344827586207, 0.8888888888888888], 'f1_4': [0.967741935483871, 0.9411764705882353, 0.9615384615384616], 'f1_5': [0.8333333333333334, 0.972972972972973, 0.9714285714285714]}]


## 2.2. Final Model

AMAZEBALLZ!! Ok, now that we know that our model is the shit, we can use ALL the data to train a final model that we can then use to classify whatever (english news) corpus we want.

In [25]:
###########################################################################
### YOU NEED TO CHANGE THE NUMBER OF N LABELS!!!!!!!!!!!!               ###
###########################################################################


# ======================================== #
#               Training                   #
# ======================================== #

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H%M')

# Define data loaders for training and testing data in this fold
train_dataloader = torch.utils.data.DataLoader(
                       dataset,
                       batch_size=batch_size)

# Initiate model parameters for each fold
model = RobertaForSequenceClassification.from_pretrained(model_name, num_labels=6) # <- CHANGE THE NUMBER OF LABELS!!!
device = torch.device('cuda:0')
desc = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr = lr, eps = 1e-8)
total_steps = (int(len(dataset)/batch_size)+1) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 3, num_training_steps = total_steps)

    # Run the training loop for defined number of epochs
for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    t0 = time.time()
    total_train_loss = 0 # Reset the total loss for this epoch.
    model.train() # Put the model into training mode.
    update_interval = good_update_interval( # Pick an interval on which to print progress updates.
                    total_iters = len(train_dataloader),
                    num_desired_updates = 10)

    predictions_t, true_labels_t = [], []
    for step, batch in enumerate(train_dataloader):
        if (step % update_interval) == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed), end='\r')
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # Always clear any previously calculated gradients before performing a backward pass.
        model.zero_grad()
        # Perform a forward pass --returns the loss and the "logits"
        loss = model(b_input_ids,attention_mask=b_input_mask,labels=b_labels)[0]
        logits = model(b_input_ids,attention_mask=b_input_mask,labels=b_labels)[1]

        # Accumulate the training loss over all of the batches so that we can calculate the average loss at the end.
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0. This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step using the computed gradient.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()

        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Store predictions and true labels
        predictions_t.append(logits)
        true_labels_t.append(label_ids)

    # Combine the results across all batches.
    flat_predictions_t = np.concatenate(predictions_t, axis=0)
    flat_true_labels_t = np.concatenate(true_labels_t, axis=0)
    # For each sample, pick the label (0, 1) with the highest score.
    predicted_labels_t = np.argmax(flat_predictions_t, axis=1).flatten()
    acc_t = accuracy_score(predicted_labels_t, flat_true_labels_t)

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.3f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    print("  Training accuracy: {:.3f}".format(acc_t))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.235
  Training epoch took: 0:00:29
  Training accuracy: 0.524

Training...

  Average training loss: 0.188
  Training epoch took: 0:00:28
  Training accuracy: 0.942

Training...

  Average training loss: 0.077
  Training epoch took: 0:00:27
  Training accuracy: 0.978


Our model has been trained! No we just need to save it for future use or, in this case, just use it now.

In [29]:
# Save it in wherever FOLDER you want to save it... (it will be heavy though)
model.save_pretrained('/Users/scrubscrub/ppol6801/final_project_github/model/econ_classification_model')

## 2.3. Using the Model

To use the model, we just need to load it (we already loaded it, but I leave the code for when this is not the case), and CLASSIFY!!

In [42]:
# If this was a new session, then you would need to load the tokenizer again...
# tokenizer = RobertaTokenizer.from_pretrained("roberta-base", do_lower_case=True)

# If you saved your model somewhere, this is how you would retrieve it...
my_model = RobertaForSequenceClassification.from_pretrained('/Users/scrubscrub/ppol6801/final_project_github/model/econ_classification_model', num_labels=6)
device = torch.device("cpu")
my_model = model.to(device)

# Now we put our model in evaluation form:
my_model.eval()

label_dict = {3: "inflation", 4: "national debt", 0: "GDP", 2: "housing", 5: "stock market", 1: "employment/wages"}

def predict(sentence):
    sentence = tokenizer.encode_plus(sentence, return_tensors='pt', max_length=max_len, truncation=True)
    outputs = my_model(sentence["input_ids"], attention_mask=sentence["attention_mask"])
    outputs = outputs[0].detach().numpy()
    predicted_label = np.argmax(outputs)
    label = label_dict[predicted_label]
    return predicted_label, label

predict("Trump wins voters on inflation as Biden zeroes in on tariffs jobs NBC News poll CNBC More voters trust Donald Trump than President Joe Biden to deal with inflation and the cost of living their top concerns for the US according to the latest NBC News poll The poll of registered voters nationwide found that of respondents said Trump would better handle inflation and the cost of living while said the same of Biden The survey was taken from April to several days after the release of another hotterthanexpected inflation report indicating consumer prices gradually ticking back up Trump attacked Bidens economic policies immediately following the release of the data.")


(3, 'inflation')

In [44]:
biden= pd.read_csv('https://raw.githubusercontent.com/Gracej12/PPOL6801_econnews/main/data/extracted_biden_news.csv')
biden

# remove na rows
biden.dropna(subset=['extracted_content'], inplace=True)
biden.isna().sum()

biden['predicted_label'] = biden['extracted_content'].apply(predict)

In [45]:
biden

Unnamed: 0.1,Unnamed: 0,title,description,published date,url,source,extracted_content,predicted_label
1,1,Biden-voting counties equal 70% of America's e...,Biden-voting counties equal 70% of America's e...,"Tue, 10 Nov 2020 08:00:00 GMT",https://news.google.com/rss/articles/CBMilAFod...,Brookings Institution,Even with a new president and political party ...,"(0, GDP)"
2,2,Biden will have a long list of economic fixes ...,Biden will have a long list of economic fixes ...,"Wed, 11 Nov 2020 08:00:00 GMT",https://news.google.com/rss/articles/CBMia2h0d...,NBC News,Economists and market analysts say that when P...,"(0, GDP)"
3,3,4. Important issues in the 2020 election - Pew...,4. Important issues in the 2020 election Pew ...,"Thu, 13 Aug 2020 07:00:00 GMT",https://news.google.com/rss/articles/CBMiVmh0d...,Pew Research Center,"With the country in the midst of a recession, ...","(0, GDP)"
4,4,"Biden's economic recovery plan, called Build B...","Biden's economic recovery plan, called Build B...","Tue, 10 Nov 2020 08:00:00 GMT",https://news.google.com/rss/articles/CBMiZ2h0d...,CNBC,America's financial future was definitely on t...,"(4, national debt)"
5,5,Trump paints apocalyptic portrait of life in U...,Trump paints apocalyptic portrait of life in U...,"Wed, 28 Oct 2020 07:00:00 GMT",https://news.google.com/rss/articles/CBMib2h0d...,The Associated Press,WASHINGTON (AP) — The suburbs wouldn’t be the ...,"(2, housing)"
...,...,...,...,...,...,...,...,...
495,495,"Biden vs. Trump Economy Has a Clear Winner, De...","Biden vs. Trump Economy Has a Clear Winner, De...","Sun, 07 Apr 2024 07:00:00 GMT",https://news.google.com/rss/articles/CBMie2h0d...,Bloomberg,"This is Bloomberg Opinion Today, an economic m...","(0, GDP)"
496,496,Goldman Gauges Show Why Biden's Benefit From E...,Goldman Gauges Show Why Biden's Benefit From E...,"Mon, 19 Feb 2024 08:00:00 GMT",https://news.google.com/rss/articles/CBMicGh0d...,Bloomberg,Just as Americans are getting ready to cast ba...,"(0, GDP)"
497,497,"Alabama's Britt blasts Biden on economy, immig...","Alabama's Britt blasts Biden on economy, immig...","Thu, 07 Mar 2024 08:00:00 GMT",https://news.google.com/rss/articles/CBMie2h0d...,Tennessee Lookout,First-term U.S. Sen. Katie Britt of Alabama de...,"(4, national debt)"
498,498,Ex-Trump official admits his prediction about ...,Ex-Trump official admits his prediction about ...,"Fri, 02 Feb 2024 08:00:00 GMT",https://news.google.com/rss/articles/CBMicGh0d...,CNN,1. How relevant is this ad to you?\n\nVideo pl...,"(1, employment/wages)"


In [47]:
from google.colab import drive
drive.mount('/content/drive')

biden.to_csv('/content/drive/MyDrive/PPOL6801_final/classified_biden_news.csv')

Mounted at /content/drive


In [51]:
# now repeating for trump!
trump= pd.read_csv('https://raw.githubusercontent.com/Gracej12/PPOL6801_econnews/main/data/extracted_trump_news.csv')
trump

# remove na rows
trump.dropna(subset=['extracted_content'], inplace=True)
trump.isna().sum()

trump

trump['predicted_label'] = trump['extracted_content'].apply(predict)

In [53]:
trump
trump.to_csv('/content/drive/MyDrive/PPOL6801_final/classified_trump_news.csv')