Parts (training and related imports) taken from **DALC BERTje classification** by:
* Author: Chris McCormick
* Adapted by: Hylke van der Veen

Further adapted by: André Korporaal

## Preparation

### First things first

<!-- - Connect to drive (standard path is `/content/drive/MyDrive/`), should contain (each containing a dir for the sources): -->
- Make sure the root folder (the one that contains this notebook) has the following:
    <!-- - `dalc/` for the DALC data files -->
    - `logits/` for the logits to save
    - `data/` containing the `SemEval2023` and `EXIST` .csv files
    <!-- - `saved_models/` for the models to save -->
- Apply GPU hardware accelerator in notebook settings

(To use a different working directory, replace `'.'` below with the path to that folder.)

In [5]:
working_dir = '.' 

### Random seed value for all operations

In [6]:
# Set the seed value all over the place to make this reproducible.
# seed_val = 2022
seed_val = 1234

### Imports

Important installs that are required:
* Install huggingface library (transformers)
* Install jsonlines

*This script has been adapted for local use, so these packages are expected to be pre-installed.*

In [7]:
# GPU and training
import torch

# Importing data
import pandas as pd
import numpy as np

# (Pre)process data
import emoji
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

# Retrieving the tokenizer
import transformers
from transformers import AutoTokenizer

# Batch code for dataloader
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Scheduler for warmup
from transformers import get_linear_schedule_with_warmup

# Model creation
from transformers import AdamW, AutoModelForSequenceClassification, AutoConfig

# General code for the training
import random
import jsonlines # Needed for the data cartography output
import os
import glob
import time
import datetime
from pathlib import Path

In [None]:
torch.manual_seed(seed_val)
random.seed(seed_val)
np.random.seed(seed_val)
transformers.set_seed(seed_val)

### GPU

Identify and specify the GPU as the device. 

In [8]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: NVIDIA GeForce GTX 1060


## Training

In [9]:
def preprocess(text):
    '''Replaces usernames starting with @ with [USER] and links with [URL],
    converts emojis to their textual form, and strips the # sign from hashtags.'''
    documents = []
    for instance in text:
        instance = re.sub(r'@([^ ]*)', '[USER]', instance)
        instance = re.sub(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', '[URL]', instance)
        instance = emoji.demojize(instance)
        instance = instance.replace('#', '')
        documents.append(instance)
    return documents
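
To make the effect of `preprocess` concrete, here is a minimal sketch that applies the same two regular expressions and the `#` removal to a single made-up tweet. The helper name `normalize` and the sample text are illustrative assumptions, and the `emoji.demojize` step is left out so the snippet needs only the standard library.

```python
import re

# The two patterns are copied from `preprocess`; the sample tweet is made up.
USER_PATTERN = r'@([^ ]*)'
URL_PATTERN = r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'


def normalize(text):
    """Apply the mention, URL and '#' steps in the same order as `preprocess`."""
    text = re.sub(USER_PATTERN, '[USER]', text)
    text = re.sub(URL_PATTERN, '[URL]', text)
    return text.replace('#', '')


sample = '@john see https://example.com/page #wow'
print(normalize(sample))  # [USER] see [URL] wow
```

Note that only the `#` character is dropped; the hashtag word itself stays in the text.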

In [10]:
def read_data(d1, task_b, d2 = None):
    '''Reading in the dataset and returning it as pandas dataframes
    with only the text and label.'''
    # read in data to pandas
    df1 = pd.read_csv(d1)

    if task_b:
        # remove non-sexist instances, as these are only relevant for task A
        df1 = df1[df1['label_sexist'] == 'sexist']
        # drop columns we don't use
        df1 = df1.drop(columns=['rewire_id', 'label_sexist', 'label_vector'])

        # convert the task B category labels to numerical values (0-3)
        # df1.loc[df1.label_category == 'none', 'label_category'] = 0
        df1.loc[df1.label_category == '1. threats, plans to harm and incitement', 'label_category'] = 0
        df1.loc[df1.label_category == '2. derogation', 'label_category'] = 1
        df1.loc[df1.label_category == '3. animosity', 'label_category'] = 2
        df1.loc[df1.label_category == '4. prejudiced discussions', 'label_category'] = 3

        # converting column names
        df1.columns = ['text', 'label']

        if d2:
            df2 = pd.read_csv(d2)

            df2 = df2[df2['sexist'] == 1]
            df2.loc[df2.label_category == '1. threats, plans to harm and incitement', 'label_category'] = 0
            df2.loc[df2.label_category == '2. derogation', 'label_category'] = 1
            df2.loc[df2.label_category == '3. animosity', 'label_category'] = 2
            df2.loc[df2.label_category == '4. prejudiced discussions', 'label_category'] = 3

            df2 = df2.drop(columns=['test_case', 'id', 'source', 'language', 'sexist', 'Unnamed: 0'])

            df2.columns = ['text', 'exist_label', 'label']

            return df1, df2
        else:
            return df1

    else:
        # read in data to pandas
        df2 = pd.read_csv(d2)

        # drop columns we don't use
        df1 = df1.drop(columns=['rewire_id', 'label_category', 'label_vector'])
        df2 = df2.drop(columns=['test_case', 'id', 'source', 'language', 'category', 'Unnamed: 0'])

        # convert labels to numerical value (non sexist = 0, sexist = 1 )
        df1.loc[df1.label_sexist == 'not sexist', 'label_sexist'] = 0
        df1.loc[df1.label_sexist == 'sexist', 'label_sexist'] = 1

        df1.columns = ['text', 'label']
        df2.columns = ['text', 'label']
        return df1, df2
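
The repeated `.loc` assignments in `read_data` amount to a lookup table from category string to integer id. A minimal sketch of that mapping in plain Python follows; the category strings are taken from the function above, while the names `CATEGORY_TO_ID` and `encode_labels` are hypothetical.

```python
# Category strings as they appear in `read_data`; the names below are illustrative.
CATEGORY_TO_ID = {
    '1. threats, plans to harm and incitement': 0,
    '2. derogation': 1,
    '3. animosity': 2,
    '4. prejudiced discussions': 3,
}


def encode_labels(categories):
    """Map category strings to integer ids; an unknown category raises KeyError."""
    return [CATEGORY_TO_ID[category] for category in categories]


print(encode_labels(['2. derogation', '4. prejudiced discussions']))  # [1, 3]
```

With pandas, passing the same dictionary to `Series.map` would encode a whole column in one call.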


In [None]:
LEARNINGRATE = 1e-5
BATCHSIZE = 16
# PADDING = 325 # HateBERT
PADDING = 250 # DeBERTa
# EPOCHS = 5 # HateBERT
EPOCHS = 6 # DeBERTa

In [11]:
# Load the dataset into a pandas dataframe for the specific task
# df_train = pd.read_csv(f"{working_dir}/dalc_abusive/{source.split('_')[0]}/DALC-2_train_{source}.csv")
# df_dev = pd.read_csv(f"{working_dir}/dalc_abusive/{validate}/DALC-2_dev_{validate}.csv")

# ori_train, ori_dev = train_test_split(ori, test_size=0.2, random_state=RANDOM_STATE)
data_file1 = 'data/train_all_tasks.csv'
data_file2 = 'data/EXIST2021_merged.csv'
taskb_state = True
data_used = 'semeval'
# data_used = 'exist'
# data_used = 'shuffle_semeval-exist'
model_used = 'deberta'

if taskb_state:
    
    # read_data returns two dataframes when a second file is passed
    ori, d2 = read_data(data_file1, taskb_state, data_file2)

    df_train, df_dev = train_test_split(ori, test_size=0.2, random_state=seed_val)

else:

    ori, d2 = read_data(data_file1, taskb_state, data_file2)

    if data_used == 'semeval':
        df_train, df_dev = train_test_split(ori, test_size=0.2, random_state=seed_val)
    elif data_used == 'exist':
        df_train_ori, df_dev_ori = train_test_split(ori, test_size=0.2, random_state=seed_val)
        # df_train, df_dev = train_test_split(d2, test_size=0.2, random_state=seed_val)
        df_train = d2
        df_dev = df_dev_ori
    elif data_used == 'shuffle_semeval-exist':
        ori_train, ori_dev = train_test_split(ori, test_size=0.2, random_state=seed_val)
        ori_concat = pd.concat([ori_train, d2], axis=0)
        ori_concat_shuffled = ori_concat.sample(frac=1, random_state=seed_val)

        df_train = ori_concat_shuffled
        df_dev = ori_dev
        
        df_train.reset_index(inplace=True, drop=True)

    else:
        print("WARNING: pick a data type for the output folder structure!")

# Report the number of sentences.
print('Number of training sentences: {:,}'.format(df_train.shape[0]))
print('Number of development sentences: {:,}\n'.format(df_dev.shape[0]))

Number of training sentences: 11,200
Number of development sentences: 2,800
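
The 11,200 / 2,800 counts are what an 80/20 split of 14,000 rows produces. The sketch below illustrates a seeded hold-out split with the standard library only; it is not scikit-learn's algorithm and will not select the same rows as `train_test_split`. The function name `split_indices` is an assumption made for this example.

```python
import random


def split_indices(n_samples, test_size=0.2, seed=1234):
    """Shuffle the row indices with a fixed seed and hold out a fraction of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, dev_idx = split_indices(14000)
print(len(train_idx), len(dev_idx))  # 11200 2800
```

Because the seed is fixed, calling `split_indices` again returns the same partition, which is the property the notebook relies on when it passes `random_state=seed_val`.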



In [12]:
df_output_paths = f'{working_dir}/datafiles_for_cartography/{data_used}/{model_used}/WINOGRANDE'

Path(df_output_paths).mkdir(parents=True, exist_ok=True)

df_train.to_csv(f'{df_output_paths}/train.tsv', sep='\t', index_label='id')
df_dev.to_csv(f'{df_output_paths}/dev.tsv', sep='\t', index_label='id')
df_dev.to_csv(f'{df_output_paths}/test.tsv', sep='\t', index_label='id')

In [13]:
# Display 5 random rows from the data.
df_train.sample(n=5)

Unnamed: 0,text,label
1804,If I were young guy today...thank god I'm not....,1
11952,"Move on, talk to more girls, don't make the sa...",0
3871,She got her equal rights... And lefts 😂,1
10472,This is hard for me because I am still only a ...,0
1534,Thousands of Swedes insert microchips into bod...,0


Extract the sentences and labels of our training set as numpy ndarrays.

In [14]:
# Get the lists of sentences and their labels.
train_sentences = df_train.text.values
train_labels = df_train.label.values
# train_ids = df_train.id.values
train_ids = df_train.index.values

dev_sentences = df_dev.text.values
dev_labels = df_dev.label.values
# dev_ids = df_dev.id.values
dev_ids = df_dev.index.values

In [15]:
print('train:')
print(train_sentences[:2])
print(train_labels[:2])
print(train_ids[:2])

print()

print('dev:')
print(dev_sentences[:2])
print(dev_labels[:2])
print(dev_ids[:2])

train:
['Black woman goes on angry rant against Jewish man on subway [URL]'
 "To quote myself, 'Feminism is the Bastardization of Patriarchy the bending of it's will to the whims of women'"]
[0 0]
[11518  4259]

dev:
['The only reason they do this outdoors is to intimidate people in western nations. They never pull this in their own countries.'
 'I know right? Maybe she likes his strength but his long ass nails are hurting her when he fingers her.']
[0 0]
[3429 9829]


In [16]:
train_sentences = np.array(preprocess(train_sentences))
dev_sentences = np.array(preprocess(dev_sentences))

In [17]:
train_labels = train_labels.astype('int32')
dev_labels = dev_labels.astype('int32')

In [18]:
train_labels[10:20]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

In [19]:
dev_labels[10:20]

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=int32)

### Tokenize

In [20]:
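# Load the DeBERTa tokenizer.
# Note: `preprocess` emits '[USER]' and '[URL]', whereas the tokens registered here are '@USER' and 'URL'.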
print('Loading DeBERTa tokenizer...')
# tokenizer = BertTokenizer.from_pretrained('GroNLP/bert-base-dutch-cased', additional_special_tokens=['@USER', 'URL'])
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base', additional_special_tokens=['@USER', 'URL'])

print('Special tokens: ', tokenizer.get_added_vocab())

Loading DeBERTa tokenizer...


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Special tokens:  {'@USER': 128001, '[MASK]': 128000}


#### Tokenize train

In [21]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in train_sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode, with added custom tokens for tweets.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        # max_length = 280,          # Pad & truncate all sentences.
                        max_length = PADDING,          # Pad & truncate all sentences.
                        padding='max_length',
                        truncation=True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
train_input_ids = torch.cat(input_ids, dim=0)
train_attention_masks = torch.cat(attention_masks, dim=0)
train_labels = torch.tensor(train_labels)
train_ids = torch.tensor(train_ids)
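
Steps (5) and (6) in the comments above, padding or truncating to `max_length` and building the attention mask, can be sketched without the tokenizer. The token ids and the `pad_id` value below are hypothetical, and the plain slice is a simplification: the real tokenizer keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    ids = token_ids[:max_length]           # truncate if too long
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)          # how many pad positions are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = pad_or_truncate([1, 845, 322, 2], max_length=6)
print(ids)   # [1, 845, 322, 2, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Every encoded sentence ends up with the same length, which is what allows `torch.cat` to stack them into a single tensor.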

#### Tokenize dev

In [22]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in dev_sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode, with added custom tokens for tweets.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        # max_length = 280,          # Pad & truncate all sentences.
                        max_length = PADDING,          # Pad & truncate all sentences.
                        padding='max_length',
                        truncation=True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
dev_input_ids = torch.cat(input_ids, dim=0)
dev_attention_masks = torch.cat(attention_masks, dim=0)
dev_labels = torch.tensor(dev_labels)
dev_ids = torch.tensor(dev_ids)

### Create batches for the train dataloader

In [23]:
# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
#
# André: optimal would be 16 or 32, but to keep the overhead small, 
# I tried it with 8 here as I was running it on my laptop gpu.
batch_size = BATCHSIZE

# Combine the training inputs into a TensorDataset.
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels, train_ids)
dev_dataset = TensorDataset(dev_input_ids, dev_attention_masks, dev_labels, dev_ids)

# Create the DataLoaders for our training and validation sets.
# The training data is already shuffled, so we read it sequentially.
train_dataloader = DataLoader(
        train_dataset,  # The training samples.
        sampler = SequentialSampler(train_dataset),  # Training data is already randomized
        batch_size = batch_size  # Train with this batch size.
    )

# For validation the order doesn't matter, so we'll just read them sequentially.
dev_dataloader = DataLoader(
            dev_dataset, # The validation samples.
            sampler = SequentialSampler(dev_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

print('{:>5,} training samples'.format(len(train_ids)))
print('{:>5,} development samples'.format(len(dev_ids)))

11,200 training samples
2,800 development samples
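
A `DataLoader` with a `SequentialSampler` walks through the dataset in order and groups consecutive items into batches. The sketch below shows that behaviour on a toy list; `sequential_batches` is a hypothetical helper, not part of PyTorch.

```python
def sequential_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last batch may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(sequential_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The number of batches is the sample count divided by the batch size, rounded up, which is the value `len(train_dataloader)` reports.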


### Create model

In [24]:
# Load AutoModelForSequenceClassification: the pretrained DeBERTa model with a
# classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    # "GroNLP/bert-base-dutch-cased",  # Use the 12-layer BERT model, with an uncased vocab.
    "microsoft/deberta-v3-base",
    num_labels = 5,                  # Size of the output layer; task A needs 2 labels, task B needs 4.
    output_attentions = False,       # Whether the model returns attentions weights.
    output_hidden_states = False,    # Whether the model returns all hidden-states.
)

# Move the model to the selected device (the GPU if one is available).
model.to(device)

# Resizes input token embeddings matrix to account for the new special tokens
model.resize_token_embeddings(len(tokenizer))

Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.classifier.weight', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a

Embedding(128002, 768)

In [25]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 202 different named parameters.

==== Embedding Layer ====

deberta.embeddings.word_embeddings.weight               (128002, 768)
deberta.embeddings.LayerNorm.weight                           (768,)
deberta.embeddings.LayerNorm.bias                             (768,)
deberta.encoder.layer.0.attention.self.query_proj.weight   (768, 768)
deberta.encoder.layer.0.attention.self.query_proj.bias        (768,)

==== First Transformer ====

deberta.encoder.layer.0.attention.self.key_proj.weight    (768, 768)
deberta.encoder.layer.0.attention.self.key_proj.bias          (768,)
deberta.encoder.layer.0.attention.self.value_proj.weight   (768, 768)
deberta.encoder.layer.0.attention.self.value_proj.bias        (768,)
deberta.encoder.layer.0.attention.output.dense.weight     (768, 768)
deberta.encoder.layer.0.attention.output.dense.bias           (768,)
deberta.encoder.layer.0.attention.output.LayerNorm.weight       (768,)
deberta.encoder.layer.0.attention.output.LayerNorm.bias   

### Set parameters

Set optimizer

In [26]:
# Note: this AdamW is the class from the huggingface library (as opposed to pytorch).
# I believe the 'W' stands for 'Weight Decay fix'.
optimizer = AdamW(model.parameters(),
                  # lr = 1e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  lr = LEARNINGRATE, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )




Create scheduler

In [23]:
# Number of training epochs. The BERT authors recommend between 2 and 4.
# DONE UP TOP
# epochs = 5
epochs = EPOCHS

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 2, # run_glue.py defaults to 0 warmup steps
                                            num_training_steps = total_steps)
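
The scheduler scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero at the last step. A minimal sketch of that piecewise-linear factor follows; the function name `linear_warmup_factor` is an assumption, and the 700 batches per epoch is a hypothetical figure used only to get a concrete step count.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 towards 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total_steps = 700 * 6  # hypothetical: 700 batches per epoch, 6 epochs
for step in (0, 1, 2, 2101, total_steps):
    print(step, linear_warmup_factor(step, 2, total_steps))
```

With only 2 warmup steps the ramp is almost instantaneous, so the schedule is effectively a linear decay over the whole run.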

### Train

In [24]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
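
`flat_accuracy` picks the highest-scoring class per row and compares it with the gold label. The same computation on a small hand-made batch, written in plain Python so it runs without NumPy; the logits and labels are invented for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the gold label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return correct / len(labels)


logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.4, 0.9]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```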

In [25]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
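
A quick check of `format_time` on two made-up durations. The helper is repeated here, condensed to one line, so the snippet runs on its own.

```python
import datetime


def format_time(elapsed):
    """Same logic as the helper above: round to whole seconds, format as h:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


print(format_time(454.4))   # 0:07:34
print(format_time(3725.6))  # 1:02:06
```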

In [26]:
# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Remove all old logit files
# files = glob.glob(f'{working_dir}/logits/{source}/*')
Path(f'{working_dir}/logits/{data_used}/{model_used}/training_dynamics').mkdir(parents=True, exist_ok=True)
files = glob.glob(f'{working_dir}/logits/{data_used}/{model_used}/training_dynamics/*')
for file in files:
    os.remove(file)
    
# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Format the elapsed time as h:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        #   [3]: tweet ids
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        # b_labels = batch[2].to(device)
        b_labels = batch[2].type(torch.LongTensor).to(device)
        b_ids = batch[3].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # In PyTorch, calling `model` will in turn call the model's `forward` 
        # function and pass down the arguments. The `forward` function is 
        # documented here: 
        # https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
        # The results are returned in a results object, documented here:
        # https://huggingface.co/transformers/main_classes/output.html#transformers.modeling_outputs.SequenceClassifierOutput
        # Specifically, we'll get the loss (because we provided labels) and the
        # "logits"--the model outputs prior to activation.
        result = model(b_input_ids, 
                       token_type_ids=None, 
                       attention_mask=b_input_mask, 
                       labels=b_labels,
                       return_dict=True)

        loss = result.loss
        logits = result.logits
        
        # 'logits' is a 2D tensor of shape (batch size, number of labels)
        # 'b_labels' is a 1D tensor of the true label
        # 'b_ids' is a 1D tensor of the tweet IDs

        # Add the logits from this batch to the training dynamics file for this epoch
        # with jsonlines.open(f'{working_dir}/logits/{source.split("_")[0]}/dynamics_epoch_{epoch_i}.jsonl', mode='a') as writer:
        with jsonlines.open(f'{working_dir}/logits/{data_used}/{model_used}/training_dynamics/dynamics_epoch_{epoch_i}.jsonl', mode='a') as writer:
            for i in range(len(b_labels)):   # Batch size
                writer.write({"guid": int(b_ids[i]), 
                              f"logits_epoch_{epoch_i}": logits[i].tolist(), 
                              "gold": int(b_labels[i])})

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.3f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in dev_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains four pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        #   [3]: tweet ids
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        # b_labels = batch[2].to(device)
        b_labels = batch[2].type(torch.LongTensor).to(device)
        b_ids = batch[3].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            result = model(b_input_ids, 
                           token_type_ids=None, 
                           attention_mask=b_input_mask,
                           labels=b_labels,
                           return_dict=True)

        # Get the loss and "logits" output by the model. The "logits" are the 
        # output values prior to applying an activation function like the 
        # softmax.
        loss = result.loss
        logits = result.logits
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(dev_dataloader)
    print("  Accuracy: {0:.3f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(dev_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.3f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    706.    Elapsed: 0:00:26.
  Batch    80  of    706.    Elapsed: 0:00:51.
  Batch   120  of    706.    Elapsed: 0:01:17.
  Batch   160  of    706.    Elapsed: 0:01:43.
  Batch   200  of    706.    Elapsed: 0:02:08.
  Batch   240  of    706.    Elapsed: 0:02:34.
  Batch   280  of    706.    Elapsed: 0:03:00.
  Batch   320  of    706.    Elapsed: 0:03:26.
  Batch   360  of    706.    Elapsed: 0:03:51.
  Batch   400  of    706.    Elapsed: 0:04:17.
  Batch   440  of    706.    Elapsed: 0:04:43.
  Batch   480  of    706.    Elapsed: 0:05:09.
  Batch   520  of    706.    Elapsed: 0:05:35.
  Batch   560  of    706.    Elapsed: 0:06:00.
  Batch   600  of    706.    Elapsed: 0:06:26.
  Batch   640  of    706.    Elapsed: 0:06:52.
  Batch   680  of    706.    Elapsed: 0:07:18.

  Average training loss: 0.532
  Training epoch took: 0:07:34

Running Validation...
  Accuracy: 0.692
  Validation Loss: 0.564
  Validation took: 0:01:24

Training...
  Batch    40  of  