### Causal relation classification: using a pre-trained BERT model

### Causal relation classification: does the paragraph contain a causal relation?<br>

A fun example:
<img src="https://i1.wp.com/boingboing.net/wp-content/uploads/2020/11/Screen-Shot-2020-11-15-at-6.15.14-AM.png?fit=1208%2C786&ssl=1" style="width:400px;height:300px">


In this task, we implement a BERT model to classify whether a paragraph contains a causal relation. In notebook *1.2-causal-relation-presence-bert-embeddings* we used the last hidden state (the [CLS] token from the last transformer layer) as a paragraph embedding and trained a linear classifier on it. Here we instead start from a pre-trained BERT model and re-train the full model on our data. The differences between this approach and *1.2-causal-relation-presence-bert-embeddings* are:
* the classifier: here we use a feed-forward neural network with softmax, instead of logistic regression with a sigmoid;
* we re-train the full BERT model on our data.
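
As a side note on the first difference: for two classes, a softmax over two logits yields the same probability as a sigmoid applied to the difference of the logits, so the two classifier heads differ mainly in how they are parameterized and trained. The following minimal sketch checks this numerically; the logit values are made up for illustration.

```python
import math

def softmax2(z0, z1):
    """Softmax over two logits; returns (p0, p1)."""
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

# Toy logits for "no causal relation" (z0) and "causal relation" (z1).
z0, z1 = -0.3, 1.2
p0, p1 = softmax2(z0, z1)

# The softmax probability of class 1 equals sigmoid(z1 - z0).
print(p1, sigmoid(z1 - z0))
```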

### 1. Data preparation

In [1]:
cd ..

F:\PythonJupyterStudy\CM\CM_Macro\SSIML2021


In [2]:
from collections import OrderedDict 
import itertools
from IPython.display import clear_output
import numpy as np
import matplotlib.pyplot as plt
import os
import pprint
import pandas as pd
import random
from sklearn.utils import shuffle,resample
from sklearn.metrics import classification_report,confusion_matrix,f1_score
from sklearn.model_selection import train_test_split,KFold
from src.data.make_dataset import read_data_file,make_dataset
import time
import datetime
import seaborn as sns
import torch
import tqdm
import warnings
from transformers import BertTokenizer,RobertaTokenizer,BertForSequenceClassification,get_linear_schedule_with_warmup,AdamW, BertConfig, RobertaForSequenceClassification
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler, SequentialSampler
import wandb


warnings.filterwarnings('ignore')

In [3]:
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
    

# Set the random seed to keep results consistent between experiments.
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
if device.type == 'cuda':
    torch.cuda.manual_seed_all(seed_val)

There are 1 GPU(s) available.
We will use the GPU: GeForce RTX 2070 SUPER


In [4]:
assert os.path.isdir("csv"), 'The directory "csv" does not exist!'
assert os.path.isdir("txt"), 'The directory "txt" does not exist!'
map_contents = read_data_file("csv/Map_Contents-20200726.csv")
speech_contents = read_data_file("csv/Speech_Contents-20210520.txt")
speeches = read_data_file("csv/Speeches-20210520.txt")

In [5]:
X, y = make_dataset(speeches, speech_contents, map_contents)

skipping file in language fr: 2009-12-01 Sarkozy Elysee (Economy) ann fr.txt
skipping file in language fr: 2009-12-14 Sarkozy Elysee (Economy) ann fr.txt
skipping file in language fr: 2010-04-20 Barroso European Commission ann fr.txt
skipping file in language fr: 2011-01-13 Sarkozy gb ann.txt
skipping file in language nl: 2011-04-06 Rutte FD evenement ann NL.txt
skipping file in language nl: 2011-09-27 Rutte Rijksoverheid ann.txt
skipping file in language nl: 2011-10-28 Knot dnb_01 ANN NL.txt
skipping file in language de: 2012-01-06 Rutte CSU klausurtagung ann G.txt
skipping file in language unk: 2012-07-26 Barroso European Commission.txt
skipping file in language fr: 2012-08-30 Hollande SFM2020 ann fr.txt
skipping file in language fr: 2013-02-19 Hollande SFM2020 ann fr.txt
skipping file in language fr: 2013-04-17 Hollande SFM2020 ann fr.txt
skipping file in language de: 2013-11-21 Merkel Bundesregerung ann g.txt
skipping file in language de: 2014-02-27 Merkel Bundesregerung ann g.txt


### 2. Balance the data

1. First of all, there are still some *missing value* paragraphs in our data, so we need to remove them. <br><br>
2. In addition, our data is highly imbalanced: we have twice as many paragraphs with causal relations as paragraphs without. A classifier trained on such data tends to predict that a paragraph contains a causal relation, simply because that guess is less likely to be wrong, which is not what we want. Therefore we need to balance our data.<br>

<h1><center>Undersampling VS  Oversampling</center></h1>

![](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png)

There are two common methods of balancing data: undersampling and oversampling. The former randomly samples from the class with more data until the classes are balanced; the latter copies data points from the class with less data until the classes are balanced.<br><br>
Both methods have drawbacks. Undersampling discards existing data, which costs our model training examples we can hardly spare given our small data size. Oversampling, if applied before the train/test split, artificially inflates the measured accuracy: some data points appear more than once in the dataset, so data points in the test set are likely to appear in the training set as well.
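
The two strategies can be sketched with the standard library alone. This is only an illustration on made-up toy data (the helper names `undersample` and `oversample` are hypothetical); the notebook itself uses `sklearn.utils.resample` in the next cell.

```python
import random

def undersample(majority, minority, rng):
    """Randomly keep only as many majority items as there are minority items."""
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority, rng):
    """Draw minority items with replacement until both classes have equal size."""
    return majority + rng.choices(minority, k=len(majority))

rng = random.Random(42)
causal = [(f"causal paragraph {i}", 1) for i in range(6)]     # toy majority class
non_causal = [(f"other paragraph {i}", 0) for i in range(3)]  # toy minority class

under = undersample(causal, non_causal, rng)  # 3 + 3 items, data is discarded
over = oversample(causal, non_causal, rng)    # 6 + 6 items, minority items repeat
print(len(under), len(over))
```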

In [6]:
def over_sampling(X,y):
    """
    Balance the data by over-sampling the class without causal relations.
    Input: X, y (paragraphs and their boolean labels)
    Output: a shuffled dataframe in which both classes have the same size
    """
    df = pd.DataFrame({'X':pd.Series(X),'y':pd.Series(y)})
    df_true = df[df['y'] == True]
    df_false = df[df['y'] == False] 
    
    # Upsampling: draw rows from the smaller class (False) with replacement
    # until it is as large as the other class.
    df_false_upsampled = resample(df_false,random_state=seed_val,n_samples=len(df_true),replace=True)
    df_upsampled = pd.concat([df_false_upsampled,df_true])
    df_upsampled = shuffle(df_upsampled)
    
    print('\nWe have {} training examples in total after oversampling.'.format(len(df_upsampled)))
    return df_upsampled


def down_sampling():
    """
    Placeholder: under-sampling is not implemented, because this notebook only uses over-sampling.
    """


def transform_df(df_upsampled):
    #transform label to int
    df_upsampled.loc[df_upsampled['y'] == 'True', 'y'] = 1
    df_upsampled.loc[df_upsampled['y'] == 'False', 'y'] = 0
    df_upsampled.y = df_upsampled.y.astype(int)
    
    #get sentences and label, will use them to do the tokenization
    #sentences = df_upsampled.X.values
    #labels = df_upsampled.y.values
    return df_upsampled

### Model processing flow

![1.1.3Model_Processing_Flow-3.png](attachment:1.1.3Model_Processing_Flow-3.png)


As shown in the model processing flow above, we use over-sampling only in an appropriate way: after splitting the data into a training set and a test set (k-fold), only the training set is over-sampled, which ensures that the test data contain no duplicates.
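
The order of operations matters, so here is a minimal sketch of "split first, over-sample the training part only". It uses made-up toy data and a single hold-out split instead of k-fold to stay short; all names are illustrative.

```python
import random

rng = random.Random(42)
# Toy data: 8 causal (label 1) and 4 non-causal (label 0) paragraphs, all distinct.
data = [(f"paragraph {i}", 1 if i < 8 else 0) for i in range(12)]
rng.shuffle(data)

# 1) Split first, so the test set only holds original, distinct paragraphs.
n_test = 3
test, train = data[:n_test], data[n_test:]

# 2) Over-sample the minority class inside the training set only.
pos = [d for d in train if d[1] == 1]
neg = [d for d in train if d[1] == 0]
minority, majority = (neg, pos) if len(neg) < len(pos) else (pos, neg)
extra = rng.choices(minority, k=len(majority) - len(minority))
train_balanced = train + extra
rng.shuffle(train_balanced)

# No test paragraph can appear in the balanced training set.
print(len(train_balanced), len(test))
```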

### 3. Re-train an end-to-end pre-trained BERT classification model


<img src="http://www.mccormickml.com/assets/BERT/padding_and_mask.png" style="width:600px;height:500px">

Our first step is to tokenize the paragraphs: break them up into words and subwords in the format BERT expects. This process includes adding the [CLS] and [SEP] tokens and replacing tokens with their IDs (tokens2IDs). 

After tokenization, we have a list of paragraphs, each represented as a list of token IDs. We want BERT to process our examples in batches, which is faster. For that reason, we need to pad all lists to the same length, so that we can represent the input as one 2-d array rather than a list of lists of different lengths. 

However, if we sent the padded input directly to BERT, the padding would slightly confuse it. We need to create another variable, the attention mask, that tells BERT to ignore (mask) the padding we've added when it processes its input. Finally, we will re-train the BERT model and validate it on the validation set; the corresponding visualizations will be implemented as well.
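
To make padding and masking concrete, here is a minimal pure-Python sketch. The token IDs and the `PAD_ID` value are made up for illustration, and `pad_and_mask` is a hypothetical helper; in the notebook this work is done by the tokenizer. Unlike a real tokenizer, this toy truncation does not preserve the final [SEP] token.

```python
PAD_ID = 0  # hypothetical padding id for this toy example

def pad_and_mask(token_id_lists, max_length):
    """Pad or truncate each list to max_length and build the attention masks."""
    input_ids, attention_masks = [], []
    for ids in token_id_lists:
        ids = ids[:max_length]                    # truncate long inputs
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)  # pad short inputs
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_masks

# Made-up token IDs; 101 and 102 stand in for [CLS] and [SEP].
tokenized = [[101, 7, 8, 9, 102], [101, 5, 102]]
ids, masks = pad_and_mask(tokenized, max_length=6)
for row, mask in zip(ids, masks):
    print(row, mask)
```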


#### Preparation functions

In [7]:
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    return

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

#### Transformer functions

In [8]:
def tokenize_process(df,tokenizer,max_length):
    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    attention_masks = []
    
    sentences = df.X.values
    labels = df.y.values
    for sent in sentences:
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
                            sent,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
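                            max_length = max_length,   # Pad & truncate all sentences.
                            padding = 'max_length',    # Replaces the deprecated `pad_to_max_length = True`.
                            truncation = True,         # Truncate explicitly to `max_length`.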
                            return_attention_mask = True,   # Construct attn. masks.
                            return_tensors = 'pt',     # Return pytorch tensors.
                       )
        # Add the encoded sentence to the list.    
        input_ids.append(encoded_dict['input_ids'])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
    
    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)
    # Optionally print one paragraph, now as a list of IDs.
    #print('Check the original paragraph and converted paragraph: ')
    #print('Original: ', sentences[1])
    #print('Token IDs:', input_ids[1])
    
    dataset = TensorDataset(input_ids, attention_masks, labels)
    return dataset

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def convert_logits_tolabel(logits):
    # Predict class 0 only when its logit is strictly larger; otherwise class 1.
    pred = []
    for i in logits:
        if i[0]> i[1]:
            pred.append(0)
        else:
            pred.append(1)
    return pred

In [9]:
def Training_and_evaluating(model,train_dataloader,validation_dataloader,optimizer,epochs,scheduler,total_steps):
    print('\nTraining and evaluating the model.')
    
    #store a number of quantities such as training and validation loss,validation accuracy, and timings.
    training_stats = []
    total_t0 = time.time()
    
    #store prediction and true labels
    train_logits = []
    train_label = []
    # For each epoch...
    for epoch_i in range(0, epochs):
        
        # ========================================
        #               Training
        # ========================================
        
        # Perform one full pass over the training set.
    
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
        print('Training...')
    
        # Measure how long the training epoch takes.
        t0 = time.time()
        # Reset the total loss for this epoch.
        total_train_loss = 0
        model.train()
        
        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
    
            # Progress update every 40 batches.
            if step % 40 == 0 and not step == 0:
                # Calculate elapsed time in minutes.
                elapsed = format_time(time.time() - t0)
                
                # Report progress.
                print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
    
            # Unpack this training batch from our dataloader. 
            #
            # As we unpack the batch, we'll also copy each tensor to the GPU using the 
            # `to` method.
            #
            # `batch` contains three pytorch tensors:
            #   [0]: input ids 
            #   [1]: attention masks
            #   [2]: labels 
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].long().to(device)
    
            model.zero_grad()        
            loss, logits = model(b_input_ids, 
                                 token_type_ids=None, 
                                 attention_mask=b_input_mask, 
                                 labels=b_labels)
            
            logits_ = logits.detach().cpu().numpy()
            label_ids_ = b_labels.to('cpu').numpy()
            
            #store prediction for the last epoch
            if epoch_i == epochs-1:
                train_logits.extend(logits_)
                train_label.extend(label_ids_)
            total_train_loss += loss.item()
    
            # Perform a backward pass to calculate the gradients.
            loss.backward()
            # Clip the norm of the gradients to 1.0.
            # This is to help prevent the "exploding gradients" problem.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    
            # Update parameters and take a step using the computed gradient.
            # (The learning rate is updated once per epoch, after this loop.)
            optimizer.step()
        
        #Update scheduler(lr decay) every epoch
        lr_stat_opt = optimizer.param_groups[0]["lr"] #or lr_stat_scheduler = scheduler.get_last_lr()[0]
        print('current lr is:',lr_stat_opt)
        if scheduler:
            scheduler.step()
        
        wandb.log({"lr": lr_stat_opt})
    
        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)            
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
    
        print("")
        print("  Average training loss: {0:.2f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))
            
        # ========================================
        #               Validation
        # ========================================
        # After the completion of each training epoch, measure our performance on
        # our validation set.
    
        print("")
        print("Running Validation...")
        eval_pred = []
        eval_label = []
        t0 = time.time()
        # Put the model in evaluation mode--the dropout layers behave differently
        # during evaluation.
        model.eval()
        # Tracking variables 
        total_eval_accuracy = 0
        total_eval_loss = 0
        nb_eval_steps = 0
        # Evaluate data for one epoch
        for batch in validation_dataloader:
            
            # Unpack this validation batch from our dataloader. 
            # As we unpack the batch, we'll also copy each tensor to the GPU using the `to` method.
            # `batch` contains three pytorch tensors: [0]: input ids; [1]: attention masks; [2]: labels 
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].long().to(device)
            
            # Tell pytorch not to bother with constructing the compute graph during
            # the forward pass, since this is only needed for backprop (training).
            with torch.no_grad():        
                # Forward pass, calculate logit predictions.
                # token_type_ids is the same as the "segment ids", which 
                # differentiates sentence 1 and 2 in 2-sentence tasks.
                # The documentation for this `model` function is here: 
                # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
                # Get the "logits" output by the model. The "logits" are the output
                # values prior to applying an activation function like the softmax.
                (loss, logits) = model(b_input_ids, 
                                       token_type_ids=None, 
                                       attention_mask=b_input_mask,
                                       labels=b_labels)
                
            # Accumulate the validation loss.
            total_eval_loss += loss.item()
            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            # Calculate the accuracy for this batch of test sentences, and
            # accumulate it over all batches.
            #total_eval_accuracy += flat_accuracy(logits, label_ids)
            eval_pred.extend(logits)
            eval_label.extend(label_ids)
        

        eval_pred = convert_logits_tolabel(eval_pred)
        f1_val = f1_score(eval_label,eval_pred,average='macro')
        
        
        # Report the final accuracy for this validation run.
        #avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
        print("  Macro F1 score: {0:.2f}".format(f1_val))
        # Calculate the average loss over all of the batches.
        avg_val_loss = total_eval_loss / len(validation_dataloader)
        # Measure how long the validation run took.
        validation_time = format_time(time.time() - t0)
        print("  Validation Loss: {0:.2f}".format(avg_val_loss))
        print("  Validation took: {:}".format(validation_time))
        
        wandb.log({"Training Loss": avg_train_loss, "Valid. Loss":avg_val_loss,"Valid. Macro F1":f1_val, "epoch": epoch_i + 1 })
        
        # Record all statistics from this epoch.
        training_stats.append(
            {
                'epoch': epoch_i + 1,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Valid. Macro F1.': f1_val,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )
    
    print("")
    print("Training complete!")
    
    print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
    
    pd.set_option('display.precision', 2)
    # Create a DataFrame from our training statistics.
    df_stats = pd.DataFrame(data=training_stats)
    # Use the 'epoch' as the row index.
    df_stats = df_stats.set_index('epoch')
    
    # Plot the learning curve.
    plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
    plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")
    
    # Label the plot.
    plt.title("Training & Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    epoch_list = [i+1 for i in range(epochs)]
    plt.xticks(epoch_list)
    
    plt.show()
    
    
    train_pred = convert_logits_tolabel(train_logits)
    
    return model,train_pred,eval_pred,train_label,eval_label

def get_prediction(df_test,model,batch_size,max_length,model_name):
    if model_name=='bert':
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    elif model_name=='roberta':
        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    else:
        raise SystemExit('Invalid model_name, model could only be one of [bert, roberta] ')

    test_dataset = tokenize_process(df_test,tokenizer,max_length)
    test_dataloader = DataLoader(test_dataset,sampler = SequentialSampler(test_dataset),batch_size = batch_size)
    
    test_logits = []
    test_label = []
    model.eval()
    
    for batch in test_dataloader:
        # Unpack this test batch from our dataloader.
        # `batch` contains three pytorch tensors: [0]: input ids; [1]: attention masks; [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].long().to(device)
        
        with torch.no_grad():        
            (loss, logits) = model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)
            
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()
            test_logits.extend(logits)
            test_label.extend(label_ids)
            
    # Convert the logits to labels with the helper `convert_logits_tolabel` defined earlier.
    test_logits = convert_logits_tolabel(test_logits)
            
    return test_logits,test_label

In [10]:
def train_val_test(train_ratio = 0.75, validation_ratio = 0.15, test_ratio = 0.10):
    """
    Remove missing values, split the data into training, validation and test sets,
    then over-sample the training set only.
    Input: the split ratios (the global X, y returned by make_dataset are used as data)
    Output: over-sampled training dataframe, validation dataframe, test dataframe
    """
    print('Preprocessing:\n')
    df = pd.DataFrame({'X':pd.Series(X),'y':pd.Series(y)})
    print('{} na data found'.format(len(df[df['X'].isna() == True].index)))
    df = df.dropna()
    print('na data dropped')
    
    df_train, df_test = train_test_split(df, test_size=1 - train_ratio, random_state = seed_val)
    df_val, df_test = train_test_split(df_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state = seed_val) 
    
    X_train, y_train = df_train['X'], df_train['y']
    X_val, y_val = df_val['X'], df_val['y']
    X_test, y_test = df_test['X'], df_test['y']
    
    print('[X training set shape, X validation set shape, X test set shape]:',y_train.shape, y_val.shape, y_test.shape)

    
    # over-sample the Training set, then transform them to right form
    df_train_upsampled = over_sampling(X_train, y_train)
    df_train_upsampled = transform_df(df_train_upsampled)
    
    # transform the validation and test sets to the right form
    df_val = pd.DataFrame({'X':pd.Series(X_val),'y':pd.Series(y_val)})
    df_val = transform_df(df_val)
    
    df_test = pd.DataFrame({'X':pd.Series(X_test),'y':pd.Series(y_test)})
    df_test = transform_df(df_test)
    
    
    return df_train_upsampled, df_val, df_test

def transformer_cls(df_train,df_val,
             epochs = 10,
             batch_size =16,
             max_length=128,
             model_name='roberta',
             lr = 5e-5,
             weight_decay = 1e-2,
             freeze_layer_count=1,
             scheduler_type='step',
             decayRate=0.75):
    
    
    #wandb update
    wandb.config.max_length = max_length
    wandb.config.model = model_name
    wandb.config.weight_decay = weight_decay
    wandb.config.freeze_layer_count = freeze_layer_count
    wandb.config.optimizer = 'AdamW'
    wandb.config.scheduler_type = scheduler_type
    wandb.config.decayRate = decayRate
    
    
    print('\n======================Doing Bert classification task======================')
    
    """
    step1: Tokenization
    """
    #print('Do step1: Tokenization\n')
    if model_name=='bert':
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    elif model_name=='roberta':
        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    else:
        raise SystemExit('Invalid model_name, model could only be one of [bert, roberta] ')
    
    train_dataset = tokenize_process(df_train,tokenizer,max_length)
    val_dataset = tokenize_process(df_val,tokenizer,max_length)
    
    """
    step2: create dataloader for both training and eval set
    """
    
    
    # Create the DataLoaders for our training and validation sets.
    # We'll take training samples in random order. 
    train_dataloader = DataLoader(train_dataset,sampler = RandomSampler(train_dataset), batch_size = batch_size)
    
    # For validation and test the order doesn't matter, so we'll just read them sequentially.
    validation_dataloader = DataLoader(val_dataset,sampler = SequentialSampler(val_dataset),batch_size = batch_size)
    
    """
    step3: load bert model
    """
    #print('Do step3: load bert model\n')
    model,optimizer,scheduler,total_steps = model_and_helper(epochs,train_dataloader,model_name,freeze_layer_count,lr,weight_decay,scheduler_type,decayRate)
    
    """
    step4: Training and evaluating
    """
    #print('Do step4: Training and evaluating\n')
    model,train_pred,eval_pred,train_label,eval_label = Training_and_evaluating(model,train_dataloader,validation_dataloader,optimizer,epochs,scheduler,total_steps)
    
    
    return model,train_pred,eval_pred,train_label,eval_label

In [11]:
def model_and_helper(epochs,train_dataloader,
                     model_name='roberta',
                     freeze_layer_count=1,
                     lr = 5e-5,
                     weight_decay = 1e-2,
                     scheduler_type='linear',
                     decayRate=0.75):
    
    if model_name == 'bert':
        print('\nLoading bert model.')
        model = BertForSequenceClassification.from_pretrained(
            "bert-base-uncased",  num_labels = 2,  output_attentions = False, output_hidden_states = False,  return_dict = False)
        
        
    elif model_name == 'roberta':
        model = RobertaForSequenceClassification.from_pretrained(
            'roberta-base', num_labels = 2, output_attentions = False, output_hidden_states = False, return_dict = False)
        
    else:
        raise SystemExit('Invalid model_name, model could only be one of [bert, roberta] ')
        
    
    
    # Move the model to the selected device (GPU if available, otherwise CPU).
    model.to(device)
    
    # Freeze the embeddings and, optionally, the first `freeze_layer_count` encoder layers.
    if freeze_layer_count:
        if model_name == 'bert':
            for param in model.bert.embeddings.parameters():
                param.requires_grad = False
    
            if freeze_layer_count != -1:
            # if freeze_layer_count == -1, we only freeze the embedding layer
            # otherwise we freeze the first `freeze_layer_count` encoder layers
                for layer in model.bert.encoder.layer[:freeze_layer_count]:
                    for param in layer.parameters():
                        param.requires_grad = False
                        
        if model_name == 'roberta':
            for param in model.roberta.embeddings.parameters():
                param.requires_grad = False
    
            if freeze_layer_count != -1:
            # if freeze_layer_count == -1, we only freeze the embedding layer
            # otherwise we freeze the first `freeze_layer_count` encoder layers
                for layer in model.roberta.encoder.layer[:freeze_layer_count]:
                    for param in layer.parameters():
                        param.requires_grad = False
            
        
    
    # Note: AdamW is a class from the huggingface library (as opposed to pytorch), the 'W' stands for 'Weight Decay fix'
    optimizer = AdamW(model.parameters(),
                      lr = lr, 
                      eps = 1e-8, # args.adam_epsilon  - default is 1e-8.
                      weight_decay = weight_decay 
                    )
    
    # Total number of training steps is [number of batches] x [number of epochs]. 
    # (Note that this is not the same as the number of training samples).
    total_steps = len(train_dataloader) * epochs
    
    # set learning rate decay
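    if scheduler_type=='linear': 
        # The training loop calls scheduler.step() once per epoch, so the schedule
        # is defined over `epochs` steps rather than over `total_steps` batches.
        scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                num_warmup_steps = 0, # Default value
                                                num_training_steps = epochs)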
    elif scheduler_type=='step':
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=decayRate)
        
    elif scheduler_type=='exponential':
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decayRate)
        
    elif scheduler_type=='none':
        scheduler = None
    
    else:
        raise SystemExit('Invalid scheduler_type, it could only be one of [linear, step, exponential,none] ')
    

    return model,optimizer,scheduler,total_steps

In [12]:
def evaluate_result(train_pred,train_label,eval_pred,eval_label,test_pred,test_true,args):
    
    training_pred,training_true,evaluation_pred,evaluation_true = [],[],[],[]
    
    training_pred.append(train_pred)
    training_true.append(train_label)
    evaluation_pred.append(eval_pred)
    evaluation_true.append(eval_label)
    
    def flatten(t):
        return [item for sublist in t for item in sublist]
    
    evaluation_true = flatten(evaluation_true)
    evaluation_pred = flatten(evaluation_pred)
    training_true = flatten(training_true)
    training_pred = flatten(training_pred)
    
    
    f1_train = f1_score(training_true, training_pred,average='macro')
    f1_val = f1_score(evaluation_true, evaluation_pred,average='macro')
    f1_test = f1_score(test_true,test_pred,average='macro')
    
    wandb.log({"f1_train": f1_train,"f1_val": f1_val, "f1_test": f1_test})

    print('\nargs:',args)
    target_names = ['class 0', 'class 1']
    # For test data
    print('classification report on test set is:\n')
    clas_reprt_eval = classification_report(test_true, test_pred, target_names=target_names)
    print(clas_reprt_eval)
    
    print('confusion matrix on test set is:\n')
    cm_eval = confusion_matrix(test_true, test_pred)
    plot_confusion_matrix(cm_eval, ['No causal relation', 'Has causal relation'], normalize=False)
    
    return 

In [13]:
def transformer_run(X,y,args):
    
    
    df_train_upsampled, df_val, df_test = train_val_test(args['train_ratio'], args['validation_ratio'], args['test_ratio'])
    

    model,train_pred,eval_pred,train_label,eval_label = transformer_cls(df_train_upsampled,df_val,
                                                                          epochs = args['epochs'],
                                                                          batch_size =args['batch_size'],
                                                                          max_length=args['max_length'],
                                                                          model_name=args['model_name'],
                                                                          lr = args['lr'],
                                                                          weight_decay = args['weight_decay'],
                                                                          freeze_layer_count=args['freeze_layer_count'],
                                                                          scheduler_type=args['scheduler_type'],
                                                                          decayRate=args['decayRate'])
    
    test_pred,test_true = get_prediction(df_test,model,batch_size=args['batch_size'],max_length=args['max_length'],model_name=args['model_name'])
    
    evaluate_result(train_pred,train_label,eval_pred,eval_label,test_pred,test_true,args)

    
    del model
    torch.cuda.empty_cache()
    
    # Clear the cell output after each run if needed
    clear_output(wait=True)
    return 

### Fine-tuning the Transformer encoder 

Fine-tuning a transformer is not trivial: with only a small amount of data, the model can overfit severely and generalize poorly. Inspired by the blog post https://raphaelb.org/posts/freezing-bert/, freezing some layers of the transformer encoder seems reasonable when fine-tuning. The learning rate and the way it decays during training also affect how well the model trains. In addition, we use the AdamW optimizer, whose weight decay helps to limit overfitting. We therefore run a grid search over these hyperparameters to find which settings work best, and we visualize the training and validation losses together with the corresponding macro F1 scores. The parameters involved are:



* model_name: ['bert', 'roberta']
* learning rate: [5e-5, 1e-5, 5e-6, 1e-6]
* freeze_layer_count: [0, -1, 2, 4, 6, 8]
* learning rate decay methods: ['none', 'step']
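
To make the `freeze_layer_count` values concrete, here is a minimal sketch of a freezing rule based on parameter names. The meaning of the values is an assumption that loosely follows the blog post (0 trains everything, -1 freezes only the embeddings, k freezes the embeddings and the first k encoder layers); the model-setup code of this notebook may differ. The helper `is_frozen` is a hypothetical name, and the parameter names follow the usual Hugging Face naming scheme.

```python
import re

# Matches parameter names such as
# "bert.encoder.layer.3.attention.self.query.weight".
_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_frozen(param_name, freeze_layer_count):
    """Return True if the parameter should be excluded from training.

    Assumed convention (not verified against this notebook's setup code):
      0  -> train everything
      -1 -> freeze only the embeddings
      k  -> freeze the embeddings and the first k encoder layers
    """
    if freeze_layer_count == 0:
        return False
    if ".embeddings." in param_name:
        return True
    if freeze_layer_count == -1:
        return False
    match = _LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) < freeze_layer_count


names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.5.output.dense.weight",
    "bert.encoder.layer.6.output.dense.weight",
    "classifier.weight",
]
# With freeze_layer_count=6, layers 0..5 and the embeddings are frozen;
# layer 6 and the classification head stay trainable.
print([name for name in names if is_frozen(name, 6)])
```

With a loaded PyTorch model, the same predicate could be applied as `for name, param in model.named_parameters(): param.requires_grad = not is_frozen(name, k)`.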

### Use wandb to visualize the result

In [14]:
sweep_config = {
    'method': 'grid'
    }

metric = {
    'name': 'loss',
    'goal': 'minimize'   
    }

sweep_config['metric'] = metric

##### model * lr * frozen_layer * scheduler * 5mins/run
##### The estimated time required is:   2 * 4 * 6 * 2 * 5 / 60 = 8 hours
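
The run count behind this estimate can be checked by enumerating the grid. The sketch below uses the swept values listed earlier and the rough figure of 5 minutes per run; the variable names are illustrative only.

```python
import itertools

# Swept values, copied from the parameter list in this notebook.
grid = {
    "model_name": ["bert", "roberta"],
    "lr": [5e-5, 1e-5, 5e-6, 1e-6],
    "freeze_layer_count": [0, -1, 2, 4, 6, 8],
    "scheduler_type": ["none", "step"],
}
MINUTES_PER_RUN = 5  # rough per-run estimate, not a measured value

# A grid search runs every combination exactly once.
combinations = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]
n_runs = len(combinations)
estimated_hours = n_runs * MINUTES_PER_RUN / 60

print(n_runs, estimated_hours)  # 96 8.0
```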

In [16]:
args = {
    'epochs': {
          'value': 20},
    'batch_size': {
          'value': 20},
    'train_ratio': {
          'value': 0.75},
    'validation_ratio': {
          'value': 0.15},
    'test_ratio': {
          'value': 0.10},
    'max_length': {
          'value': 128},
    'model_name': {
          'values': ['bert','roberta']},
    'lr': {
        'values': [5e-5, 1e-5, 5e-6,1e-6]},
    'weight_decay': {
          'value': 0.05},
    'freeze_layer_count': {
        'values': [0,-1,2,4,6,8]},
    'scheduler_type': {
        'values': ['none','step']},
    'decayRate':{
        'value': 0.75}
    }

sweep_config['parameters'] = args
pprint.pprint(sweep_config)

sweep_id = wandb.sweep(sweep_config, project="CM_Experiments")

{'method': 'grid',
 'metric': {'goal': 'minimize', 'name': 'loss'},
 'parameters': {'batch_size': {'value': 20},
                'decayRate': {'value': 0.75},
                'epochs': {'value': 20},
                'freeze_layer_count': {'values': [0, -1, 2, 4, 6, 8]},
                'lr': {'values': [5e-05, 1e-05, 5e-06, 1e-06]},
                'max_length': {'value': 128},
                'model_name': {'values': ['bert', 'roberta']},
                'scheduler_type': {'values': ['none', 'step']},
                'test_ratio': {'value': 0.1},
                'train_ratio': {'value': 0.75},
                'validation_ratio': {'value': 0.15},
                'weight_decay': {'value': 0.05}}}
Create sweep with ID: eyflnxfh
Sweep URL: https://wandb.ai/tan3/CM_Experiments/sweeps/eyflnxfh


In [17]:
def exp_run(config=None):
    # Initialize a new wandb run
    with wandb.init(anonymous='allow',config=config):
        # If called by wandb.agent, as below,
        # this config will be set by Sweep Controller
        config = wandb.config
        transformer_run(X,y,config)
        return 

In [18]:
# The notebook output is cleared between runs because of RAM limitations
wandb.agent(sweep_id, exp_run)

0,1
Training Loss,█▆▄▅▃▅▃▅▂▆▂▃▄▃▄▁▃▂▁▄
Valid. Loss,█▆▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
Valid. Macro F1,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
epoch,▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
f1_test,▁
f1_train,▁
f1_val,▁
lr,█▆▅▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁

0,1
Training Loss,0.69547
Valid. Loss,0.71067
Valid. Macro F1,0.22353
epoch,20.0
f1_test,0.17757
f1_train,0.43478
f1_val,0.22353
lr,0.0


wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.


## RESULT

We evaluated two models, RoBERTa and BERT, with four learning rates, six `freeze_layer_count` settings, and two learning-rate decay methods. That is a lot of experiments, so we will analyze them step by step.

Please check https://wandb.ai/tan3/CM_Experiments?workspace=user-tan3 if you want to see more results.

### How many layers should we freeze?

Let's look at the visualization results:

The box plots below show the macro F1 scores of both models on the validation set. That is, they reveal how the macro F1 scores are distributed across the learning rates for each number of frozen layers.

For the RoBERTa model, freezing 6 layers gives both the highest maximum and the highest median, while freezing 0 layers yields the most stable result (lowest variance). For the BERT model, the highest maximum macro F1 score comes from freezing 0 layers, followed by freezing 6 layers. For both models, freezing 8 layers seems to be the worst option because it has the largest variance.

Therefore, it is reasonable to freeze 6 layers for the RoBERTa model and 0 or 6 layers for the BERT model in the later experiments. For simplicity, we only use the non-frozen version in the subsequent experiments with BERT.
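
The comparison above relies on three statistics per number of frozen layers: maximum, median, and variance. As a rough sketch of how they could be computed from exported run records, assuming one record per run with hypothetical field names `freeze_layer_count` and `val_macro_f1`; the numbers are placeholders invented for illustration and are not results from the sweep.

```python
from collections import defaultdict
from statistics import median, pvariance

# PLACEHOLDER records for illustration only, not results from the sweep.
runs = [
    {"freeze_layer_count": 0, "val_macro_f1": 0.60},
    {"freeze_layer_count": 0, "val_macro_f1": 0.62},
    {"freeze_layer_count": 6, "val_macro_f1": 0.50},
    {"freeze_layer_count": 6, "val_macro_f1": 0.70},
]


def summarize(records):
    """Group scores by number of frozen layers and summarize each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["freeze_layer_count"]].append(record["val_macro_f1"])
    return {
        count: {
            "max": max(scores),
            "median": median(scores),
            "variance": pvariance(scores),
        }
        for count, scores in grouped.items()
    }


summary = summarize(runs)
for count, stats in sorted(summary.items()):
    print(count, stats)
```

A high maximum combined with a high variance, as in the placeholder group with 6 frozen layers, means the setting can work well but is sensitive to the other hyperparameters.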


|<img src="Pics_CM\F1_Bert_ validation_lrs.png" style="width:600px;height:300px">|<img src="Pics_CM\F1_Roberta_ validation_lrs.png" style="width:600px;height:300px"> |
|-|-|

#### Results of RoBERTa (frozen_layers=6)

Next, let's look at the results for RoBERTa (frozen_layers=6). The graphs below show the losses on the training and validation sets, as well as the macro F1 scores on the validation and test sets.

Note: Because the curves are hard to tell apart when all runs are shown, we filtered out the runs with large loss values and a high risk of overfitting.

|<img src="Pics_CM\Train_Loss_roberta_6_filtered.png" style="width:600px;height:250px">|<img src="Pics_CM\val_Loss_roberta_6_filtered.png" style="width:600px;height:250px"> |
|-|-|

|<img src="Pics_CM\val_f1_roberta_6_filtered.png" style="width:650px;height:250px">| <img src="Pics_CM\f1_test_roberta_6_filtered.png" style="width:600px;height:250px">|
|-|-|


Note: The run colors in the graph below do not correspond to the colors in the graphs above. The graph only shows the hyperparameter settings of those runs.

<img src="Pics_CM\special_roberta_6_filtered.png" style="width:950px;height:300px">

#### Results of BERT (frozen_layers=0)

Next, let's look at the results for BERT (frozen_layers=0). The graphs below show the losses on the training and validation sets, as well as the macro F1 scores on the validation and test sets.

Note: Because the curves are hard to tell apart when all runs are shown, we filtered out the runs with large loss values and a high risk of overfitting.

|<img src="Pics_CM\Train_Loss_bert_0_filtered.png" style="width:600px;height:250px">|<img src="Pics_CM\val_Loss_bert_0_filtered.png" style="width:600px;height:250px"> |
|-|-|

|<img src="Pics_CM\val_f1_bert_0_filtered.png" style="width:650px;height:250px">| <img src="Pics_CM\f1_test_bert_0_filtered.png" style="width:600px;height:250px">|
|-|-|


Note: The run colors in the graph below do not correspond to the colors in the graphs above. The graph only shows the hyperparameter settings of those runs.

<img src="Pics_CM\special_bert_0_filtered.png" style="width:950px;height:300px">

### Which hyperparameter setting is optimal?

To see which hyperparameter settings give the best performance and generalization, we plot six settings for the BERT and RoBERTa models together in the figures below. "unique-sweep-72" appears to be the best model: it reaches a macro F1 score of more than 70% on the validation set and 63.43% on the test set (although its test score is not the highest). However, its validation loss shows some overfitting, even if only to a small degree (rising from 0.68 to 0.75). The overfitting may be caused by the small amount of data, so in notebook 1.4 we will perform a sensitivity analysis with different amounts of data.

|<img src="Pics_CM\Best_six_training.png" style="width:550px;height:250px">| <img src="Pics_CM\Best_six_validation.png" style="width:550px;height:250px">|
|-|-|

<img src="Pics_CM\Best_six_test.png" style="width:550px;height:250px">

<img src="Pics_CM\best_loss_f1.png" style="width:800px;height:250px">

The hyperparameter settings of the optimal model are:

best_args = {
    'epochs': {
          'value': 14},
    'batch_size': {
          'value': 20},
    'train_ratio': {
          'value': 0.75},
    'validation_ratio': {
          'value': 0.15},
    'test_ratio': {
          'value': 0.10},
    'max_length': {
          'value': 128},
    'model_name': {
          'value': 'roberta'},
    'lr': {
        'value': 1e-05},
    'weight_decay': {
          'value': 0.05},
    'freeze_layer_count': {
        'value': 6},
    'scheduler_type': {
        'value': 'step'},
    'decayRate':{
        'value': 0.75}
    }
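
`best_args` is written in the sweep format, where every entry is wrapped in a `{'value': ...}` dictionary, while `transformer_run` reads plain entries such as `args['epochs']`. A minimal sketch of a converter for re-running the best configuration outside a sweep follows; the helper name `flatten_sweep_parameters` is hypothetical and not part of the notebook or of wandb.

```python
def flatten_sweep_parameters(sweep_parameters):
    """Turn {'lr': {'value': 1e-05}, ...} into {'lr': 1e-05, ...}."""
    flat = {}
    for name, spec in sweep_parameters.items():
        # Entries with a list under 'values' are still being swept and
        # cannot be reduced to a single setting.
        if "value" not in spec:
            raise ValueError(f"parameter {name!r} has no single fixed 'value'")
        flat[name] = spec["value"]
    return flat


# Two entries copied from best_args; the remaining ones work the same way.
example = {"model_name": {"value": "roberta"}, "lr": {"value": 1e-05}}
run_args = flatten_sweep_parameters(example)
print(run_args)  # {'model_name': 'roberta', 'lr': 1e-05}
```

The flattened dictionary could then be passed in place of `args`, for example `transformer_run(X, y, flatten_sweep_parameters(best_args))`.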