Made by:
* Alberto Cano Turnes


# **Multi-Author Writing Style Analysis (style change detection) 2023**

# **1. Importing Python Libraries and preparing the environment**

At this step we will be importing the libraries and modules needed to run our script. Libraries are:
* Transformers
* Transformers tokenizers for RoBERTa, BERT and DistilBERT
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Numpy
* Pandas
* os
* re
* json

In [1]:
import transformers
from transformers import RobertaTokenizer, BertTokenizer, DistilBertTokenizer
import torch
from torch import cuda
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
import numpy as np
import pandas as pd
from sklearn import metrics
import os
import re
import json

  from .autonotebook import tqdm as notebook_tqdm


## Setting constant variables

We set some constant variables that will be used for the Dataloader and for pre-processing the data. We also set the tokenizer and cuda as the device (in case we have a GPU, otherwise, the CPU will be used).

In [2]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 128 # @param {type:"integer"}
TRAIN_BATCH_SIZE = 32 # @param {type:"integer"}
VALID_BATCH_SIZE = 16 # @param {type:"integer"}
EPOCHS = 8 # @param {type:"integer"}
LEARNING_RATE = 1e-5 # @param {type:"number"}

# Defining paths to the data
TRAIN_EASY_PATH="./data/dataset_easy/train/" # @param {type:"string"}
TRAIN_MEDIUM_PATH="./data/dataset_medium/train/" # @param {type:"string"}
TRAIN_HARD_PATH="./data/dataset_hard/train/" # @param {type:"string"}
VALIDATION_EASY_PATH="./data/dataset_easy/validation/" # @param {type:"string"}
VALIDATION_MEDIUM_PATH="./data/dataset_medium/validation/" # @param {type:"string"}
VALIDATION_HARD_PATH="./data/dataset_hard/validation/" # @param {type:"string"}

# Setting the tokenizer to be used
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Setting up the device for GPU usage
device = 'cuda' if cuda.is_available() else 'cpu'

# **2. Importing and Pre-Processing the domain data**

In this next part, we will do the necessary steps to load the dataset and pre-process it in order to carry out the chosen task.

## Functions to import and pre-process the data

In this cell below, we declare some functions that we will be used to import and pre-process the data provided. The functions as follows:

* `txt_to_array(txt_path)`: This function receives the path of the txt files with the paragraphs and return its content as a list of paragraphs.

* `get_json_labels(json_path)`: This function receives the path of the json files with the labels and return its content as a dictionary.

* `load_files(data_path)`: This function loads the txt and json files from a given path.

* `concatenate_paragraphs(df)`: This function concatenate adjacent paragraphs from a dataframe into a new dataframe.

* `create_csv(train_path, validation_path)`: This function create the final CSV file from text and label files that will be used to train and test the models.

Since we are asked to check if 2 consecutive paragraphs were written by the same author, we think that splitting the data into consecutive paragraphs and concetenate is a good idea,since this will allow us to transform the task into a binary classification problem where 0 means the paragraphs were written by the same author and 1 means they were written by different.

In [3]:
# Function to read a text file and return its content as a list of paragraphs
def txt_to_array(txt_path):
    with open(txt_path, 'r', encoding='utf-8') as file:
        content = file.read()
    paragraph_list = content.split('\n')

    return paragraph_list

# Function to read a JSON file and return its content as a dictionary
def get_json_labels(json_path):
    with open(json_path, 'r', encoding='utf-8') as file:
        content = file.read()
        
    return json.loads(content)

# Function to load text and label files from a given directory
def load_files(data_path):
    paragraph4files = {}
    labels4files = {}
    for file in os.listdir(data_path):
        if file.startswith('problem-') and file.endswith('.txt'):
            matches = re.search(r'problem-(\d+).txt', file)

            if matches:
                file_number = int(matches.group(1))
                file_path = os.path.join(data_path, file)
                labels_path = os.path.join(data_path,f"truth-problem-{file_number}.json")

                paragraph_list = txt_to_array(file_path)
                paragraph4files[file_number-1] = paragraph_list
                labels4files[file_number-1] = get_json_labels(labels_path)

    return paragraph4files, labels4files

# Function to concatenate adjacent paragraphs in a dataframe
def concatenate_paragraphs(df):
    new_texts = []
    new_changes = []
    for i in range(len(df) - 1):
        combined_texts = []
        changes = []
        for j in range(len(df['texts'][i]) - 1):
            combined_text = df['texts'][i][j] + '[CLS]' + df['texts'][i][j + 1] # [CLS] token is used to separate paragraphs so that the model can learn to distinguish between them
            change = [df['changes'][i][j]]
            combined_texts.append(combined_text)
            changes.append(change)

        new_texts.extend(combined_texts)
        new_changes.extend(changes)

    return new_texts, new_changes

# Function to create a CSV file from text and label files
def create_csv(train_path, validation_path):
    train_paragraphs, train_labels = load_files(train_path)
    validation_paragraphs, validation_labels = load_files(validation_path)

    train_texts = list(train_paragraphs.values())
    train_changes = [v['changes'] for k, v in train_labels.items()]

    validation_texts = list(validation_paragraphs.values())
    validation_changes = [v['changes'] for k, v in validation_labels.items()]

    df_train = pd.DataFrame({'texts': train_texts, 'changes': train_changes})
    df_validation = pd.DataFrame({'texts': validation_texts, 'changes': validation_changes})

    concatenated_train_paragraphs, paragraph_train_change = concatenate_paragraphs(df_train)
    concatenated_validation_paragraphs, paragraph_validation_change = concatenate_paragraphs(df_validation)

    df_train = pd.DataFrame({'texts': concatenated_train_paragraphs, 'changes': paragraph_train_change})
    df_validation = pd.DataFrame({'texts': concatenated_validation_paragraphs, 'changes': paragraph_validation_change})
    
    df_train.to_csv('data/dataset_csv/' + train_path.split('/')[2].lower() + '_train.csv', index=False)
    df_validation.to_csv('data/dataset_csv/' + validation_path.split('/')[2].lower()+ '_validation.csv', index=False)

    return df_train, df_validation

## Creating the dataframes for each difficulty

We create 6 datasets from the data provided, 3 for training and 3 for validation. This 6 datasets are based on the difficulty of the data provided:
* Easy: We create `easy_df_train` and `easy_validation_df` csv.
* Medium: We create `medium_df_train` and `medium_df_validation` csv.
* Hard: We create `hard_df_train` and `hard_df_validation` csv.

In [4]:
# Create easy train dataset and validation dataset from data folder and save them as csv files
easy_df_train, easy_df_validation = create_csv(TRAIN_EASY_PATH, VALIDATION_EASY_PATH)

print("Easy train dataset:")
print(easy_df_train.head())

print("\n\nEasy validation dataset:")
print(easy_df_validation.head())

Easy train dataset:
                                               texts changes
0  I'm not arguing with you here, I'm simply tryi...     [1]
1  He's at my place half the time and his fiance(...     [0]
2  Biden has the power to write an executive orde...     [1]
3  r/politics is currently accepting new moderato...     [1]
4  The inflation reduction act increases inflatio...     [0]


Easy validation dataset:
                                               texts changes
0  I think the premise of your question is off to...     [1]
1  You bought this in good faith and without clai...     [1]
2  Yes. That section of the code says what it say...     [1]
3  She doesnt “savage” anyone, no one is “fuming”...     [1]
4  I left Ohio in 1998. Never looked back. I've l...     [1]


In [22]:
# Create medium train dataset and validation dataset from data folder and save them as csv files
medium_df_train, medium_df_validation = create_csv(TRAIN_MEDIUM_PATH, VALIDATION_MEDIUM_PATH)

print("Medium train dataset:")
print(medium_df_train.head())

print("\n\nMedium validation dataset:")
print(medium_df_validation.head())

Medium train dataset:
                                               texts changes
0  Nevada has some very generous laws for squatte...     [0]
1  The timing of their entry is very suspect and ...     [0]
2  There may be sensitive information related to ...     [1]
3  Thank you for the info about the eviction firm...     [1]
4  Also the equipment they're largely relying on ...     [1]


Medium validation dataset:
                                               texts changes
0  I asked them about something of that nature an...     [1]
1  You probably need to secure the waiver from th...     [0]
2  I don’t know which school you’re at but at lea...     [1]
3  I have been involved in appeals for classes fo...     [1]
4  Did you start petitioning with the disability ...     [1]


In [23]:
# Create hard train dataset and validation dataset from data folder and save them as csv files
hard_df_train, hard_df_validation = create_csv(TRAIN_HARD_PATH, VALIDATION_HARD_PATH)

print("Hard train dataset:")
print(hard_df_train.head())

print("\n\nHard validation dataset:")
print(hard_df_validation.head())

Hard train dataset:
                                               texts changes
0  Great post. I'd add that another aspect of com...     [1]
1  A major philosophical challenger to this versi...     [0]
2  One additional irony that I would add is that ...     [0]
3  This leaves us with the third "national projec...     [0]
4  Im also well aware of the numerous atrocities ...     [0]


Hard validation dataset:
                                               texts changes
0  Everybody knows t was the leader of the senate...     [0]
1  Naw it came off in a negative way. Like making...     [1]
2  And I think there may also be other tactics, l...     [1]
3  Look… it’s partly true what you’re saying but,...     [1]
4  Maybe not lawmakers, but if a resident of Oreg...     [1]


If we see the changes of each dataframe, we noticed that the `easy_df_train` is unbalanced (theres much more 1 than 0). Because of this, we will create a sampler to balance this.

In [24]:
easy_df_train["changes"].value_counts()

changes
[1]    11345
[0]     1557
Name: count, dtype: int64

In [25]:
medium_df_train["changes"].value_counts()

changes
[0]    15000
[1]    13211
Name: count, dtype: int64

In [26]:
hard_df_train["changes"].value_counts()

changes
[0]    10090
[1]     9018
Name: count, dtype: int64

## Custom dataset class

The class created below has been designed to preprocess text from the dataframes, encode it using a tokenizer (in our case we are using the RoBERTa tokenizer), and prepare the data to use it in the models.

In [5]:
class ParagraphDataset(Dataset):
    # Initialize the dataset
    def __init__(self, dataframe, tokenizer, max_len):
        
        self.tokenizer = tokenizer  # Store the provided tokenizer
        self.data = dataframe  # Store the dataframe
        self.texts = dataframe.texts  # Extract the 'texts' column from the dataframe
        self.targets = self.data.changes  # Extract the 'changes' column from the dataframe
        self.max_len = max_len  # Store the maximum length

    # Return the length of the dataset based on the number of items in 'texts'
    def __len__(self):
        return len(self.texts)

    # Get a specific item from the dataset based on the index 'idx'
    def __getitem__(self, idx):
        texts = self.texts[idx]  # Fetch the text corresponding to the index
        
        # Tokenize the text using the provided tokenizer
        encoding = self.tokenizer.encode_plus(
            texts,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = encoding['input_ids']  # Extract token IDs
        mask = encoding['attention_mask']  # Extract attention masks

        # Return a dictionary containing token IDs, attention masks, target values, and the original text
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.targets[idx]),
            'texts': texts
        }

## Dataloader functions

In this cell below, we declare some functions that we will be used to load the data into the neural network in a defined manner. The functions are as follows:

* `create_sampler(df_train)`: This function receives the dataframe used for training and creates a weighted random sampler based on the class distribution.

* `create_dataloader(difficulty, df_train, df_validation, tokenizer, sampler)`: This function receives both train and validation dataframes, the tokenizer and the sampler and creates the training and validation dataloaders that loads the data into the neural network.



In [6]:
# Function to create a weighted random sampler based on the class distribution in the training dataset
def create_sampler(df_train):
    
    if isinstance(df_train['changes'].iloc[0], list):
        df_train['changes'] = df_train['changes'].apply(lambda x: x[0])

    class_counts = df_train['changes'].value_counts()
    num_samples = len(df_train)

    weights = 1. / class_counts[df_train['changes']]
    weights = weights.tolist()
    
    sampler = WeightedRandomSampler(weights, num_samples)

    return sampler

# Function to create data loaders for training and testing datasets
def create_dataloader(df_train, df_validation, tokenizer, sampler):

    train_dataset = df_train.sample(frac=1, random_state=200)
    test_dataset = df_validation.sample(frac=1)
    train_dataset = train_dataset.reset_index(drop=True)
    test_dataset = test_dataset.reset_index(drop=True)
    
    print("FULL Dataset: {}".format((df_train.shape[0] + df_validation.shape[0], df_train.shape[1])))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}\n".format(test_dataset.shape))

    training_set = ParagraphDataset(train_dataset, tokenizer, MAX_LEN)
    testing_set = ParagraphDataset(test_dataset, tokenizer, MAX_LEN)

    train_params = {'batch_size': TRAIN_BATCH_SIZE,
                    'sampler': sampler,
                    'num_workers': 0,
                    }

    test_params = {'batch_size': VALID_BATCH_SIZE,
                    'num_workers': 0,
                    }

    training_loader = DataLoader(training_set, **train_params)
    testing_loader = DataLoader(testing_set, **test_params)

    return training_loader, testing_loader

## Creating the dataloaders for each difficulty

We create 6 dataloaders from the dataframes provided, 3 for training and 3 for validation. This 6 dataloaders are based on the difficulty of the data provided:
* Easy: We create `easy_training_loader` and `easy_testing_loader`.
* Medium: We create `medium_training_loader` and `medium_testing_loader`.
* Hard: We create `hard_training_loader` and `hard_testing_loader`.

In [14]:
# Create a weighted random sampler for the easy dataset
easy_sampler = create_sampler(easy_df_train)

# Create dataloaders for the easy dataset
print("Easy:")
easy_training_loader, easy_testing_loader = create_dataloader(easy_df_train, easy_df_validation, tokenizer, easy_sampler)

Easy:
FULL Dataset: (15729, 2)
TRAIN Dataset: (12902, 2)
TEST Dataset: (2827, 2)



In [30]:
# Create a weighted random sampler for the medium dataset
medium_sampler = create_sampler(medium_df_train)

# Create dataloaders for the medium dataset
print("Medium:")
medium_training_loader, medium_testing_loader = create_dataloader(medium_df_train, medium_df_validation, tokenizer, medium_sampler)

Medium:
FULL Dataset: (35242, 2)
TRAIN Dataset: (28211, 2)
TEST Dataset: (7031, 2)



In [31]:
# Create a weighted random sampler for the hard dataset
hard_sampler = create_sampler(hard_df_train)

# Create dataloaders for the hard dataset
print("Hard:")
hard_training_loader, hard_testing_loader = create_dataloader(hard_df_train, hard_df_validation, tokenizer, hard_sampler)

Hard:
FULL Dataset: (23218, 2)
TRAIN Dataset: (19108, 2)
TEST Dataset: (4110, 2)



# **3. Creating the Neural Network for Fine Tuning**

The `RoBERTaClass` is a custom neural network model that utilizes the RoBERTa transformer model for natural language processing tasks. It is designed to perform classification tasks by taking input text sequences and predicting a binary output.

The class consists of three main layers:
1. `self.l1`: This layer represents the RoBERTa model, which is a pre-trained transformer model for language understanding. It is loaded from the 'roberta-base' pre-trained model using the `transformers` library.
2. `self.l2`: This layer is a dropout layer that helps prevent overfitting by randomly dropping out a fraction of the input units during training.
3. `self.l3`: This layer is a linear layer that maps the output of the RoBERTa model to a single output value. It performs the final classification based on the learned representations from the previous layers.

The `forward()` method defines the forward pass of the model. It takes input `ids` (tokenized input sequence) and `mask` (attention mask) as input and passes them through the layers in the defined order. The output of the model is the predicted output value.

In summary, the `RoBERTaClass` encapsulates the RoBERTa model and provides a convenient interface for performing classification tasks on text data.

We choose RoBERTa over other transofrmer models due to the good results it has given compared to other models like BERT or DistilBERT.

In [32]:
# Define the RoBERTaClass
class RoBERTaClass(torch.nn.Module):
    def __init__(self):
        super(RoBERTaClass, self).__init__()
        self.dropout = 0.5
        self.hidden_embd = 768
        self.output_layer = 1
        # Declare the layers here
        self.l1 = transformers.RobertaModel.from_pretrained('roberta-base')
        self.l2 = torch.nn.Dropout(self.dropout)
        self.l3 = torch.nn.Linear(self.hidden_embd, self.output_layer)

    def forward(self, ids, mask):
        # Use the transformer, then the dropout and the linear in that order.
        _, output_1 = self.l1(ids, attention_mask = mask, return_dict=False)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

## Creating the loss function

The loss function used will be a combination of Binary Cross Entropy which is implemented as [BCELogits Loss]

In [17]:
# Define the loss function
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs.view(-1), targets.view(-1))

## Creating the models for each difficulty

We create 3 RoBERTa models for each difficulty alongside their optimizers:
* Easy: We create `easy_model` and `easy_optimizer`.
* Medium: We create `medium_model` and `medium_optimizer`.
* Hard: We create `hard_model` and `hard_optimizer`.

The optimizers are used to update the weights of the neural network to improve its performance.

In [34]:
# Create instances of the RoBERTaClass for easy dataset
easy_model = RoBERTaClass().to(device)

# Define the optimizers for easy dataset
easy_optimizer = torch.optim.Adam(params = easy_model.parameters(), lr=LEARNING_RATE)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [35]:
# Create instances of the RoBERTaClass for medium dataset
medium_model = RoBERTaClass().to(device)

# Define the optimizers for medium dataset
medium_optimizer = torch.optim.Adam(params = medium_model.parameters(), lr=LEARNING_RATE)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
# Create instances of the RoBERTaClass for hard dataset
hard_model = RoBERTaClass().to(device)

# Define the optimizers for hard dataset
hard_optimizer = torch.optim.Adam(params = hard_model.parameters(), lr=LEARNING_RATE)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training setup

The `train` function is a training loop that is used to train a neural network model on a given dataset. It takes as input the model, training data loader, optimizer, and the current epoch number. 

Inside the function, it sets the model to training mode using `model.train()`. Then, it iterates over the batches of data in the training loader. For each batch, it performs the following steps:

1. Moves the input data (`ids`, `mask`, and `targets`) to the device (e.g., GPU) for computation.
2. Passes the input data through the model to obtain the predicted outputs.
3. Zeros out the gradients of the optimizer.
4. Computes the loss between the predicted outputs and the target values using the `loss_fn` function.
5. Prints the loss value every 10 iterations.
6. Performs backpropagation to compute the gradients of the model parameters with respect to the loss.
7. Updates the model parameters using the optimizer's update rule.

The purpose of this function is to train the model by optimizing its parameters based on the provided training data and the specified loss function.

In [10]:
# Define the training function

def train(difficulty, model, training_loader, optimizer, epoch):
    model.train()
    for _,data in enumerate(training_loader, 0):
       
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask)

        optimizer.zero_grad()

        loss = loss_fn(outputs, targets)

        if _%10==0:
            print(f'{difficulty} dataset, Epoch: {epoch}, Loss:  {loss.item()}')
            
        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

## Initialize the training

The cells below initialize the training for each difficulty dataframe.

In [41]:
# Start training for easy dataset
for epoch in range(EPOCHS):
    train("Easy", easy_model, easy_training_loader, easy_optimizer, epoch)  

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Easy dataset, Epoch: 0, Loss:  0.7009953260421753
Easy dataset, Epoch: 0, Loss:  0.5989694595336914
Easy dataset, Epoch: 0, Loss:  0.44915711879730225
Easy dataset, Epoch: 0, Loss:  0.302805095911026
Easy dataset, Epoch: 0, Loss:  0.4931529760360718
Easy dataset, Epoch: 0, Loss:  0.3349299132823944
Easy dataset, Epoch: 0, Loss:  0.309378445148468
Easy dataset, Epoch: 0, Loss:  0.43250390887260437
Easy dataset, Epoch: 0, Loss:  0.3323203921318054
Easy dataset, Epoch: 0, Loss:  0.3676379323005676
Easy dataset, Epoch: 0, Loss:  0.2547387480735779
Easy dataset, Epoch: 0, Loss:  0.26252469420433044
Easy dataset, Epoch: 0, Loss:  0.27132314443588257
Easy dataset, Epoch: 0, Loss:  0.2526576817035675
Easy dataset, Epoch: 0, Loss:  0.1399601250886917
Easy dataset, Epoch: 0, Loss:  0.09294439852237701
Easy dataset, Epoch: 0, Loss:  0.24725991487503052
Easy dataset, Epoch: 0, Loss:  0.2784508466720581
Easy dataset, Epoch: 0, Loss:  0.15708550810813904
Easy dataset, Epoch: 0, Loss:  0.221536889672

In [22]:
# Start training for medium dataset
for epoch in range(EPOCHS):
    train("Medium", medium_model, medium_training_loader, medium_optimizer, epoch)

Medium dataset, Epoch: 0, Loss:  0.7261993288993835
Medium dataset, Epoch: 0, Loss:  0.6842933297157288
Medium dataset, Epoch: 0, Loss:  0.7016382217407227
Medium dataset, Epoch: 0, Loss:  0.6757066249847412
Medium dataset, Epoch: 0, Loss:  0.6748214960098267
Medium dataset, Epoch: 0, Loss:  0.6900665760040283
Medium dataset, Epoch: 0, Loss:  0.6605957746505737
Medium dataset, Epoch: 0, Loss:  0.5572476387023926
Medium dataset, Epoch: 0, Loss:  0.691226601600647
Medium dataset, Epoch: 0, Loss:  0.599398136138916
Medium dataset, Epoch: 0, Loss:  0.743026852607727
Medium dataset, Epoch: 0, Loss:  0.651751697063446
Medium dataset, Epoch: 0, Loss:  0.6179503202438354
Medium dataset, Epoch: 0, Loss:  0.6795493960380554
Medium dataset, Epoch: 0, Loss:  0.5032213926315308
Medium dataset, Epoch: 0, Loss:  0.5402302742004395
Medium dataset, Epoch: 0, Loss:  0.5252580046653748
Medium dataset, Epoch: 0, Loss:  0.5791972875595093
Medium dataset, Epoch: 0, Loss:  0.6022526025772095
Medium dataset, 

In [12]:
# Start training for hard dataset
for epoch in range(EPOCHS):
    train("Hard", hard_model, hard_training_loader, hard_optimizer, epoch)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Hard dataset, Epoch: 0, Loss:  0.69991135597229
Hard dataset, Epoch: 0, Loss:  0.6875680685043335
Hard dataset, Epoch: 0, Loss:  0.6750866174697876
Hard dataset, Epoch: 0, Loss:  0.7164607644081116
Hard dataset, Epoch: 0, Loss:  0.692764401435852
Hard dataset, Epoch: 0, Loss:  0.6873019933700562
Hard dataset, Epoch: 0, Loss:  0.6840401887893677
Hard dataset, Epoch: 0, Loss:  0.6561689376831055
Hard dataset, Epoch: 0, Loss:  0.7137378454208374
Hard dataset, Epoch: 0, Loss:  0.6998664736747742
Hard dataset, Epoch: 0, Loss:  0.6876404285430908
Hard dataset, Epoch: 0, Loss:  0.6967781782150269
Hard dataset, Epoch: 0, Loss:  0.6689280271530151
Hard dataset, Epoch: 0, Loss:  0.7008069753646851
Hard dataset, Epoch: 0, Loss:  0.6713554859161377
Hard dataset, Epoch: 0, Loss:  0.6972436904907227
Hard dataset, Epoch: 0, Loss:  0.6772338151931763
Hard dataset, Epoch: 0, Loss:  0.690706729888916
Hard dataset, Epoch: 0, Loss:  0.6620197296142578
Hard dataset, Epoch: 0, Loss:  0.6989880800247192
Hard

## Saving the trained models

In case you want to save the already trained models, uncomment the next cell.

In [15]:
#torch.save(easy_model, 'easy_model.pt')
#torch.save(medium_model, 'medium_model.pt')
#torch.save(hard_model, 'hard_model.pt')

# **4. Model Validation**

During the validation stage we pass the unseen data(Testing Dataset) to the model. This step determines how good the model performs on the unseen data.

This unseen data are the testing_dataloaders we created way before from the validation data.

The cell belows defines two functions: `validation()` and `get_metrics()`. 

* `validation()`: Takes a trained model and a testing data loader as input. It evaluates the model on the testing data and returns the predicted outputs and the true targets. It also prints examples of predictions along with the corresponding text data.

* `get_metrics()`: Takes a difficulty level, a trained model, and a testing data loader as input. It calls the `validation()` function to get the predicted outputs and true targets. Then, it calculates various evaluation metrics such as:
- Accuracy
- Balanced Accuracy 
- Precision
- Recall
- F1 Score 

This metrics are calculated based on the predicted outputs and true targets. Finally, it prints the calculated metrics for the specified difficulty level.

In [11]:
# Function to perform validation on the model
def validation(model, testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    example_predictions = []
    idx = 0
    
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            text_data = data['texts']
            
            outputs = model(ids, mask)

            predictions = torch.sigmoid(outputs).cpu().detach().numpy().tolist()
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())

            if idx < 5:
                example_predictions.append({
                    'texts': text_data[0],
                    'prediction': predictions[0],
                    'true_label': fin_targets[len(fin_targets) - len(targets)]
                })

                idx+=1

    # Show examples of predictions along with the text after validation
    for i, example in enumerate(example_predictions):
        paragraphs = example['texts'].split('[CLS]')
        
        print(f"Example {i + 1}:\n")
        print(f"Paragraph 1: {paragraphs[0]}")
        print(f"Paragraph 2: {paragraphs[1]}\n")
        print(f"Prediction: {example['prediction']}, Ground truth: {example['true_label']}\n")
                
    return fin_outputs, fin_targets

# Function to calculate metrics for the model
def get_metrics(difficulty, model, testing_loader):

    print(f"Metrics for {difficulty} dataset:")

    outputs, targets = validation(model, testing_loader)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    balanced_accuracy = metrics.balanced_accuracy_score(targets, outputs)
    precision = metrics.precision_score(targets, outputs)
    recall = metrics.recall_score(targets, outputs)
    f1 = metrics.f1_score(targets, outputs)
    
    print(f"\nAccuracy Score = {accuracy}")
    print(f"Balanced accuracy Score = {balanced_accuracy}")
    print(f"Precision Score = {precision}")
    print(f"Recall Score = {recall}")
    print(f"F1 Score = {f1}\n")

## Generating metrics for each difficulty

We generate the metrics of each model using the validation data of its corresponding difficulty.

In [26]:
get_metrics("Easy", easy_model, easy_testing_loader)

Metrics for Easy dataset:
Example 1:

Paragraph 1: Faith was meant to be a healthy and disciplined choice one could make to consciously build their own belief system in their own journey to contentment, peace and happiness with life, but now it has been perverted to create small armies of people who have allowed their subconscious minds to be infiltrated with bigotry, greed, and judgement, which can never help them on their journey to happiness and inner peace, but provides them with a sense of belonging, unity, tribalism, superiority, and purpose nonetheless, and is unfortunately accompanied by the same mechanism that makes traditional Faith so effective by guarding the subconscious from accepting any new information that is not aligned with these trusted belief systems.
Paragraph 2: He’s running low. A lot of attorneys won’t represent him now (there’s a joke in the legal world that says MAGA stands for “make attorneys get attorneys”). And I suspect those who are open to representing 

In [27]:
get_metrics("Medium", medium_model, medium_testing_loader)

Metrics for Medium dataset:




Example 1:

Paragraph 1: The value of used clothing is quite small. She'd have to use ebay, craigslist, and goodwill to prove how much damage was caused. She'd have to find the closest used item for value. It would likely be misdemeanor territory, and I don't know if cops would even bother charging a case like that, especially if they know you aren't going to talk to them. I will say if you talk to them and they can get you to admit any illegal actions (such as the illegal eviction) then they will 100% charge you because you admitted it. So keep your mouth shut. Block the ex and have no contact of any kind. If anyone knocks on your door ignore them, whether it's your ex or cops.
Paragraph 2: You don't talk. If the police have enough to arrest you, they won't need to talk, they'll arrest you. If they are not yet arresting you, it's because they need you to say something--anything--that gives them just enough of a crack of light to make an arrest. So yes, you don't talk, and if arrested,

In [14]:
get_metrics("Hard", hard_model, hard_testing_loader)

Metrics for Hard dataset:
Example 1:

Paragraph 1: Is he even allowed to be around kids? I suggest reaching out to one of the counselors or case workers about your concerns. You could also reach out to a child abuse hotline to ask questions. But people you work with are mandatory reporters and have training to know what to do next.
Paragraph 2: You should ask your supervisors what their plans are as you should be mandatory reporters, yes? That means even with HIPAA you have a duty to report the danger to authorities. However I don't know if this counts as immediate danger? But you shouldn't be able to get in trouble for erring on the side of caution with your report. I'm not sure if you need to inform the police or CPS or both, but there should be some mechanism for you with mandatory reporting.

Prediction: [0.9952267408370972], Ground truth: [1.0]

Example 2:

Paragraph 1: On the plus side, now you probably have an easy out if you want to break your lease early! Otherwise, I think yo

# **5. Model comparison**
Here we will show the comparison of `RoBERTa` to other base transformer models to see why is it better. The models we will test are `BERT` and `DistilBERT`. For the sake of time, we will only test the easy data.

In [51]:
bert_easy_training_loader, bert_easy_testing_loader = create_dataloader(easy_df_train, easy_df_validation, 
                                                                        BertTokenizer.from_pretrained('bert-base-cased'), easy_sampler)

FULL Dataset: (15729, 2)
TRAIN Dataset: (12902, 2)
TEST Dataset: (2827, 2)



In [15]:
distilbert_easy_training_loader, distilbert_easy_testing_loader = create_dataloader(easy_df_train, easy_df_validation, 
                                                                                    DistilBertTokenizer.from_pretrained('distilbert-base-cased'), easy_sampler)

FULL Dataset: (15729, 2)
TRAIN Dataset: (12902, 2)
TEST Dataset: (2827, 2)



In [53]:
# Define the RoBERTaClass
class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.dropout = 0.5
        self.hidden_embd = 768
        self.output_layer = 1
        # Declare the layers here
        self.l1 = transformers.RobertaModel.from_pretrained('bert-base-cased')
        self.l2 = torch.nn.Dropout(self.dropout)
        self.l3 = torch.nn.Linear(self.hidden_embd, self.output_layer)

    def forward(self, ids, mask):
        # Use the transformer, then the dropout and the linear in that order.
        _, output_1 = self.l1(ids, attention_mask = mask, return_dict=False)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

In [7]:
# Define the RoBERTaClass
class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.dropout = 0.5
        self.hidden_embd = 768
        self.output_layer = 1
        # Declare the layers here
        self.l1 = transformers.RobertaModel.from_pretrained('distilbert-base-cased')
        self.l2 = torch.nn.Dropout(self.dropout)
        self.l3 = torch.nn.Linear(self.hidden_embd, self.output_layer)

    def forward(self, ids, mask):
        # Use the transformer, then the dropout and the linear in that order.
        _, output_1 = self.l1(ids, attention_mask = mask, return_dict=False)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

In [55]:
# Create instances of the RoBERTaClass for easy dataset
bert_easy_model = BERTClass().to(device)

bert_easy_optimizer = torch.optim.Adam(params = bert_easy_model.parameters(), lr=LEARNING_RATE)

You are using a model of type bert to instantiate a model of type roberta. This is not supported for all configurations of models and can yield errors.
Some weights of RobertaModel were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['encoder.layer.8.output.dense.bias', 'encoder.layer.2.intermediate.dense.bias', 'encoder.layer.8.intermediate.dense.weight', 'encoder.layer.5.intermediate.dense.weight', 'encoder.layer.6.attention.output.LayerNorm.weight', 'encoder.layer.5.attention.self.query.bias', 'encoder.layer.4.attention.self.query.weight', 'encoder.layer.9.attention.output.dense.weight', 'encoder.layer.11.output.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.10.attention.output.dense.bias', 'encoder.layer.1.attention.output.dense.bias', 'encoder.layer.8.attention.output.dense.weight', 'encoder.layer.7.attention.output.dense.weight', 'encoder.layer.1.attention.self.value.weight', 'encoder.layer.2.output.dense.bias', '

In [8]:
# Create instances of the RoBERTaClass for easy dataset
distilbert_easy_model = DistilBERTClass().to(device)

distilbert_easy_optimizer = torch.optim.Adam(params = distilbert_easy_model.parameters(), lr=LEARNING_RATE)

You are using a model of type distilbert to instantiate a model of type roberta. This is not supported for all configurations of models and can yield errors.
Some weights of RobertaModel were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['encoder.layer.5.output.dense.weight', 'encoder.layer.9.attention.self.query.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.8.attention.self.key.weight', 'encoder.layer.0.attention.output.LayerNorm.bias', 'encoder.layer.2.output.LayerNorm.bias', 'encoder.layer.2.output.LayerNorm.weight', 'encoder.layer.7.attention.output.dense.weight', 'encoder.layer.2.attention.output.LayerNorm.weight', 'encoder.layer.9.output.LayerNorm.bias', 'encoder.layer.3.intermediate.dense.bias', 'encoder.layer.6.attention.output.dense.bias', 'encoder.layer.6.attention.output.LayerNorm.bias', 'encoder.layer.3.attention.self.value.bias', 'encoder.layer.11.attention.self.value.bias', 'encoder.layer.7.output.Laye

In [57]:
for epoch in range(EPOCHS):
    train("BERT Easy", bert_easy_model, bert_easy_training_loader, bert_easy_optimizer, epoch)  

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


BERT Easy dataset, Epoch: 0, Loss:  0.7603780031204224
BERT Easy dataset, Epoch: 0, Loss:  0.3703303337097168
BERT Easy dataset, Epoch: 0, Loss:  0.11205479502677917
BERT Easy dataset, Epoch: 0, Loss:  0.36880260705947876
BERT Easy dataset, Epoch: 0, Loss:  0.18664789199829102
BERT Easy dataset, Epoch: 0, Loss:  0.23761683702468872
BERT Easy dataset, Epoch: 0, Loss:  0.4722900390625
BERT Easy dataset, Epoch: 0, Loss:  0.5425524711608887
BERT Easy dataset, Epoch: 0, Loss:  0.3263493776321411
BERT Easy dataset, Epoch: 0, Loss:  0.41518545150756836
BERT Easy dataset, Epoch: 0, Loss:  0.5762661695480347
BERT Easy dataset, Epoch: 0, Loss:  0.38620203733444214
BERT Easy dataset, Epoch: 0, Loss:  0.5239861607551575
BERT Easy dataset, Epoch: 0, Loss:  0.3394910395145416
BERT Easy dataset, Epoch: 0, Loss:  0.31254613399505615
BERT Easy dataset, Epoch: 0, Loss:  0.6607599258422852
BERT Easy dataset, Epoch: 0, Loss:  0.33252453804016113
BERT Easy dataset, Epoch: 0, Loss:  0.5230900049209595
BERT 

In [60]:
get_metrics("BERT Easy", bert_easy_model, bert_easy_testing_loader)

Metrics for BERT Easy dataset:
Example 1:

Paragraph 1: EIC ships were turned away from ports in the rest of the colonies before they could offload their cargo, but the Governor of Massachusetts, Thomas Hutchinson, refused to give in to popular pressure and refused to allow the British ships to return to England (at issue here was the payment of import duties to the colonies, different than the waived export duties owed to England - Hutchinson's sons and some of his friends were consignees of the tea shipment, and not only would the colony lose the import duty payment, but also the potential income from the sale of the tea. He had a lot to lose, personally, in both real and political terms). When several ultimatums had come and gone, a general meeting of citizens at the Old South Meeting House broke up with "a general huzza for Griffin's wharf.".
Paragraph 2: In addition to the excellent reply above, it's important to note that although the so-called "Boston Tea Party" was a relatively

: 

In [18]:
for epoch in range(EPOCHS):
    train("DistilBERT Easy", distilbert_easy_model, distilbert_easy_training_loader, distilbert_easy_optimizer, epoch)  

DistilBERT Easy dataset, Epoch: 0, Loss:  0.8001834154129028
DistilBERT Easy dataset, Epoch: 0, Loss:  0.4950113594532013
DistilBERT Easy dataset, Epoch: 0, Loss:  0.4724515974521637
DistilBERT Easy dataset, Epoch: 0, Loss:  0.40167802572250366
DistilBERT Easy dataset, Epoch: 0, Loss:  0.552944540977478
DistilBERT Easy dataset, Epoch: 0, Loss:  0.3800603747367859
DistilBERT Easy dataset, Epoch: 0, Loss:  0.6455175280570984
DistilBERT Easy dataset, Epoch: 0, Loss:  0.47416335344314575
DistilBERT Easy dataset, Epoch: 0, Loss:  0.2460191398859024
DistilBERT Easy dataset, Epoch: 0, Loss:  0.256367564201355
DistilBERT Easy dataset, Epoch: 0, Loss:  0.32808569073677063
DistilBERT Easy dataset, Epoch: 0, Loss:  0.5097025036811829
DistilBERT Easy dataset, Epoch: 0, Loss:  0.41308775544166565
DistilBERT Easy dataset, Epoch: 0, Loss:  0.35733169317245483
DistilBERT Easy dataset, Epoch: 0, Loss:  0.2287450134754181
DistilBERT Easy dataset, Epoch: 0, Loss:  0.5161739587783813
DistilBERT Easy datas

In [19]:
get_metrics("DistilBERT Easy", distilbert_easy_model, distilbert_easy_testing_loader)

Metrics for DistilBERT Easy dataset:
Example 1:

Paragraph 1: Others had expressed openings to ideas of a collaboration with other nationalist groups (Mazzini after all had promoted the ideal of a community of the nations, which were to be regarded as “the individuals of humankind”, the nations being to humankind as citizens were to the nation), or rather to favor the idea of a cooperation of the nationalities (notably, the Austro-Hungarian Empire was sitting atop the many nationalities of the Balkans, even if the relations between the Italian Nationalists and the Triple Alliance are rather complicated and ambivalent); but those inclinations – which played into the traditional “democratic” tendencies towards an expansion which was mindful of the rights of the other nationalities, yet opposed to the natural expansive direction of the Italian imperialism, which looked for an affirmation within the Mediterranean – were among the first to be restrained and quelled. One of the rising stars 

`RoBERTa`:

- Accuracy Score = 0.9547
- Balanced Accuracy Score = 0.8740
- Precision Score = 0.9644
- Recall Score = 0.9841
- F1 Score = 0.9741

`BERT`:

- Accuracy Score = 0.8475
- Balanced Accuracy Score = 0.5586
- Precision Score = 0.8811
- Recall Score = 0.9527
- F1 Score = 0.9155

`DistilBERT`:

- Accuracy Score = 0.8511
- Balanced Accuracy Score = 0.6021
- Precision Score = 0.8925
- Recall Score = 0.9416
- F1 Score = 0.9164


By comparing the metrics obtanined of each model, we come to the conclusion that RoBERTa is the winner. In general, RoBERTa outperforms the other two models on all metrics, with higher accuracy, recall and F1-score, indicating a better ability to correctly classify positive and negative samples. It also has a very high precision and recall score, suggesting that it minimises both false positives and false negatives.

BERT performs better than DistilBERT on most metrics, but still lags behind RoBERTa in terms of overall accuracy and ability to balance precision and recall.

In conclusion, RoBERTa appears to be the superior model in this comparison, as it performs higher on all metrics evaluated.