# Notebook overview
> This notebook contains the implementation for the "Automatic Detection in Twitter of Non-Traumatic. A Deep Learning Approach" paper. The aim of this project is to automatically detect non-traumatic grief expressions in COVID-19-related tweets using deep learning techniques.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Reproducibility and Determinism

> To make our experiments reproducible, we set seed values at several points in the code and enable the deterministic modes of PyTorch and TensorFlow. This should allow other researchers to replicate the results when using the same hardware and library versions.

In [None]:
# Install necessary packages
!pip install sentencepiece
!pip install pytorch-lightning
!pip install --upgrade accelerate
!pip install emoji
!pip install framework-reproducibility

# Import required libraries
import random
import torch
import numpy as np
import os
from pytorch_lightning import seed_everything
import tensorflow as tf
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
import fwr13y.d9m.tensorflow as tf_determinism

# Set seed value for reproducibility

tf_determinism.enable_determinism()
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Make cuDNN deterministic and disable its autotuner
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
seed_everything(seed_val, workers=True)
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Configure CUDA for deterministic operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# Enable PyTorch deterministic mode
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Install required packages for NLP tasks
!pip install transformers datasets

# Import additional libraries
from google.colab import drive
from datasets import Dataset, DatasetDict
import pandas as pd
import sklearn as sk
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score, average_precision_score
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, \
 TrainingArguments, Trainer, pipeline, EarlyStoppingCallback

# Set the Transformers seed and enable full determinism
from transformers import set_seed, enable_full_determinism
set_seed(seed_val)
enable_full_determinism(seed_val)



INFO:lightning_fabric.utilities.seed:Seed set to 42


fwr13y.d9m.tensorflow.enable_determinism (version 0.6.0) has been applied to TensorFlow version 2.14.0
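
> As a minimal illustration of what seeding buys us, the sketch below uses only Python's standard `random` module. The helper `draw` is hypothetical and not part of the notebook; it shows that two generators created with the same seed produce the same sequence, which is the property the seeding calls above rely on.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return [rng.randint(0, 999) for _ in range(n)]

first_run = draw(42)
second_run = draw(42)

# Same seed, same sequence
print(first_run == second_run)  # True
```

> Seeding alone does not cover every source of variation: some GPU kernels are non-deterministic by default, which is why the cell above also switches on the deterministic modes of the frameworks.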


In [None]:
# Check if PyTorch is identifying a GPU
if torch.cuda.device_count() > 0:
    # If a GPU is available, print its name
    print(f'GPU detected. Currently using: "{torch.cuda.get_device_name(0)}"')
    # Set the device to GPU for accelerated computations
    device = torch.device("cuda")
else:
    # If no GPU is available, fall back to the CPU and inform the user to change the runtime type
    print('Currently using CPU. To utilize GPU acceleration, change the runtime type in the \'runtime\' tab.')
    device = torch.device("cpu")


GPU detected. Currently using: "Tesla T4"


## Selecting Pre-trained Models and Hyperparameters

> Below we list the pre-trained models that were fine-tuned and evaluated in our study. Each model checkpoint represents a unique configuration of the deep learning architecture used for the task.

In [None]:
# Define the model checkpoint to be used for the task.
# Uncomment the desired model_checkpoint or replace it with your own.

model_checkpoint = 'xlm-roberta-base' # This model is one of the top-performing models in our experiments for the Spanish dataset
#model_checkpoint = 'bert-base-multilingual-uncased'
#model_checkpoint = 'PlanTL-GOB-ES/roberta-base-bne'
#model_checkpoint = 'microsoft/deberta-base'
#model_checkpoint = 'microsoft/deberta-v3-base'
#model_checkpoint = 'dccuchile/bert-base-spanish-wwm-uncased'
#model_checkpoint = 'roberta-base'
#model_checkpoint = 'bert-base-uncased'


In [None]:
# Hyperparameters
batch_size = 64  # Batch size for training
num_train_epochs = 10  # Number of training epochs
learning_rate = 3e-5  # Learning rate for optimization
max_length = 64  # Maximum sequence length
weight_decay = 0.01  # Weight decay for regularization
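
> To give a feel for what these values imply, the following sketch estimates the number of optimizer steps. It assumes a single device, no gradient accumulation, and 774 training examples (the count shown in the `Map` progress output later in the notebook); early stopping may end training sooner.

```python
import math

batch_size = 64
num_train_epochs = 10
num_train_examples = 774  # assumption: count reported by the Map progress bar

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / batch_size)
max_total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, max_total_steps)  # 13 130
```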

## Data Preprocessing

>In this section, we'll perform some initial data preparation steps. This includes loading the dataset, converting text to lowercase, and performing other necessary operations. We've also included the code needed to train with the undersampled dataset.


In [None]:
# NOTE: Before running this cell, please make sure to download and hydrate the necessary files from the GitHub repository.
# Paths for the Spanish/English dataset
train_data_path = 'path_to_train_file_here'
test_data_path = 'path_to_test_file_here'

# Load the training data into a DataFrame
train_df_full = pd.read_csv(train_data_path, encoding='UTF-8', sep='\t')

# Display original label distribution
print("Original distribution - train:", train_df_full.value_counts("label"))

# Apply undersampling for class balancing
df_0 = train_df_full[train_df_full["label"] == 0].sample(n=484, random_state=42)
df_1 = train_df_full[train_df_full["label"] == 1][:]
train_df_under = pd.concat([df_0, df_1])

# Display label distribution after undersampling
print("Distribution after undersampling TRAIN:", train_df_under.value_counts("label"))

# Split the data into training and validation sets
train_df, valid_df = train_test_split(train_df_under, test_size=0.2, shuffle=True, stratify=train_df_under[['label']])

# Regroup the validation set by label (class 0 rows first); no rows are added or dropped
valid_df_0 = valid_df[valid_df["label"] == 0]
valid_df_1 = valid_df[valid_df["label"] == 1]
valid_df = pd.concat([valid_df_0, valid_df_1])

# Display final label distribution for training and validation sets
print("Final distribution TRAIN:", train_df.value_counts("label"))
print("Final distribution validation:", valid_df.value_counts("label"))

# Load the test data
test_df_full = pd.read_csv(test_data_path, encoding='UTF-8', sep='\t')

# Display original label distribution for test set
print("Original distribution - test:", test_df_full.value_counts("label"))

# Regroup the test set by label (class 0 rows first); no rows are added or dropped
dft_0 = test_df_full[test_df_full["label"] == 0]
dft_1 = test_df_full[test_df_full["label"] == 1]
test_df = pd.concat([dft_0, dft_1])

# Display label distribution of the regrouped test set
print("Distribution after regrouping - test:", test_df.value_counts("label"))

# Display example counts for each dataset
print("Examples in complete dataset:", len(train_df_full))
print("Examples used for training:", len(train_df))
print("Examples used for validation:", len(valid_df))
print("Examples used for test:", len(test_df))

# Rename 'tweet' column to 'text' for consistency
train_df = train_df.rename(columns={'tweet': 'text'})
valid_df = valid_df.rename(columns={'tweet': 'text'})
test_df = test_df.rename(columns={'tweet': 'text'})
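
> The two steps performed above, undersampling the majority class and a stratified train/validation split, can be sketched without pandas or scikit-learn. The helpers `undersample` and `stratified_split` and the toy data are hypothetical; `train_test_split` may allocate rows slightly differently, so treat this as an illustration of the idea rather than a reimplementation.

```python
import random

def undersample(records, majority_label, n_keep, seed=42):
    """Keep n_keep random majority rows and every row of the other class."""
    rng = random.Random(seed)
    majority = [r for r in records if r["label"] == majority_label]
    minority = [r for r in records if r["label"] != majority_label]
    return rng.sample(majority, n_keep) + minority

def stratified_split(records, test_size=0.2, seed=42):
    """Split each class separately so both parts keep the class ratio."""
    rng = random.Random(seed)
    train, valid = [], []
    for label in sorted({r["label"] for r in records}):
        rows = [r for r in records if r["label"] == label]
        rng.shuffle(rows)
        n_valid = round(len(rows) * test_size)
        valid.extend(rows[:n_valid])
        train.extend(rows[n_valid:])
    return train, valid

def count(rows, label):
    """Number of rows carrying the given label."""
    return sum(1 for r in rows if r["label"] == label)

# Toy data (hypothetical): 30 rows of class 0 and 10 rows of class 1
toy = [{"id": i, "label": 0} for i in range(30)]
toy += [{"id": 100 + i, "label": 1} for i in range(10)]

balanced = undersample(toy, majority_label=0, n_keep=10)
train, valid = stratified_split(balanced)

print(count(balanced, 0), count(balanced, 1))  # 10 10
print(len(train), len(valid))                  # 16 4
print(count(valid, 0), count(valid, 1))        # 2 2
```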


## Data Preprocessing
> The preprocessing consists of converting all text to lowercase.

In [None]:
# Convert text to lowercase
train_df['text'] = train_df['text'].str.lower()
test_df['text'] = test_df['text'].str.lower()
valid_df['text'] = valid_df['text'].str.lower()

# Display the modified training DataFrame
train_df

Unnamed: 0,id,text,label
1168,1255752935956496380,"pobrecito , me imagino que la desesperación po...",1
911,1255119099157532670,"ya, a producir xk la jubilacion de los taitas,...",0
1166,1251868842856534020,"en el manejo de maria, todo aquel que fallecio...",0
93,1254311109282222080,conciertos privados con falta de asistencia sa...,0
542,1247172449860493310,la muerte de la madre de pep guardiola por cor...,1
...,...,...,...
1520,1245312369422741500,se está viendo eso frecuentemente en los casos...,0
863,1256378011919814660,"el mundo en tiempos de coronavirus: muerte, de...",0
437,1249016978020409340,comunicado de muerte de primer médico colombia...,1
1481,1245148954947354620,el mandatario nayib bukele confirmó personalme...,0


## Labeling and Formatting Datasets

>In this section, we define a function set_labels that assigns numerical labels to the dataset records. The labels are based on the original label value, where 0 is mapped to 0, and any other label is mapped to 1. This function is then applied to both the training and validation datasets.

>Finally, we reset the format of the datasets to default to ensure smooth processing.


In [None]:
# In this block, we convert the pandas DataFrames into dataset objects for further processing.
# We then display a portion of the datasets in pandas format, and reset the format to default.

# Convert the DataFrames into dataset objects
train_dataset = Dataset.from_pandas(train_df)
valid_dataset = Dataset.from_pandas(valid_df)

# Set the format of the datasets to pandas
train_dataset.set_format("pandas")
valid_dataset.set_format("pandas")

# Display the train dataset as a DataFrame ([:] selects every row; pandas abbreviates long output)
print(train_dataset[:])

# Display the validation dataset in the same way
print(valid_dataset[:])

# Reset the format of the datasets to default to avoid any potential issues
train_dataset.reset_format()
valid_dataset.reset_format()

                      id                                               text  \
0    1255752935956496380  pobrecito , me imagino que la desesperación po...   
1    1255119099157532670  ya, a producir xk la jubilacion de los taitas,...   
2    1251868842856534020  en el manejo de maria, todo aquel que fallecio...   
3    1254311109282222080  conciertos privados con falta de asistencia sa...   
4    1247172449860493310  la muerte de la madre de pep guardiola por cor...   
..                   ...                                                ...   
769  1245312369422741500  se está viendo eso frecuentemente en los casos...   
770  1256378011919814660  el mundo en tiempos de coronavirus: muerte, de...   
771  1249016978020409340  comunicado de muerte de primer médico colombia...   
772  1245148954947354620  el mandatario nayib bukele confirmó personalme...   
773  1255125062660747260             estoy listo para morir de coronavirus.   

     label  __index_level_0__  
0        1         

In [None]:
# This function takes a record as input, which contains a label named 'label'.
# If the value of this label is 0, it assigns 0 to the variable 'label'. If the value is not 0,
# it assigns 1 to 'label'. Then, the function returns a dictionary with the modified label, named 'labels'.
def set_labels(records):
  if records['label'] == 0:
    label = 0
  else:
    label = 1
  return {'labels': label}

# Map the set_labels function to the train and validation datasets.
# This function assigns a numerical label based on the original label value.
# Label = 0 is mapped to 0, and any other label is mapped to 1.

dataset_train = train_dataset.map(set_labels)
dataset_valid = valid_dataset.map(set_labels)

# Reset the format of the datasets to default to avoid any potential issues
dataset_train.reset_format()
dataset_valid.reset_format()

Map:   0%|          | 0/774 [00:00<?, ? examples/s]

Map:   0%|          | 0/194 [00:00<?, ? examples/s]
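
> The effect of the label mapping can be checked on a few hand-written records. The sketch below restates the rule of `set_labels` in compact form; the example records are hypothetical, and the label value 2 merely stands in for any non-zero category.

```python
def set_labels(record):
    """Map label 0 to 0 and every other value to 1 (same rule as in the notebook)."""
    return {"labels": 0 if record["label"] == 0 else 1}

# Hypothetical records
examples = [{"label": 0}, {"label": 1}, {"label": 2}]
mapped = [set_labels(r)["labels"] for r in examples]

print(mapped)  # [0, 1, 1]
```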

## Tokenization Process

>This section outlines the tokenization process, a crucial step in preparing textual data for machine learning tasks. Tokenization involves breaking down the text into smaller units, such as words or subwords, making it suitable for processing by machine learning models. In this context, the provided code accomplishes tokenization and streamlines the dataset for further processing.


In [None]:
# Initialize a tokenizer from the pre-trained model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Set the maximum sequence length
MAX_LEN = max_length

In [None]:
# Define the method to be mapped to the dataset to tokenize the data. This function takes a dictionary 'examples' as input, which contains a key named 'text'.
# The function uses the tokenizer to tokenize the text, truncates it if it exceeds the maximum length (MAX_LEN),
# and pads each batch passed by map() to the length of its longest sequence.
def tokenize_data(examples):

  return tokenizer(examples["text"], truncation=True, max_length=MAX_LEN, padding=True)

In [None]:
# Get all column names from the train and validation datasets.
columns_train = dataset_train.column_names
columns_valid = dataset_valid.column_names

# Take "labels" out of the removal list so that it is kept in the encoded datasets.
columns_train.remove("labels")
columns_valid.remove("labels")

# Tokenize the data and remove unnecessary columns.
encoded_dataset_train = dataset_train.map(tokenize_data, batched=True, remove_columns=columns_train)
encoded_dataset_valid = dataset_valid.map(tokenize_data, batched=True, remove_columns=columns_valid)


Map:   0%|          | 0/774 [00:00<?, ? examples/s]

Map:   0%|          | 0/194 [00:00<?, ? examples/s]
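
> Truncation and padding can be illustrated with a toy whitespace tokenizer. This is only a sketch: the real tokenizers used here split text into subwords and add special tokens, and the vocabulary, `pad_id`, and `unk_id` below are made up for the example.

```python
def encode_batch(texts, vocab, max_length, pad_id=0, unk_id=1):
    """Whitespace-tokenize, truncate to max_length, pad to the longest item in the batch."""
    ids = [[vocab.get(tok, unk_id) for tok in t.split()][:max_length] for t in texts]
    longest = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical vocabulary
vocab = {"estoy": 2, "listo": 3, "para": 4, "morir": 5, "de": 6, "coronavirus": 7}
batch = encode_batch(
    ["estoy listo para morir de coronavirus", "estoy listo"],
    vocab,
    max_length=4,
)

print(batch["input_ids"])       # [[2, 3, 4, 5], [2, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

> The attention mask marks padded positions with 0 so that the model can ignore them.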

## Metrics Definition and Setup


> In this section, we define the metrics used to evaluate the performance of the model. We also set up the configuration needed for training, including parameters such as the number of labels.



In [None]:
# Function to compute various metrics during evaluation
def compute_metrics(eval_pred):
  """
  Compute metrics for Trainer
  Args:
      eval_pred: The evaluation predictions

  Returns:
      Dictionary containing computed metrics
  """

  labels = eval_pred.label_ids
  preds = eval_pred.predictions.argmax(-1)

  # Compute precision, recall, F1-score, and support
  precision, recall, f1, _ = sk.metrics.precision_recall_fscore_support(labels, preds, average="macro")

  # Calculate F1-score for the minority class (label = 1)
  f1_minoritaria= f1_score(labels, preds, pos_label=1)

  # Calculate F1-score for the majority class (label = 0)
  f1_mayoritaria = f1_score(labels, preds, pos_label=0)

  # Calculate accuracy
  acc = sk.metrics.accuracy_score(labels, preds)

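  # Calculate ROC AUC from the hard (argmax) predictions; with binary labels
  # this value equals balanced accuracy
  AUC = roc_auc_score(labels, preds)

  # Calculate average precision (area under the precision-recall curve), also from hard predictions
  PREC_REC = average_precision_score(labels, preds)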

  return {
      'accuracy': acc,
      'f1': f1,
      'precision': precision,
      'recall': recall,
      'AUC': AUC,
      'f1_minoritaria': f1_minoritaria,
      'f1_mayoritaria': f1_mayoritaria,
      'PREC_REC': PREC_REC
  }
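
> The `AUC` entry above is computed from argmax predictions. As a point of comparison, the plain-Python sketch below shows that ROC AUC on hard 0/1 predictions coincides with balanced accuracy, while the same metric on class probabilities can differ. The helpers `roc_auc` and `balanced_accuracy` and the scores are hypothetical and not part of the notebook.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(labels, preds):
    """Mean of the recall on class 1 and the recall on class 0."""
    tpr = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1) / labels.count(1)
    tnr = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0) / labels.count(0)
    return (tpr + tnr) / 2

labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.6, 0.7]  # hypothetical P(class 1)
preds = [int(p >= 0.5) for p in probs]             # hard predictions

auc_hard = roc_auc(labels, preds)   # AUC from 0/1 predictions
auc_soft = roc_auc(labels, probs)   # AUC from probabilities
bal_acc = balanced_accuracy(labels, preds)

print(round(auc_hard, 4), round(bal_acc, 4), round(auc_soft, 4))
```

> Readers who want a threshold-free AUC could pass class probabilities instead of `preds`; the reported numbers in this notebook use the hard predictions.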


In [None]:
# Define the number of labels in the dataset
n_labels = 2  # Binary classification task (0 or 1)


In [None]:
# Extract the model name from the model_checkpoint path
model_name = model_checkpoint.split("/")[-1]
model_name

'xlm-roberta-base'

## Fine-tuning the Model

>In this phase, we fine-tune the pre-trained model on our specific task of detecting non-traumatic grief in Twitter messages related to COVID-19.

>We initialize the model with the pre-trained weights and add a classification layer on top to tailor it to our binary classification task. Hyperparameters such as batch size, learning rate, and number of training epochs are set in the hyperparameter cell above.

>Validation metrics are computed after every epoch, and early stopping with a patience of 3 evaluations is used to limit overfitting. We report accuracy, macro-averaged precision, recall, and F1, per-class F1, area under the ROC curve (AUC-ROC), and average precision; AUC-ROC selects the best checkpoint.

>Finally, the fine-tuned model is saved for future use and deployment.


In [None]:
# Set the number of training samples and evaluation samples
num_train_samples = len(encoded_dataset_train)
num_evaluation = len(encoded_dataset_valid)

# Calculate the logging steps (at least 1)
value = len(encoded_dataset_train) // (2 * batch_size * num_train_epochs)
logging_steps = max(1, value)

# List of optimizer options
optim = ["adamw_hf", "adamw_torch", "adamw_apex_fused","adafactor","adamw_torch_xla"]

# Define the training arguments for the Trainer
training_args = TrainingArguments(
    output_dir='results',
    num_train_epochs=num_train_epochs,
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model='AUC',  # Choose the metric for the best model
    weight_decay=weight_decay,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=3,
    optim=optim[1],  # Choose the optimizer option
    push_to_hub=False,  # Set to True if you want to push the model to the Hugging Face Model Hub
)

In [None]:
def model_init():
    # Initialize the model for sequence classification
    return AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint,
        num_labels=n_labels,
        output_attentions=False,  # Whether the model returns attention weights.
        output_hidden_states=False,
        return_dict=True
    )

# Initialize the Trainer with specified parameters
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    train_dataset=encoded_dataset_train,
    eval_dataset=encoded_dataset_valid,
    tokenizer=tokenizer
)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
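
> The early stopping rule configured above can be sketched as follows. This is a simplified model of the behaviour (it ignores the optional improvement threshold), and both the helper `epochs_run` and the metric values are hypothetical.

```python
def epochs_run(metric_per_epoch, patience=3):
    """Return (epoch at which training stops, best metric seen); higher is better."""
    best = float("-inf")
    bad_evals = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best, bad_evals = value, 0   # improvement: reset the counter
        else:
            bad_evals += 1               # no improvement at this evaluation
        if bad_evals >= patience:
            return epoch, best
    return len(metric_per_epoch), best

# Hypothetical validation AUC values, one per epoch
auc_history = [0.70, 0.78, 0.81, 0.80, 0.79, 0.81, 0.77, 0.76, 0.75, 0.74]
stopped_at, best_auc = epochs_run(auc_history, patience=3)

print(stopped_at, best_auc)  # 6 0.81
```

> In this toy run the best value is reached at epoch 3 and training stops at epoch 6; with `load_best_model_at_end=True`, the checkpoint from the best epoch is the one restored.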


In [None]:
# trainer.train() initiates the training process of the model.
# During this phase, the model's parameters are adjusted based on the training dataset
# to minimize the defined loss function.
trainer.train()


## Model Evaluation

> To assess the performance of our fine-tuned model, we conduct a comprehensive evaluation using various metrics. These metrics provide insights into how well the model generalizes to unseen data and its ability to accurately detect non-traumatic grief in COVID-19-related Twitter messages.

In [None]:
# Evaluate the model using the validation dataset. This generates metrics for performance evaluation.
# The result is stored in eval_results to avoid shadowing Python's built-in eval().
eval_results = trainer.evaluate()

# Convert the evaluation results to a pandas DataFrame for better readability and handling.
dfeval = pd.DataFrame(list(eval_results.items()), columns=['Name', 'Value_Validation'])
dfeval

In [None]:
trainer.save_model('path_to_saved_model_here')

## Model Test
>This section evaluates the fine-tuned model on the test dataset. It computes several evaluation metrics, prints a summary of the results, and reports the model checkpoint and training hyperparameters.

In [None]:
model_path ='path_to_saved_model_here'
model = AutoModelForSequenceClassification.from_pretrained(model_path)

In [None]:
# Apply the same text cleaning functions
test_df['text'] = test_df['text'].str.lower()

# Convert the DataFrames to Dataset objects
test_dataset = Dataset.from_pandas(test_df)

# Apply label mapping only when evaluating a labeled test set
# Convert label to 'labels' and give it numeric format
test_dataset = test_dataset.map(set_labels)


In [None]:
# Initialize a pipeline for text classification using the fine-tuned model
# (GPU 0 if available, otherwise CPU)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1)

In [None]:
# Define a function to get predictions from the pipeline
def get_predictions(records):
  result = pipe(records['text'], truncation=True)
  pred_label = result[0]['label']
  score_label = result[0]['score']

  if pred_label == 'LABEL_0':
    pred_label = 0
  else:
    pred_label = 1

  return {'pred_label': pred_label, 'score_label': score_label}

# Apply the get_predictions function to the test dataset
test_dataset_predicted = test_dataset.map(get_predictions)

# Set the format of the datasets to pandas for easy inspection
test_dataset_predicted.set_format('pandas')
df = test_dataset_predicted[:]
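
> The pipeline returns the predicted class name together with the score of that class. The sketch below converts such a result into a class id and the probability of class 1. It assumes the default `LABEL_k` naming and a two-class softmax output; the helper `to_prediction` and the example outputs are hypothetical.

```python
def to_prediction(result):
    """Convert one pipeline result dict into (class id, probability of class 1)."""
    class_id = int(result["label"].rsplit("_", 1)[-1])  # "LABEL_1" -> 1
    # The score belongs to the predicted class, so flip it when class 0 was predicted
    prob_class_1 = result["score"] if class_id == 1 else 1.0 - result["score"]
    return class_id, prob_class_1

# Hypothetical pipeline outputs for two tweets
outputs = [
    {"label": "LABEL_1", "score": 0.93},
    {"label": "LABEL_0", "score": 0.80},
]
parsed = [to_prediction(r) for r in outputs]

for class_id, prob in parsed:
    print(class_id, round(prob, 2))
```

> Note that `score_label` in the cell above stores the score of the predicted class, not the probability of class 1.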



In [None]:
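# Define a function to compute evaluation metrics from a [predictions, labels] pair.
# Note: this reuses the name compute_metrics, so it replaces the Trainer version defined earlier.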
def compute_metrics(pred):
  labels = pred[1]
  preds = pred[0]
  precision, recall, f1, _ = sk.metrics.precision_recall_fscore_support(labels, preds, average="macro")
  f1_minoritaria= f1_score(labels, preds, pos_label=1)
  f1_mayoritaria = f1_score(labels, preds, pos_label=0)
  acc = sk.metrics.accuracy_score(labels, preds)
  AUC = roc_auc_score(labels, preds)
  PREC_REC = average_precision_score(labels, preds)
  return { 'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall, 'AUC': AUC ,
           'f1_minoritaria': f1_minoritaria, 'f1_mayoritaria': f1_mayoritaria, 'PREC_REC': PREC_REC}

In [None]:
# Convert the pandas series to python lists for computing metrics
test_labels = test_df['label'].values.tolist()
test_predictions = df['pred_label'].values.tolist()
eval_pred = [test_predictions, test_labels]




In [None]:
# Print model information and training hyperparameters
print("*********************************")
print("*********************************")
print("model : ", model_checkpoint)
print("batch size:", batch_size)
print("max_len :", MAX_LEN)
print("learning_rate :", learning_rate)
print("weight_decay :", weight_decay)
print("num_train_epochs :", num_train_epochs)

In [None]:
# Test report
p = compute_metrics(eval_pred)

dftest = pd.DataFrame([[key, p[key]] for key in p.keys()], columns=['Name', 'Value'])
print(dftest)



# Print classification report
print(classification_report(test_labels, test_predictions))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(test_labels, test_predictions))

# Print AUC and PREC_REC
print(f'AUC: {roc_auc_score(test_labels, test_predictions)}')
print(f'PREC_REC: {average_precision_score(test_labels, test_predictions)}')
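
> To read the confusion matrix printed above, recall that rows correspond to true classes and columns to predicted classes, with class 0 first. The sketch below rebuilds that layout in plain Python on hypothetical labels and derives precision and recall for class 1; `confusion_matrix_2x2` is an illustrative helper, not part of the notebook.

```python
def confusion_matrix_2x2(labels, preds):
    """Rows are true classes, columns are predicted classes (class 0 first)."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(labels, preds):
        matrix[y][p] += 1
    return matrix

# Hypothetical labels and predictions
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 1, 1]

matrix = confusion_matrix_2x2(labels, preds)
(tn, fp), (fn, tp) = matrix

precision_1 = tp / (tp + fp)  # share of predicted class 1 that is correct
recall_1 = tp / (tp + fn)     # share of true class 1 that is found

print(matrix)  # [[3, 2], [1, 2]]
print(precision_1, round(recall_1, 4))
```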