# Training a DeBERTa (small version) classifier for sentiment analysis of Reddit data
This notebook trains DeBERTa, a deep learning model based on the transformer architecture, with a BERT-style encoder at its heart. The model and its pre-trained weights are taken from [Hugging Face](https://huggingface.co/microsoft/deberta-v3-small). DeBERTa was originally proposed in [this paper](https://arxiv.org/abs/2006.03654). Here, we use the small model from the [microsoft/deberta-v3-small repo](https://huggingface.co/microsoft/deberta-v3-small/tree/main).

Results:
- F1 score, macro avg: 0.73
- F1 score, weighted avg: 0.78

Note (from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)):
- F1 = 2 * (precision * recall) / (precision + recall)
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label).
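
To make the difference between the averaging modes concrete, here is a minimal pure-Python sketch. The helper `f1_scores` and the toy labels are hypothetical and not part of this notebook; the notebook itself relies on `sklearn.metrics.classification_report`. The 'weighted' average, which appears in the results, is the per-label F1 averaged with each label's number of true instances as weight.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true instances per label
    per_label = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # 2*tp / (2*tp + fp + fn) is algebraically equal to
        # 2 * precision * recall / (precision + recall)
        denom = 2 * tp + fp + fn
        per_label[label] = 2 * tp / denom if denom else 0.0
        tp_total += tp
        fp_total += fp
        fn_total += fn
    macro = sum(per_label.values()) / len(labels)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    weighted = sum(per_label[l] * support[l] for l in labels) / len(y_true)
    return macro, micro, weighted


# Toy, imbalanced example: six samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1]
macro, micro, weighted = f1_scores(y_true, y_pred)
print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

In this toy case the minority class has the lower F1, so the macro average comes out below the weighted average, which is the same ordering as in the results above.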

Hyperparameters were also varied, with no real impact on the results, given below as (macro avg, weighted avg). Probing a larger parameter space may be required.
- learning rate / 10: (0.73, 0.78)
- weight decay * 10: (0.73, 0.78)

Model weights: \
The weights must be downloaded from the [microsoft/deberta-v3-small repo](https://huggingface.co/microsoft/deberta-v3-small/tree/main) and placed in the folder of execution before they can be used.

Data:\
The data is downloaded from Hugging Face when the notebook is executed.

In [None]:
# based on https://www.kaggle.com/code/tanlikesmath/feedback-prize-effectiveness-eda-deberta-baseline/notebook
# transformers taken from https://huggingface.co/microsoft/deberta-v3-small

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import utils

import torch
import torch.nn.functional
import datasets

import skmultilearn.model_selection.iterative_stratification
import sklearn.metrics

import transformers

In [None]:
torch.cuda.empty_cache()  # frees cached GPU memory; may be required after interrupting training due to bugs or user input

In [None]:
model_nm = (
    "/p/project/deepacf/maelstrom/ehlert1/deberta-v3-small"  # model repo downloaded from Hugging Face (see link above)
)

In [None]:
!ls /p/project/deepacf/maelstrom/ehlert1/deberta-v3-small

## Tokenize data with the same tokenizer used for pre-training the model

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_nm)

In [None]:
separator_token = tokenizer.sep_token
separator_token

## Load data

In [None]:
num_labels = 28
emotions = datasets.load_dataset("go_emotions", "simplified")
df_raw = pd.concat(
    [
        emotions.data["train"].table.to_pandas(),
        emotions.data["validation"].table.to_pandas(),
        emotions.data["test"].table.to_pandas(),
    ]
)
y_raw = utils.convert_df_labels(df_raw, num_labels)
df_unique = utils.remove_ambiguous_data(df_raw, y_raw)
# updated data frame shape, therefore need to recompute y labels
y_unique = utils.convert_df_labels(df_unique, num_labels)
# explanation for iterative stratification of labels http://videolectures.net/ecmlpkdd2011_tsoumakas_stratification/?q=stratification%20multi%20label
(
    indices_train,
    y_train,
    indices_test,
    y_test,
) = skmultilearn.model_selection.iterative_stratification.iterative_train_test_split(
    np.arange(df_unique.shape[0]).reshape(-1, 1), y_unique, 0.1
)
indices_train = indices_train[:, 0]
indices_test = indices_test[:, 0]

In [None]:
y_binarized = utils.binarize_labels_torch(y_unique)

In [None]:
df_unique["label"] = y_binarized

In [None]:
df_reduced = df_unique.rename(columns={"text": "inputs"})
df_reduced = df_reduced.drop(columns=["id", "labels"])

In [None]:
_dataset = datasets.Dataset.from_pandas(df_reduced)

In [None]:
def tok_func(x):
    return tokenizer(x["inputs"], truncation=False)

In [None]:
tok_map = _dataset.map(tok_func, batched=True, remove_columns="inputs")

### Convert dataset to an object readable by `transformers.Trainer`

In [None]:
dataset_training = datasets.DatasetDict(
    {
        "train": tok_map.select(indices_train),
        "test": tok_map.select(indices_test),
    }
)

In [None]:
def get_dataset(df, tok_func, train=True):
    ds = datasets.Dataset.from_pandas(df)
    to_remove = ["label"]
    tok_ds = ds.map(tok_func, batched=True, remove_columns=to_remove)
    if train:
        return datasets.DatasetDict(
            {
                "train": tok_ds.select(indices_train),
                "test": tok_ds.select(indices_test),
            }
        )
    else:
        return tok_ds

In [None]:
def try_all_gpus():
    """Return all available GPUs, or [torch.device("cpu")] if no GPU exists."""
    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    return devices if devices else [torch.device("cpu")]

## Setting hyperparameters  

In [None]:
learning_rate = 8e-5
batch_size = 8
weight_decay = 0.01
epochs = 1

In [None]:
def score(preds):
    return {
        "log loss": sklearn.metrics.log_loss(
            preds.label_ids,
            # softmax over the class dimension turns logits into probabilities
            torch.nn.functional.softmax(torch.Tensor(preds.predictions), dim=-1),
        )
    }
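
The metric above applies a softmax to the raw logits and then computes the log loss against the true labels. As a rough illustration of what that number means, the following pure-Python sketch reproduces the calculation by hand. The helpers `softmax` and `mean_log_loss` and the toy logits are hypothetical; the notebook itself uses `torch.nn.functional.softmax` and `sklearn.metrics.log_loss`, which additionally clips probabilities for numerical safety.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_log_loss(labels, logits_batch):
    """Average negative log-probability assigned to the true class."""
    losses = [
        -math.log(softmax(logits)[label])
        for label, logits in zip(labels, logits_batch)
    ]
    return sum(losses) / len(losses)


# Two toy samples: a fairly confident correct prediction and a coin flip.
labels = [0, 1]
logits_batch = [[2.0, 0.0], [0.0, 0.0]]
loss = mean_log_loss(labels, logits_batch)
print(f"log loss = {loss:.4f}")
```

A confident correct prediction contributes a loss close to 0, while an undecided prediction over two classes contributes ln(2), roughly 0.69.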

## Defining trainer object

In [None]:
num_labels = 2


def get_trainer(dds, num_labels):
    args = transformers.TrainingArguments(
        "/p/project/deepacf/maelstrom/ehlert1/output_RedditSentimentMultiLabelClassificationTransformerBaseline/",
        learning_rate=learning_rate,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        fp16=True,
        evaluation_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        num_train_epochs=epochs,
        weight_decay=weight_decay,
        report_to="none",
    )
    model = transformers.AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=num_labels)
    return transformers.Trainer(
        model,
        args,
        train_dataset=dds["train"],
        eval_dataset=dds["test"],
        tokenizer=tokenizer,
        compute_metrics=score,
    )
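
The arguments `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` mean that the learning rate first rises linearly and then decays along half a cosine wave. The following pure-Python sketch shows the general shape of such a schedule. The function `cosine_lr_with_warmup` is hypothetical and only approximates what `transformers` does internally; the exact rounding of the warmup step count is an assumption.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    # Assumption: the warmup length is the rounded-up fraction of all steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave from base_lr down to 0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

With these settings the peak learning rate of `8e-5` is reached after the first 10% of the steps and decays towards zero by the end of training.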

In [None]:
try_all_gpus()

## Training

In [None]:
trainer = get_trainer(dataset_training, num_labels)
trainer.train()

## Validation

In [None]:
test_ds = get_dataset(df_reduced.iloc[indices_test], tok_func, train=False)

In [None]:
preds = torch.nn.functional.softmax(torch.Tensor(trainer.predict(test_ds).predictions), dim=-1).numpy()
preds

In [None]:
def convert_pytorch_indices_to_scikitlearn(y):
    """One-hot encode binary class indices (0 or 1) into an (n_samples, 2) array."""
    y_new = np.zeros((y.shape[0], 2))
    y_new[y == 1, 1] = 1
    y_new[y == 0, 0] = 1
    return y_new

In [None]:
false_positive_rate = dict()
true_positive_rate = dict()
roc_auc = dict()
for i in range(num_labels):
    (
        false_positive_rate[i],
        true_positive_rate[i],
        _,
    ) = sklearn.metrics.roc_curve(
        convert_pytorch_indices_to_scikitlearn(df_reduced.iloc[indices_test].label.values)[:, i],
        preds[:, i],
    )
    roc_auc[i] = sklearn.metrics.auc(false_positive_rate[i], true_positive_rate[i])
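
The area under the ROC curve computed above has a simple interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following pure-Python sketch computes it that way. The helper `pairwise_roc_auc` and the toy scores are hypothetical; the notebook itself uses `sklearn.metrics.roc_curve` and `sklearn.metrics.auc`, which integrate the curve with the trapezoidal rule and give the same value.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one positive (0.35) is ranked below one negative (0.4).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_roc_auc(y_true, scores)
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75; a value of 0.5 corresponds to the dashed diagonal of a random classifier.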

In [None]:
plt.figure()
lw = 2
for i in range(num_labels):
    plt.plot(
        false_positive_rate[i],
        true_positive_rate[i],
        lw=lw,
        label="ROC curve (area = %0.2f) for %i" % (roc_auc[i], i),
    )

plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()

In [None]:
print(
    sklearn.metrics.classification_report(
        df_reduced.iloc[indices_test].label.values,
        preds.argmax(-1),
        target_names=["emotional", "neutral"],
    )
)

In [None]:
cm = sklearn.metrics.confusion_matrix(df_reduced.iloc[indices_test].label.values, preds.argmax(-1))
disp = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["emotional", "neutral"])
disp.plot()
ax = plt.gca()
ax.tick_params(axis="x", labelrotation=45)