# Toxicity Type Detection

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DougTrajano/ToChiquinho/blob/toxicity_type_detection/notebooks/toxicity_type_detection_colab.ipynb)

In this notebook, we are going to fine-tune BERT to predict possible toxicity labels for a given text.

## Set-up environment

First, we install the libraries we'll use: HuggingFace Transformers and Datasets, plus MLflow for experiment tracking and Optuna for hyperparameter search.

In [21]:
!pip install -q transformers datasets mlflow optuna


In [1]:
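import os

# Placeholders: replace each None with your own value as a string before running.
# os.environ only accepts strings, so assigning None raises a TypeError.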
os.environ["KAGGLE_USERNAME"] = None
os.environ["KAGGLE_KEY"] = None

os.environ["MLFLOW_TRACKING_URI"] = None
os.environ["MLFLOW_EXPERIMENT_NAME"] = None
os.environ["MLFLOW_TRACKING_USERNAME"] = None
os.environ["MLFLOW_TRACKING_PASSWORD"] = None
os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = None
os.environ["MLFLOW_NESTED_RUN"] = None
os.environ["MLFLOW_FLATTEN_PARAMS"] = None
os.environ["WANDB_DISABLED"] = None

os.environ["AWS_ACCESS_KEY_ID"] = None
os.environ["AWS_SECRET_ACCESS_KEY"] = None

## Parameters

In [2]:
from dataclasses import dataclass, field

@dataclass
class Parameters:
    max_seq_length: int = field(
        default=512,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        }
    )
    
    model_name: str = field(
        default="neuralmind/bert-base-portuguese-cased",
        metadata={
            "help": "The name of the model to use. It must be a model name or a path to a directory containing model weights."
        }
    )

    num_train_epochs: int = field(
        default=5,
        metadata={
            "help": "The number of epochs to train the model. An epoch is an iteration over the entire training set."
        }
    )

    num_train_epochs_per_child: int = field(
        default=1,
        metadata={
            "help": "The number of epochs to train each hyperparameter search trial (child run)."
        }
    )

    batch_size: int = field(
        default=4,
        metadata={
            "help": "The batch size to use for training and evaluation."
        }
    )

    validation_split: float = field(
        default=0.2,
        metadata={
            "help": "The fraction of the training set to use as validation set."
        }
    )

    optuna_trials: int = field(
        default=20,
        metadata={
            "help": "The number of trials to run for optuna."
        }
    )
    
    seed: int = field(
        default=1993,
        metadata={
            "help": "The seed to use for random number generation."
        }
    )

params = Parameters()
params

Parameters(max_seq_length=512, model_name='neuralmind/bert-base-portuguese-cased', num_train_epochs=5, num_train_epochs_per_child=1, batch_size=4, validation_split=0.2, optuna_trials=20, seed=1993)

## Load dataset

Next, let's download the OLID-BR dataset. This dataset contains 5,710 texts in Portuguese, annotated with 5 NLP tasks.

We are going to model the toxicity type detection task, which is a multi-label classification task. The labels are:

- `health`: Whether the text contains hate speech based on health conditions such as disability, disease, etc.
- `ideology`: Indicates if the text contains hate speech based on a person's ideas or beliefs.
- `insult`: Whether the text contains insult, inflammatory, or provocative content.
- `lgbtqphobia`: Whether the text contains harmful content related to gender identity or sexual orientation.
- `other_lifestyle`: Whether the text contains hate speech related to life habits (e.g. veganism, vegetarianism, etc.).
- `physical_aspects`: Whether the text contains hate speech related to physical appearance.
- `profanity_obscene`: Whether the text contains profanity or obscene content.
- `racism`: Whether the text contains prejudiced thoughts or discriminatory actions based on differences in race/ethnicity.
- `religious_intolerance`: Whether the text contains religious intolerance. This label has only one sample, so it is dropped in `prepare_dataset()`.
- `sexism`: Whether the text contains discriminatory content based on differences in sex/gender (e.g. sexism, misogyny, etc.).
- `xenophobia`: Whether the text contains hate speech against foreigners.

As the dataset contains several tasks, we will filter the dataset to only keep the data relevant to the toxicity type detection task. We will do it in the `prepare_dataset()` function.

In [3]:
import json
import mlflow
import pandas as pd
import warnings
from typing import List, Dict, Union
from datasets import Dataset
from kaggle.api.kaggle_api_extended import KaggleApi

def download_dataset(
        output_files: Union[str, List[str]] = "train.csv",
        dataset_files: List[str] = [
            "olidbr.csv.zip",
            "train.csv",
            "test.csv",
            "train_metadata.csv",
            "test_metadata.csv",
            "additional_data.json",
            "train.json",
            "test.json"
        ]) -> Dict[str, Union[Dict, pd.DataFrame]]:
    """Download dataset from Kaggle.

    Args:
    - output_files: File name or list of file names to load and return.
    - dataset_files: List of files to be downloaded and deleted.

    Returns:
    - A dictionary with the output files as keys and the content as values.
    """
    # Accept a single file name as well as a list
    if isinstance(output_files, str):
        output_files = [output_files]

    # Download OLID-BR dataset (once) if any expected file is missing
    if not all(os.path.exists(file) for file in dataset_files):
        print("Downloading OLID-BR from Kaggle.")
        kaggle = KaggleApi()
        kaggle.authenticate()
        kaggle.dataset_download_files(dataset="olidbr", unzip=True)

    # Load data
    result = {}
    for file in output_files:
        if file.endswith(".csv"):
            result[file] = pd.read_csv(file)
        elif file.endswith(".json"):
            with open(file, "r", encoding="utf-8") as f:
                result[file] = json.load(f)
        else:
            raise ValueError(f"File {file} is not supported.")

    # Delete files
    for file in dataset_files:
        if os.path.exists(file):
            os.remove(file)

    return result

def prepare_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the dataset.

    Args:
    - df: The dataset to be preprocessed.

    Returns:
    - The preprocessed dataset.
    """
    columns = [
        "text",
        "health",
        "ideology",
        "insult",
        "lgbtqphobia",
        "other_lifestyle",
        "physical_aspects",
        "profanity_obscene",
        "racism",
        "sexism",
        "xenophobia"
    ]

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        
        # Filter only offensive comments
        df = df[df["is_offensive"] == "OFF"]

        # Remove religious_intolerance, which has only one sample
        if "religious_intolerance" in df.columns:
            df = df.drop(columns="religious_intolerance")

        # Filter only offensive comments with at least one toxicity label
        df = df.loc[df.select_dtypes("bool").sum(axis=1).ge(1)]

        # Filter only the columns of interest
        df = df[columns]
        
        return df.reset_index(drop=True)

dataset = download_dataset(["train.csv", "test.csv"])

train_df = dataset["train.csv"]
test_df = dataset["test.csv"]

del dataset

train_df = prepare_dataset(train_df)
test_df = prepare_dataset(test_df)

Downloading OLID-BR from Kaggle.


In [4]:
dataset = Dataset.from_dict(train_df.to_dict(orient="list"))

dataset = dataset.train_test_split(
    test_size=params.validation_split,
    shuffle=True,
    seed=params.seed
)

dataset["validation"] = dataset.pop("test")

# Add test dataset
dataset["test"] = Dataset.from_dict(test_df.to_dict(orient="list"))

OLID-BR was published in 2 splits: train and test. We split the train set into train and validation sets.

So, we have 3 splits: one for training, one for validation and one for testing.

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 3417
    })
    validation: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 855
    })
    test: Dataset({
        features: ['text', 'health', 'ideology', 'insult', 'lgbtqphobia', 'other_lifestyle', 'physical_aspects', 'profanity_obscene', 'racism', 'sexism', 'xenophobia'],
        num_rows: 1438
    })
})
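The split sizes above can be checked by hand. The sketch below assumes the validation size is rounded up to a whole example and the training split takes the remainder, which is consistent with the counts shown. `split_sizes` is a hypothetical helper for illustration, not part of the `datasets` API.

```python
import math
from typing import Tuple

def split_sizes(n_samples: int, validation_split: float) -> Tuple[int, int]:
    """Return (n_train, n_validation) for a fractional validation split."""
    # Assumption: the validation size is rounded up to a whole example.
    n_validation = math.ceil(n_samples * validation_split)
    n_train = n_samples - n_validation
    return n_train, n_validation

# 3417 + 855 = 4272 filtered training examples before splitting
n_train, n_validation = split_sizes(4272, 0.2)
print(n_train, n_validation)  # 3417 855
```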

In [6]:
mlflow.start_run()

mlflow.set_tag("project", "ToChiquinho")
mlflow.set_tag("model_type", "bert")
mlflow.set_tag("problem_type", "multi_label_classification")

mlflow.log_param("train_size", len(dataset["train"]))
mlflow.log_param("validation_size", len(dataset["validation"]))
mlflow.log_param("test_size", len(dataset["test"]))

1438

Let's check the first example of the training split:

In [7]:
example = dataset["train"][0]
example

{'text': 'Ela mendigava o amor💘, na falta de atenção, o mendigo não negou!!!!',
 'health': False,
 'ideology': False,
 'insult': True,
 'lgbtqphobia': False,
 'other_lifestyle': False,
 'physical_aspects': False,
 'profanity_obscene': False,
 'racism': False,
 'sexism': False,
 'xenophobia': False}

The dataset consists of texts, labeled with one or more toxicity types.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [8]:
labels = [label for label in dataset["train"].features.keys() if label not in ["text"]]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['health',
 'ideology',
 'insult',
 'lgbtqphobia',
 'other_lifestyle',
 'physical_aspects',
 'profanity_obscene',
 'racism',
 'sexism',
 'xenophobia']

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here we use the `AutoTokenizer` API, which automatically loads the appropriate tokenizer for the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape `(batch_size, num_labels)`. Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
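To make the target format concrete, here is a minimal plain-Python sketch of the label matrix for a toy batch. The helper `build_label_matrix` and the toy values are made up for illustration; the notebook's own preprocessing uses NumPy instead.

```python
from typing import Dict, List

def build_label_matrix(batch: Dict[str, List[bool]],
                       label_names: List[str]) -> List[List[float]]:
    """Turn per-label boolean columns into rows of floats, one row per example."""
    n_examples = len(batch[label_names[0]])
    return [
        [float(batch[name][i]) for name in label_names]
        for i in range(n_examples)
    ]

# Toy batch of two examples with three of the labels (values are made up)
toy_labels = ["insult", "profanity_obscene", "racism"]
toy_batch = {
    "insult": [True, False],
    "profanity_obscene": [True, False],
    "racism": [False, True],
}

matrix = build_label_matrix(toy_batch, toy_labels)
print(matrix)  # [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each row has one float per label, so a batch of two examples with three labels gives a 2 x 3 matrix.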

In [9]:
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(params.model_name)

def preprocess_data(examples, tokenizer, max_seq_length):
    # take a batch of texts
    text = examples["text"]

    # encode them
    encoding = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=max_seq_length
    )

    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}

    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(labels)))

    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()

    return encoding

In [10]:
encoded_dataset = dataset.map(
    preprocess_data,
    batched=True,
    remove_columns=dataset["train"].column_names,
    fn_kwargs={
        "tokenizer": tokenizer,
        "max_seq_length": params.max_seq_length
    }
)

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [11]:
example = encoded_dataset["train"][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [12]:
tokenizer.decode(example["input_ids"])

'[CLS] Ela mendigava o [UNK], na falta de atenção, o mendigo não negou!!!! [SEP] [PAD] [PAD] [PAD] ... [PAD]'

In [13]:
example["labels"]

[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [14]:
[id2label[idx] for idx, label in enumerate(example["labels"]) if label == 1.0]

['insult']

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html). 

In [15]:
encoded_dataset.set_format("torch")

## Define model

Here we define a model that includes a pre-trained base (i.e. the weights from `neuralmind/bert-base-portuguese-cased` are loaded), with a randomly initialized classification head (a linear layer) on top. This head should be fine-tuned, together with the pre-trained base, on a labeled dataset.

The warning printed when the model is loaded says the same.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    params.model_name,
    problem_type="multi_label_classification",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

Let's verify a batch as well as a forward pass:

In [17]:
encoded_dataset["train"][0]["labels"].type()

'torch.FloatTensor'

In [18]:
encoded_dataset["train"][0]["input_ids"]

tensor([  101,  1660,  1462,  8066,  2836,   146,   100,   117,   229,  3207,
          125,  3855,   117,   146,  1462,  8066, 22280,   346, 13682,   106,
          106,   106,   106,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0, 

In [19]:
# forward pass
outputs = model(
    input_ids=encoded_dataset["train"][0]["input_ids"].unsqueeze(0),
    labels=encoded_dataset["train"][0]["labels"].unsqueeze(0)
)

outputs

SequenceClassifierOutput(loss=tensor(0.6846, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.0490, -0.0957,  0.2244,  0.0161,  0.1982,  0.4158,  0.1154, -0.0593,
         -0.2791, -0.3272]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
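As a sanity check, the loss in the output above can be reproduced by hand from the printed logits and the example's labels. The sketch below is a plain-Python version of binary cross-entropy with logits, averaged over all labels; it assumes the default mean reduction and uses the logits as rounded in the output, so the result only matches to about four decimals.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Logits copied from the forward-pass output; only `insult` (index 2) is set
logits = [-0.0490, -0.0957, 0.2244, 0.0161, 0.1982,
          0.4158, 0.1154, -0.0593, -0.2791, -0.3272]
targets = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss = bce_with_logits(logits, targets)
print(round(loss, 4))  # 0.6846
```

A value near ln(2) ≈ 0.693 is what an untrained head with logits close to zero should produce, since every label is then predicted with probability close to 0.5.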

## Define function to compute metrics

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [20]:
import torch
from transformers import EvalPrediction
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    accuracy_score
)

def predict(predictions,
            return_proba: bool = False,
            threshold: float = 0.5):
    """Predict the labels of a batch of samples.

    Args:
    - predictions: The predictions of the model.
    - return_proba: Whether to return the probability of each label.
    - threshold: The threshold to be used to convert the logits to labels.

    Returns:
    - The predicted labels, or the per-label probabilities if `return_proba` is True.
    """
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    probs: np.ndarray = torch.sigmoid(torch.Tensor(predictions)).numpy()

    if return_proba:
        return probs

    # use threshold to turn them into 0/1 predictions
    return (probs >= threshold).astype(float)
        
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold: float = 0.5):
    """Compute the metrics for multi-label classification.

    Args:
    - predictions: The predictions of the model.
    - labels: The true labels.
    - threshold: The threshold to be used to convert the logits to labels.

    Returns:
    - A dictionary containing the metrics (accuracy, f1, roc_auc).
    """
    y_pred = predict(predictions, threshold=threshold)

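    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average="micro")
    # note: ROC AUC is computed on the thresholded 0/1 predictions here, not on probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average="micro")
    # for multi-label inputs, accuracy_score is the exact-match (subset) accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {
        "f1": f1_micro_average,
        "roc_auc": roc_auc,
        "accuracy": accuracy
    }

    return metrics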

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    return multi_label_metrics(predictions=preds, labels=p.label_ids)
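To see how these metrics behave on multi-label data, here is a small plain-Python sketch of micro-averaged F1 and exact-match accuracy on a made-up toy batch. The two helper functions are illustrative stand-ins and are not the scikit-learn implementations.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (example, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Toy data (made up): 4 examples, 3 labels
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]

print(round(micro_f1(y_true, y_pred), 4))  # 0.8333
print(subset_accuracy(y_true, y_pred))     # 0.5
```

Exact-match accuracy counts an example as wrong if even one label is off, so it can sit well below the micro F1 on the same predictions.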

## Hyperparameter Optimization

We are going to use the [Optuna](https://optuna.org/) library to perform hyperparameter optimization. Optuna is a hyperparameter optimization framework that lets you define a search space and run the optimization over it. It supports several sampling algorithms, including the Tree-structured Parzen Estimator (TPE, a Bayesian optimization method), CMA-ES, and random search.

We will use HuggingFace's [Trainer](https://huggingface.co/docs/transformers/main/main_classes/trainer) API, which supports Optuna as a hyperparameter search backend.

In [21]:
import optuna

def optuna_hp_space(trial: optuna.Trial):
    return {
        "learning_rate": trial.suggest_float(
            "learning_rate", 1e-5, 1e-3, log=True),
        "weight_decay": trial.suggest_float(
            "weight_decay", 1e-6, 1e-2, log=True),
        "adam_beta1": trial.suggest_float(
            "adam_beta1", 0.8, 0.999, log=True),
        "adam_beta2": trial.suggest_float(
            "adam_beta2", 0.8, 0.999, log=True),
        "adam_epsilon": trial.suggest_float(
            "adam_epsilon", 1e-8, 1e-6, log=True)
    }

def model_init(trial: optuna.Trial):
    return AutoModelForSequenceClassification.from_pretrained(
        params.model_name,
        problem_type="multi_label_classification",
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id
    )

def compute_objective(metrics: Dict[str, float]) -> float:
    return metrics["eval_f1"]
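The search space above uses `log=True` for every parameter, which means values are drawn on a logarithmic scale. The following is a plain-Python sketch of that idea, not Optuna's actual sampler; `sample_log_uniform` is a hypothetical helper, and the bounds are the `learning_rate` bounds from the search space.

```python
import math
import random

def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(1993)
samples = [sample_log_uniform(rng, 1e-5, 1e-3) for _ in range(10_000)]

# On a log scale, 1e-4 is the midpoint of [1e-5, 1e-3],
# so roughly half of the draws should fall below it.
share_below = sum(s < 1e-4 for s in samples) / len(samples)
print(f"{share_below:.2f}")
```

With a linear scale, by contrast, only about 9% of the draws would fall below 1e-4, so small learning rates would rarely be tried.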

Let's start training!

In [22]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-toxicity-type-portuguese",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=params.batch_size,
    per_device_eval_batch_size=params.batch_size,
    num_train_epochs=params.num_train_epochs_per_child,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    seed=params.seed
)

trainer = Trainer(
    model=None,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    model_init=model_init,
    compute_metrics=compute_metrics
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
loading configuration file config.json from cache at C:\Users\trajano/.cache\huggingface\hub\models--neuralmind--bert-base-portuguese-cased\snapshots\94d69c95f98f7d5b2a8700c420230ae10def0baa\config.json
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "health",
    "1": "ideology",
    "2": "insult",
    "3": "lgbtqphobia",
    "4": "other_lifestyle",
    "5": "physical_aspects",
    "6": "profanity_obscene",
    "7": "racism",
    "8": "sexism",
    "9": "xenophobia"
  },
  "initializer_range": 0.02,
  "intermed

In [23]:
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    compute_objective=compute_objective,
    n_trials=params.optuna_trials
)

[I 2022-11-08 01:06:56,742] A new study created in memory with name: no-name-60463b48-0461-4891-825e-63d7bb784576
Trial: {'learning_rate': 0.0008050622920756662, 'weight_decay': 0.0015002639013921464, 'adam_beta1': 0.8060724246888528, 'adam_beta2': 0.8847402551171889, 'adam_epsilon': 1.516627166910823e-07}

  0%|          | 0/855 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[W 2022-11-08 01:07:11,649] Trial 0 failed because of the following error: OutOfMemoryError('CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 4.00 GiB total capacity; 3.35 GiB already allocated; 0 bytes free; 3.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')
Traceback (most recent call last):
  File "c:\Python310\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "c:\Python310\lib\site-packages\transformers\integrations.py", line 179, in _objective
    trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "c:\Python310\li

OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 4.00 GiB total capacity; 3.35 GiB already allocated; 0 bytes free; 3.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [41]:
model = model_init(best_trial)

args = TrainingArguments(
    "bert-finetuned-toxicity-type-portuguese",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=params.batch_size,
    per_device_eval_batch_size=params.batch_size,
    num_train_epochs=params.num_train_epochs,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    seed=params.seed,
    **best_trial.hyperparameters
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--neuralmind--bert-base-portuguese-cased/snapshots/94d69c95f98f7d5b2a8700c420230ae10def0baa/config.json
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "health",
    "1": "ideology",
    "2": "insult",
    "3": "lgbtqphobia",
    "4": "other_lifestyle",
    "5": "physical_aspects",
    "6": "profanity_obscene",
    "7": "racism",
    "8": "sexism",
    "9": "xenophobia"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "health": 0,
    "ideology": 1,
    "insult": 2,
    "lgbtqphobia": 3,
    "other_lifestyle": 4,
    "physical_aspects": 5,
    "profanity_obscene": 6,
    "racism": 7,
   

In [42]:
trainer.train()

***** Running training *****
  Num examples = 3417
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 4275
  Number of trainable parameters = 108930826


MlflowException: ignored
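The step count in the training log above follows from the dataset size, batch size and number of epochs. Here is a minimal sketch of that arithmetic, assuming a single device and that the last, smaller batch of each epoch still counts as a step; `total_optimization_steps` is a hypothetical helper, not a Trainer function.

```python
import math

def total_optimization_steps(num_examples: int, batch_size: int,
                             num_epochs: int, grad_accum_steps: int = 1) -> int:
    """Estimate the number of optimizer updates for a training run."""
    # Batches per epoch; the last, smaller batch still counts
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update every `grad_accum_steps` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

# Values from the log: 3417 examples, batch size 4, 5 epochs
print(total_optimization_steps(3417, 4, 5))  # 4275
```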

## Evaluate

After training, we evaluate our model on the validation set and then on the test set.

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 855
  Batch size = 8


{'eval_loss': 0.1706046313047409,
 'eval_f1': 0.8412698412698414,
 'eval_roc_auc': 0.8933846707911347,
 'eval_accuracy': 0.5695906432748538,
 'eval_runtime': 28.4905,
 'eval_samples_per_second': 30.01,
 'eval_steps_per_second': 3.756,
 'epoch': 5.0}

In [None]:
trainer.evaluate(
    eval_dataset=encoded_dataset["test"]
)

***** Running Evaluation *****
  Num examples = 1438
  Batch size = 8


{'eval_loss': 0.17874790728092194,
 'eval_f1': 0.8307148468185388,
 'eval_roc_auc': 0.8867688862716017,
 'eval_accuracy': 0.5486787204450626,
 'eval_runtime': 48.1046,
 'eval_samples_per_second': 29.893,
 'eval_steps_per_second': 3.742,
 'epoch': 5.0}

## Inference

Let's test the model on a new sentence:

In [None]:
text = "Se vc for porco, folgado e relaxado, você não ia conseguir viver com ela mesmo. Realmente, gente escrota não ia conseguir conviver com a Jojo"

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

# inference only, so no gradients are needed
with torch.no_grad():
    outputs = trainer.model(**encoding)

The logits that come out of the model are of shape `(batch_size, num_labels)`. As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits are a tensor that contains the (unnormalized) scores for every individual label.

In [None]:
logits = outputs.logits
logits.shape

torch.Size([1, 10])

To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, so that every score becomes a number between 0 and 1 that can be interpreted as a "probability" of how certain the model is that a given label applies to the input text.

Next, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).

In [None]:
# apply sigmoid + threshold
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted id's into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['insult', 'profanity_obscene']
