# Toxicity Target Type Identification

![SageMaker](https://img.shields.io/badge/SageMaker-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

This notebook is part of the [ToChiquinho](https://dougtrajano.github.io/ToChiquinho/) project. It trains a model to classify the target type of a targeted toxic comment.

We use the [OLID-BR](https://dougtrajano.github.io/olid-br/) dataset as training data. The possible target types are:

- `IND`: Individual
- `GRP`: Group
- `OTH`: Other
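
As a small illustration of how these three classes can be represented for a classifier, the sketch below builds the name-to-id and id-to-name mappings. The ids mirror the mapping applied during data preparation later in this notebook (`IND` to 0, `GRP` to 1, `OTH` to 2); the names `LABELS`, `label2id`, `id2label`, and `encode_label` are illustrative and not part of the project code.

```python
# Target-type labels in the order used for the integer ids
# (IND -> 0, GRP -> 1, OTH -> 2).
LABELS = ["IND", "GRP", "OTH"]

label2id = {name: idx for idx, name in enumerate(LABELS)}
id2label = {idx: name for name, idx in label2id.items()}


def encode_label(name: str) -> int:
    """Return the integer id for a label name, rejecting unknown values."""
    if name not in label2id:
        raise ValueError(f"Invalid label: {name}")
    return label2id[name]
```

Keeping both directions of the mapping in one place makes it easier to turn model predictions back into readable label names.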

The model is trained using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).

- [Setup](#Setup)
- [Prepare the data](#Prepare-the-data)
  - [Uploading the data to S3](#Uploading-the-data-to-S3)
- [Training process](#Training-process)
  - [Define the estimator](#Define-the-estimator)
  - [Hyperparameter tuning](#Hyperparameter-tuning)
- [Documentation](#Documentation)

## Setup

In this section, we will import the necessary libraries and set up the environment.

In the cells below, you can change the parameters to fit your needs. Credentials and endpoints are read from the `../.env` file.

In [2]:
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [3]:
import os
from ml.arguments import NotebookArguments
from ml.utils import remove_checkpoints

params = NotebookArguments(
    mlflow_tracking_uri=os.environ.get("MLFLOW_TRACKING_URI"),
    mlflow_experiment_name="toxicity-target-type-identification",
    mlflow_tracking_username=os.environ.get("MLFLOW_TRACKING_USERNAME"),
    mlflow_tracking_password=os.environ.get("MLFLOW_TRACKING_PASSWORD"),
    mlflow_tags={
        "project": "ToChiquinho",
        "dataset": "OLID-BR",
        "model_type": "bert",
        "problem_type": "multi_class_classification"
    },
    sagemaker_execution_role_arn=os.environ.get("SAGEMAKER_EXECUTION_ROLE_ARN"),
    aws_profile_name=os.environ.get("AWS_PROFILE")
)

params

Parameters(num_train_epochs=10, early_stopping_patience=2, batch_size=8, validation_split=0.2, seed=1993, mlflow_experiment_name='toxicity-target-type-identification', mlflow_tags='{"project": "ToChiquinho", "dataset": "OLID-BR", "model_type": "bert", "problem_type": "multi_class_classification"}')

In [4]:
import boto3
import sagemaker

sagemaker_session = sagemaker.Session(
    boto_session=boto3.Session(profile_name=params.aws_profile_name)
)

bucket_name = sagemaker_session.default_bucket()
prefix = f"ToChiquinho/{params.mlflow_experiment_name}"

if params.sagemaker_execution_role_arn is None:
    params.sagemaker_execution_role_arn = sagemaker.get_execution_role(sagemaker_session)

## Prepare the data

In this section, we will prepare the data to be used in the training process.

We will download the OLID-BR dataset from [HuggingFace Datasets](https://huggingface.co/datasets/olidbr), process it, and upload it to S3 for the training job.

In [6]:
from datasets import load_dataset

dataset = load_dataset("dougtrajano/olid-br")


In [7]:
import datasets
from typing import Union

def prepare_dataset(
    dataset: Union[datasets.Dataset, datasets.DatasetDict],
    test_size: float = 0.2,
    seed: int = 42
) -> Union[datasets.Dataset, datasets.DatasetDict]:

    # Keep only offensive ("OFF"), targeted ("TIN") rows that have a target type
    dataset = dataset.filter(
        lambda example: example["is_offensive"] == "OFF" 
        and example["is_targeted"] == "TIN"
        and example["targeted_type"] is not None
    )

    # Keep only the "text" and "targeted_type" columns
    dataset = dataset.remove_columns(
        [
            col for col in dataset["train"].column_names if col not in ["text", "targeted_type"]
        ]
    )
    
    dataset = dataset.rename_column("targeted_type", "label")

    # Map labels to integer ids: "IND" -> 0, "GRP" -> 1, "OTH" -> 2
    def replace_labels(example):
        if example["label"] == "IND":
            example["label"] = 0
        elif example["label"] == "GRP":
            example["label"] = 1
        elif example["label"] == "OTH":
            example["label"] = 2
        else:
            raise ValueError(f"Invalid label: {example['label']}")
        return example

    dataset = dataset.map(replace_labels)

    train_dataset = dataset["train"].train_test_split(
        test_size=test_size,
        shuffle=True,
        seed=seed
    )

    dataset["train"] = train_dataset["train"]
    dataset["validation"] = train_dataset["test"]

    return dataset


dataset = prepare_dataset(
    dataset,
    test_size=params.validation_split,
    seed=params.seed
)

dataset


DatasetDict({
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 946
    })
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2269
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 568
    })
})

In [8]:
dataset.save_to_disk("data")



### Uploading the data to S3

We use the `sagemaker.Session.upload_data` method to upload the dataset to an S3 location.

The return value, `inputs`, identifies that location; we will use it later when we start the training job. In the cell below, the upload call is commented out and `inputs` is set directly to an existing S3 path.

In [9]:
# inputs = sagemaker_session.upload_data(
#     path="data",
#     bucket=bucket_name,
#     key_prefix=f"{prefix}/data"
# )

inputs = "s3://sagemaker-us-east-1-215993976552/ToChiquinho/toxicity-target-type-identification/data"

print("input spec (in this case, just an S3 path): {}".format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-215993976552/ToChiquinho/toxicity-target-type-identification/data
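
The path printed above follows the pattern `s3://<bucket>/<key_prefix>`, which is what `upload_data` is expected to return when `path` is a directory. The sketch below only illustrates how such a URI is composed; `build_s3_uri` is a hypothetical helper and `my-example-bucket` is a placeholder, neither is part of the SageMaker SDK or of this project.

```python
def build_s3_uri(bucket: str, *key_parts: str) -> str:
    """Join a bucket name and key segments into an s3:// URI."""
    # Strip stray slashes so that segments are joined by exactly one "/".
    key = "/".join(part.strip("/") for part in key_parts if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Placeholder bucket; the prefix mirrors the one defined in the Setup section.
example_prefix = "ToChiquinho/toxicity-target-type-identification"
example_uri = build_s3_uri("my-example-bucket", example_prefix, "data")
print(example_uri)
```

Building the URI from `bucket_name` and `prefix` instead of hardcoding it keeps the notebook portable across AWS accounts and regions.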


In [10]:
import shutil
shutil.rmtree("data")

## Training process

In this section, we will run the training process.

To use Amazon SageMaker to run Docker containers, we need to provide a Python script for the container to run. In our case, all the code is in the `ml` folder, including the `train.py` script.

We will start by running a hyperparameter tuning job to find the best hyperparameters for our model.

Then, we will train the model using the best hyperparameters found.
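
The project's `train.py` is not shown in this notebook. As a rough sketch of what a SageMaker entry point usually has to handle: hyperparameters arrive as `--name value` command-line arguments, and the data and model locations are exposed through `SM_*` environment variables (`SM_CHANNEL_TRAINING` for the default `training` channel, `SM_MODEL_DIR` for the model output). The function name `parse_args`, the argument names `--data_dir` and `--model_dir`, and the local fallback paths are assumptions for illustration; the defaults are taken from the parameters printed in the Setup section.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hyperparameters and SageMaker paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed by SageMaker as "--name value" pairs.
    parser.add_argument("--num_train_epochs", type=int, default=10)
    parser.add_argument("--early_stopping_patience", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--seed", type=int, default=1993)
    parser.add_argument("--eval_dataset", type=str, default="validation")
    # Data and output locations come from SM_* environment variables;
    # the fallbacks allow the script to run outside SageMaker.
    parser.add_argument(
        "--data_dir", default=os.environ.get("SM_CHANNEL_TRAINING", "data")
    )
    parser.add_argument(
        "--model_dir", default=os.environ.get("SM_MODEL_DIR", "model")
    )
    # Ignore arguments that this sketch does not model (e.g. tuned values).
    args, _unknown = parser.parse_known_args(argv)
    return args
```

For example, `parse_args(["--batch_size", "16"])` overrides the batch size and keeps the other defaults.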

In [11]:
import os
import mlflow

mlflow.start_run()

print(f"MLFlow run ID: {mlflow.active_run().info.run_id}")

MLFlow run ID: 1bf05db07db6420dba6fb3a4aedb3b51


### Define the estimator

We will use the `sagemaker.pytorch.PyTorch` class to define our estimator.

In [49]:
import logging
from sagemaker.pytorch import PyTorch

checkpoint_s3_uri = f"s3://{bucket_name}/{prefix}/checkpoints"

instance_type = "ml.g4dn.xlarge" # 4 vCPUs, 16 GB RAM, 1 x NVIDIA T4 16GB GPU - $ 0.736 per hour
# instance_type = "ml.g4dn.2xlarge" # 8 vCPUs, 32 GB RAM, 1 x NVIDIA T4 16GB GPU - $ 0.94 per hour
# instance_type = "ml.g5.xlarge" # 4 vCPUs, 16 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 1.408 per hour
# instance_type = "ml.g5.2xlarge" # 8 vCPUs, 32 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 1.515 per hour
# instance_type = "ml.g5.4xlarge" # 16 vCPUs, 64 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 2.03 per hour
# instance_type = "ml.g5.8xlarge" # 32 vCPUs, 128 GB RAM, 1 x NVIDIA A10G 24GB GPU - $ 3.06 per hour

estimator = PyTorch(
    entry_point="train.py",
    source_dir="ml",
    container_log_level=logging.DEBUG,
    role=params.sagemaker_execution_role_arn,
    sagemaker_session=sagemaker_session,
    py_version="py38",
    framework_version="1.12.0",
    instance_count=1,
    instance_type=instance_type,
    use_spot_instances=True,
    max_wait=10800,
    max_run=10800,
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path="/opt/ml/checkpoints",
    environment={
        "MLFLOW_TRACKING_URI": params.mlflow_tracking_uri,
        "MLFLOW_EXPERIMENT_NAME": params.mlflow_experiment_name,
        "MLFLOW_TRACKING_USERNAME": params.mlflow_tracking_username,
        "MLFLOW_TRACKING_PASSWORD": params.mlflow_tracking_password,
        "MLFLOW_TAGS": params.mlflow_tags,
        "MLFLOW_RUN_ID": mlflow.active_run().info.run_id,
        "MLFLOW_FLATTEN_PARAMS": "True",
        "HF_MLFLOW_LOG_ARTIFACTS": "True",
        "WANDB_DISABLED": "True"
    },
    hyperparameters={
        ## If you want to test the code, uncomment the following lines to use smaller datasets
        # "max_train_samples": 50,
        # "max_val_samples": 50,
        # "max_test_samples": 50,
        "num_train_epochs": params.num_train_epochs,
        "early_stopping_patience": params.early_stopping_patience,
        "eval_dataset": "validation",
        "batch_size": params.batch_size,
        "seed": params.seed
    },
)
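
Because the estimator uses spot instances, a training job can be interrupted and restarted. SageMaker syncs `checkpoint_local_path` with `checkpoint_s3_uri`, so a restarted job can resume if the training script looks for an existing checkpoint. The sketch below shows one way to do that lookup, assuming checkpoints are saved in folders named `checkpoint-<step>` (the Hugging Face `Trainer` convention); whether `train.py` does this is not shown here, and `find_last_checkpoint` is a hypothetical helper.

```python
import os
import re
from typing import Optional

# Matches folder names such as "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint folder, if any."""
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # Compare by step number, not alphabetically ("100" > "20").
    return os.path.join(checkpoint_dir, max(steps)[1])
```

A training script could call `find_last_checkpoint("/opt/ml/checkpoints")` at startup and resume from the returned path when it is not `None`.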

To test the training job before hyperparameter tuning, we can run it once with a small number of samples by uncommenting the `max_*_samples` hyperparameters above.

In [48]:
estimator.fit(inputs, wait=False)

### Hyperparameter tuning

We will use the `sagemaker.tuner.HyperparameterTuner` class to run a hyperparameter tuning process.

We use MLflow to track the training process, so we can analyze the results through the MLflow UI.

#### Workaround for boto/boto3 issue #3488

Because of the issue [Estimator.environment not using in SageMaker.Client.create_hyper_parameter_tuning_job() · Issue #3488 · boto/boto3](https://github.com/boto/boto3/issues/3488), the estimator's environment variables do not reach the tuning jobs. As a workaround, we copy them into the `hyperparameters` parameter (all except `MLFLOW_TAGS`).

In [50]:
for k, v in estimator.environment.items():
    if k != "MLFLOW_TAGS":
        estimator._hyperparameters[k] = v
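
With this workaround, the settings reach the container as hyperparameters instead of environment variables, so the training script has to move them back. How `train.py` handles this is not shown in this notebook; the sketch below is one possible approach under that assumption. The helper `export_env_settings` and the prefix list `FORWARDED_PREFIXES` are hypothetical names chosen to match the keys set in `estimator.environment`.

```python
import os
from typing import Dict, MutableMapping

# Prefixes of the settings forwarded as hyperparameters (assumed from the
# keys used in the estimator's environment).
FORWARDED_PREFIXES = ("MLFLOW_", "HF_MLFLOW_", "WANDB_")


def export_env_settings(
    hyperparameters: Dict[str, str],
    environ: MutableMapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Copy forwarded settings into the environment and return the rest."""
    remaining = {}
    for key, value in hyperparameters.items():
        if key.startswith(FORWARDED_PREFIXES):
            # Keep a value the container already provides.
            environ.setdefault(key, str(value))
        else:
            remaining[key] = value
    return remaining
```

The returned dictionary contains only the real hyperparameters, so they can be passed on to the training code without the tracking settings mixed in.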

In [51]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

estimator._hyperparameters.pop("max_train_samples", None)
estimator._hyperparameters.pop("max_val_samples", None)
estimator._hyperparameters.pop("max_test_samples", None)

tuner = HyperparameterTuner(
    estimator,
    max_jobs=18,
    max_parallel_jobs=3,
    objective_type="Maximize",
    objective_metric_name="eval_f1",
    metric_definitions=[
        {
            "Name": "eval_f1",
            "Regex": "eval_f1_weighted: ([0-9\\.]+)"
        }
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-3),
        "weight_decay": ContinuousParameter(0.0, 0.1),
        "adam_beta1": ContinuousParameter(0.8, 0.999),
        "adam_beta2": ContinuousParameter(0.8, 0.999),
        "adam_epsilon": ContinuousParameter(1e-8, 1e-6),
        "label_smoothing_factor": ContinuousParameter(0.0, 0.1),
        "optim": CategoricalParameter(
            [
                "adamw_hf",
                "adamw_torch",
                "adamw_apex_fused",
                "adafactor"
            ]
        )
    }
)
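
The tuner reads the objective metric from the training logs by applying the regular expression in `metric_definitions`. The sketch below checks that pattern against a sample log line with Python's `re` module. The log line is made up for illustration; the real format depends on how `train.py` logs its metrics.

```python
import re

# Same pattern as in metric_definitions above.
pattern = re.compile("eval_f1_weighted: ([0-9\\.]+)")

# Hypothetical log line; the actual training logs may look different.
log_line = "epoch 3 - eval_loss: 0.5123 - eval_f1_weighted: 0.8317 - eval_accuracy: 0.84"

match = pattern.search(log_line)
eval_f1 = float(match.group(1)) if match else None
print(eval_f1)
```

Note that the pattern expects the metric name to be followed directly by a colon and a space. A line that prints a Python dictionary, such as `{'eval_f1_weighted': 0.83}`, would not match because of the closing quote before the colon.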

In [52]:
tuner.fit(inputs, wait=False)

params.sagemaker_tuning_job_name = tuner.latest_tuning_job.name

print(f"SageMaker tuning job name: {params.sagemaker_tuning_job_name}")

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


SageMaker tuning job name: pytorch-training-221209-0011


In [53]:
mlflow.log_params(
    {
        "instance_type": estimator.instance_type,
        "instance_count": estimator.instance_count,
        "early_stopping_patience": params.early_stopping_patience,
        "num_train_epochs": params.num_train_epochs,
        "batch_size": params.batch_size,
        "seed": params.seed
    }
)

In [None]:
import pandas as pd

tuner_metrics: pd.DataFrame = sagemaker.HyperparameterTuningJobAnalytics(
    hyperparameter_tuning_job_name=params.sagemaker_tuning_job_name,
    sagemaker_session=sagemaker_session
).dataframe()

tuner_metrics.sort_values("FinalObjectiveValue", ascending=False, inplace=True)
tuner_metrics[["TrainingJobName", "FinalObjectiveValue", "TrainingJobStatus"]]

The results are already sorted by `FinalObjectiveValue` in descending order, so the first row holds the best hyperparameters found.

In [None]:
best_job = tuner_metrics.iloc[0]
best_job.to_dict()
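
Taking the first row assumes that the top-ranked job finished successfully. As a more defensive variant, the sketch below picks the best job only among completed ones with a valid objective value. It works on plain dictionaries that use the same column names as the analytics dataframe; `pick_best_job` is a hypothetical helper, not part of the SageMaker SDK.

```python
import math
from typing import Dict, List, Optional


def pick_best_job(jobs: List[Dict], maximize: bool = True) -> Optional[Dict]:
    """Return the completed job with the best objective value, or None."""
    candidates = [
        job
        for job in jobs
        if job.get("TrainingJobStatus") == "Completed"
        and job.get("FinalObjectiveValue") is not None
        and not math.isnan(job["FinalObjectiveValue"])
    ]
    if not candidates:
        return None
    # Use max for "Maximize" objectives and min for "Minimize" ones.
    chooser = max if maximize else min
    return chooser(candidates, key=lambda job: job["FinalObjectiveValue"])
```

With the dataframe from the previous cell, this could be called as `pick_best_job(tuner_metrics.to_dict("records"))`.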

In [None]:
mlflow.log_params(
    {
        "best_adam_beta1": best_job["adam_beta1"],
        "best_adam_beta2": best_job["adam_beta2"],
        "best_adam_epsilon": best_job["adam_epsilon"],
        "best_learning_rate": best_job["learning_rate"],
        "best_weight_decay": best_job["weight_decay"],
        "best_label_smoothing_factor": best_job["label_smoothing_factor"],
        "best_optim": best_job["optim"]
    }
)

In [None]:
remove_checkpoints(
    bucket_name=bucket_name,
    checkpoint_prefix=f"{prefix}/checkpoints",
    aws_profile_name=params.aws_profile_name
)

mlflow.end_run()

## Documentation

- [Estimators — sagemaker documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)
- [HyperparameterTuner — sagemaker documentation](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html)
- [Configure and Launch a Hyperparameter Tuning Job - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html)
- [Managed Spot Training in Amazon SageMaker - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)