<img src="https://cdn.comet.ml/img/notebook_logo.png">

[Hugging Face](https://huggingface.co/docs) is a community and data science platform that provides tools for building, training, and deploying ML models based on open source code and technologies. Primarily known for its `transformers` library, Hugging Face has helped democratize access to these models by providing a unified API to train and evaluate a number of popular NLP models.

Comet integrates with Hugging Face's `Trainer` object, allowing you to log your model parameters, metrics, and assets such as model checkpoints. Learn more about our integration [here](https://www.comet.com/docs/v2/integrations/ml-frameworks/huggingface/).

Curious about how Comet can help you build better models, faster? Find out more about [Comet](https://www.comet.com/site/products/ml-experiment-tracking/?utm_campaign=transformers&utm_medium=colab) and our [other integrations](https://www.comet.ml/docs/v2/integrations/overview/)


Get a preview of what's to come. Check out a completed experiment created from this notebook [here](https://www.comet.com/examples/comet-examples-transformers-trainer/3992ddee441f446bbb65c3cc4c8bd33b).

# Install Comet and Dependencies

In [1]:
# %pip install comet_ml torch datasets transformers scikit-learn

# Initialize Comet

In [2]:
import comet_ml

comet_ml.init(project_name="comet-examples-transfomers-trainer")

COMET INFO: Comet API key is valid


# Set Model Type

In [3]:
PRE_TRAINED_MODEL_NAME = "distilbert-base-uncased"
SEED = 42

# Load Data

In [6]:
from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

raw_datasets = load_dataset("imdb")

Downloading builder script: 4.31kB [00:00, 669kB/s]                    
Downloading metadata: 2.17kB [00:00, 586kB/s]                    


Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /Users/toon.weyens/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading data: 100%|██████████| 84.1M/84.1M [00:14<00:00, 5.77MB/s]
                                                                                             

Dataset imdb downloaded and prepared to /Users/toon.weyens/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 241.38it/s]


# Setup Tokenizer

In [7]:
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 8.90kB/s]
Downloading config.json: 100%|██████████| 483/483 [00:00<00:00, 127kB/s]
Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 725kB/s] 
Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 908kB/s] 


In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

100%|██████████| 25/25 [00:11<00:00,  2.16ba/s]
100%|██████████| 25/25 [00:11<00:00,  2.17ba/s]
100%|██████████| 50/50 [00:24<00:00,  2.07ba/s]
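
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch on plain lists of token ids. The helper `pad_or_truncate`, the pad id of 0, and the maximum length of 6 are illustrative assumptions, not values taken from the DistilBERT tokenizer. Real tokenizers also handle special tokens during truncation, which this simplified version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]  # truncation: drop tokens past max_length
    padding = [pad_id] * (max_length - len(kept))  # padding: fill the remainder
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)  # 0 marks padding
    return input_ids, attention_mask


# A short sequence gets padded; a long one gets cut.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Either way, every example comes out with the same length, which is what lets them be stacked into a single batch.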


In [9]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
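
`DataCollatorWithPadding` pads each batch to the length of its longest member when the batch is assembled. A rough stdlib-only sketch of that idea follows; `pad_batch` and the pad id of 0 are hypothetical, and the real collator also builds tensors and attention masks. Because `tokenize_function` above already pads every example to the maximum length, the collator should have no extra padding to add in this notebook.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Two toy sequences of different lengths (token ids are made up).
batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
padded = pad_batch(batch)
print(padded)
```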

# Create Sample Datasets

For this guide, we sample only 200 examples from each of the training and test splits.

In [10]:
train_dataset = tokenized_datasets["train"].shuffle(seed=SEED).select(range(200))
eval_dataset = tokenized_datasets["test"].shuffle(seed=SEED).select(range(200))
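
The shuffle-then-select pattern above yields a fixed, reproducible subset as long as the seed is unchanged. The following is a minimal stdlib-only sketch of that idea. `sample_indices` is a hypothetical helper, and it uses Python's `random` module rather than the generator used inside `datasets`, so the indices it picks will differ from the notebook's.

```python
import random

SEED = 42


def sample_indices(num_rows, sample_size, seed):
    """Return `sample_size` row indices from a seeded shuffle of range(num_rows)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the order is reproducible
    return indices[:sample_size]  # analogous to .select(range(sample_size))


first = sample_indices(25_000, 200, SEED)
second = sample_indices(25_000, 200, SEED)
other = sample_indices(25_000, 200, SEED + 1)
print(first == second)  # same seed, same subset
print(first == other)   # a different seed gives a different subset
```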

# Setup Transformer Model

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, num_labels=2
)

Downloading pytorch_model.bin: 100%|██████████| 256M/256M [00:06<00:00, 41.5MB/s] 
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased 

# Setup Evaluation Function

In [12]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def get_example(index):
    return eval_dataset[index]["text"]


def compute_metrics(pred):
    experiment = comet_ml.get_global_experiment()

    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro"
    )
    acc = accuracy_score(labels, preds)

    if experiment:
        epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
        experiment.set_epoch(epoch)
        experiment.log_confusion_matrix(
            y_true=labels,
            y_predicted=preds,
            file_name=f"confusion-matrix-epoch-{epoch}.json",
            labels=["negative", "positive"],
            index_to_example_function=get_example,
        )

    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
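
For reference, the macro-averaged scores returned by `compute_metrics` can be reproduced by hand: compute precision, recall, and F1 per class, then take the unweighted mean over classes. The sketch below does this in plain Python. `macro_scores` and the toy labels are made up for illustration, and the sketch assumes a score of zero whenever a denominator is empty.

```python
def macro_scores(labels, preds, classes=(0, 1)):
    """Return (precision, recall, f1), each an unweighted mean over classes."""
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        pairs = list(zip(labels, preds))
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = sum(1 for y, p in pairs if y == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: 0 = negative, 1 = positive.
labels = [0, 0, 1, 1, 1, 1]
preds = [0, 1, 1, 1, 1, 0]
precision, recall, f1 = macro_scores(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(precision, recall, f1, accuracy)
```

Note that macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, which is why `eval_f1` can fall below both of the other two values.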

# Run Training

To enable logging from the Hugging Face `Trainer`, set the `COMET_MODE` environment variable to `ONLINE`. To also log assets produced during the training run as Comet Assets, set `COMET_LOG_ASSETS=TRUE`.
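
Outside a notebook, where the `%env` magic is unavailable, the same two variables can be set from Python before the `Trainer` is created. A minimal sketch using only the standard library:

```python
import os

# Same settings as the %env lines used in this notebook;
# set them before training starts so the Comet callback can read them.
os.environ["COMET_MODE"] = "ONLINE"
os.environ["COMET_LOG_ASSETS"] = "TRUE"

print(os.environ["COMET_MODE"], os.environ["COMET_LOG_ASSETS"])
```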

In [13]:
%env COMET_MODE=ONLINE
%env COMET_LOG_ASSETS=TRUE

training_args = TrainingArguments(
    seed=SEED,
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=25,
    save_strategy="steps",
    save_total_limit=10,
    save_steps=25,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 200
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 25


env: COMET_MODE=ONLINE
env: COMET_LOG_ASSETS=TRUE
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


COMET INFO: Experiment is live on comet.ml https://www.comet.com/toonweyens/comet-examples-transfomers-trainer/a5b73b145fc847bc92739bc2dc2c94ad

Automatic Comet.ml online logging enabled
100%|██████████| 25/25 [07:33<00:00, 17.86s/it]
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8

100%|██████████| 25/25 [09:55<00:00, 17.86s/it]
Saving model checkpoint to ./results/checkpoint-25
Configuration saved in ./results/checkpoint-25/config.json


{'eval_loss': 0.6770908832550049, 'eval_accuracy': 0.575, 'eval_f1': 0.47005829358770534, 'eval_precision': 0.7090090090090091, 'eval_recall': 0.5580929487179487, 'eval_runtime': 142.6947, 'eval_samples_per_second': 1.402, 'eval_steps_per_second': 0.175, 'epoch': 1.0}


Model weights saved in ./results/checkpoint-25/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 25/25 [09:56<00:00, 17.86s/it]
Logging checkpoints. This may take time.


{'train_runtime': 602.5312, 'train_samples_per_second': 0.332, 'train_steps_per_second': 0.041, 'train_loss': 0.6857193756103516, 'epoch': 1.0}


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/toonweyens/comet-examples-transfomers-trainer/a5b73b145fc847bc92739bc2dc2c94ad
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     epoch                    : 1.0
COMET INFO:     eval_accuracy            : 0.575
COMET INFO:     eval_f1                  : 0.47005829358770534
COMET INFO:     eval_loss                : 0.6770908832550049
COMET INFO:     eval_precision           : 0.7090090090090091
COMET INFO:     eval_recall              : 0.5580929487179487
COMET INFO:     eval_runtime             : 142.6947
COMET INFO:     eval_samples_per_second  : 1.402
COMET INFO:     eval_steps_per_second    : 0.175
COMET INFO:     loss [3]                 : (0.660984218120575, 0.7102903723716736)
COMET INFO:     total_flos               : 26493479731200

TrainOutput(global_step=25, training_loss=0.6857193756103516, metrics={'train_runtime': 602.5312, 'train_samples_per_second': 0.332, 'train_steps_per_second': 0.041, 'train_loss': 0.6857193756103516, 'epoch': 1.0})
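
The figures in the training log can be checked with a little arithmetic. The sketch below recomputes the number of optimization steps from the dataset size and batch size, and the throughput values from the reported runtimes. All inputs are copied from the output above; the variable names are only illustrative, and the step formula assumes a single device.

```python
import math

num_examples = 200      # size of both the train and eval samples
batch_size = 8          # per_device_train_batch_size; eval batch size is also 8
grad_accum_steps = 1
num_epochs = 1

# One optimization step per (possibly partial) batch.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

train_runtime = 602.5312  # seconds, from 'train_runtime'
eval_runtime = 142.6947   # seconds, from 'eval_runtime'

train_samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
train_steps_per_second = round(total_steps / train_runtime, 3)
eval_steps = math.ceil(num_examples / batch_size)
eval_samples_per_second = round(num_examples / eval_runtime, 3)
eval_steps_per_second = round(eval_steps / eval_runtime, 3)

# An accuracy of 0.575 on 200 examples corresponds to this many correct predictions.
correct_predictions = round(0.575 * num_examples)

print(total_steps, train_samples_per_second, train_steps_per_second)
print(eval_samples_per_second, eval_steps_per_second, correct_predictions)
```

With `eval_steps=25` and `save_steps=25`, a 25-step run evaluates and checkpoints exactly once, at the final step, which matches the single evaluation record and the single `checkpoint-25` directory in the log.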