# Overview
This Jupyter notebook shows how to use the PyTorch profiler to profile a Hugging Face model, export the profiling data, and enable module-wise profiling.

### Reference

1. [hf_pipeline_prof.py](https://github.com/yqhu/profiler-workshop/blob/c8d4a7c30a61cc7b909d89f88f5fd36b70c55769/hf_pipeline_prof.py) demonstrates how to export the profiling results as JSON traces and as a FlameGraph.
2. [hf_training_trainer_prof.py](https://github.com/yqhu/profiler-workshop/blob/c8d4a7c30a61cc7b909d89f88f5fd36b70c55769/hf_training_trainer_prof.py) demonstrates how to profile a Hugging Face model by registering a `TrainerCallback`.
3. [hf_training_torch_prof.py](https://github.com/yqhu/profiler-workshop/blob/c8d4a7c30a61cc7b909d89f88f5fd36b70c55769/hf_training_torch_prof.py) demonstrates how to run a Hugging Face model step by step and profile it directly with the PyTorch profiler.

In [1]:
from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
)
from transformers import Trainer, TrainingArguments, TrainerCallback
import torch
import numpy as np
import time


raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(
        example["sentence1"], example["sentence2"], truncation=True
    )


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2
)


def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    "test-trainer", evaluation_strategy="epoch", num_train_epochs=1, fp16=True
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

start = time.perf_counter()


class ProfCallback(TrainerCallback):
    """Advance the profiler schedule at the end of every training step."""

    def __init__(self, prof):
        self.prof = prof

    def on_step_end(self, args, state, control, **kwargs):
        self.prof.step()

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        skip_first=3, wait=1, warmup=1, active=2, repeat=2
    ),
    # To write TensorBoard traces to disk instead, uncomment the handler below.
    # on_trace_ready=torch.profiler.tensorboard_trace_handler("test_tensorboard"),
    profile_memory=True,
    with_stack=True,
    with_modules=True,
    # Needed so that the exported stacks are not empty; see
    # https://github.com/pytorch/pytorch/issues/100253#issuecomment-1579804477
    experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
    record_shapes=True,
) as prof:
    # Advance the profiler schedule once per training step.
    trainer.add_callback(ProfCallback(prof=prof))
    trainer.train()

print(f"training time, {(time.perf_counter() - start):.1f} s")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.386602,0.833333,0.884354


[W kineto_shim.cpp:372] Profiler is not initialized: skipping step() invocation
[W kineto_shim.cpp:372] Profiler is not initialized: skipping step() invocation
[W kineto_shim.cpp:372] Profiler is not initialized: skipping step() invocation
STAGE:2024-01-17 12:42:18 2376418:2376418 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-17 12:42:18 2376418:2376418 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-17 12:42:18 2376418:2376418 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
STAGE:2024-01-17 12:42:23 2376418:2376418 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-17 12:42:23 2376418:2376418 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-17 12:42:23 2376418:2376418 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
  metric = load_metric("glue", "mrpc")
You can avoid this message in future by passing the argument `trust_remote_code=T

training time, 33.0 s
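
The `schedule` arguments in the cell above decide which training steps are actually recorded. The sketch below restates the phase logic in plain Python, following the semantics described in the PyTorch documentation for `torch.profiler.schedule`. It is not the library's code, and the helper name `phase_for_step` is hypothetical. It may help when choosing `skip_first`, `wait`, `warmup`, `active`, and `repeat` for a short run.

```python
def phase_for_step(step, *, skip_first=0, wait=0, warmup=0, active=1, repeat=0):
    """Return the profiler phase for a zero-based step index.

    Hypothetical helper that mirrors the documented semantics of
    torch.profiler.schedule; phases are plain strings here.
    """
    if step < skip_first:
        return "none"
    step -= skip_first
    cycle = wait + warmup + active
    # repeat == 0 means the cycle keeps repeating until profiling ends.
    if repeat > 0 and step >= repeat * cycle:
        return "none"
    pos = step % cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    # The last active step of a cycle is the one that triggers on_trace_ready.
    return "record_and_save" if pos == cycle - 1 else "record"


# Settings used in this notebook.
cfg = dict(skip_first=3, wait=1, warmup=1, active=2, repeat=2)
phases = [phase_for_step(s, **cfg) for s in range(12)]
for s, p in enumerate(phases):
    print(s, p)
```

Under these assumptions, the notebook's settings record four steps in total (zero-based steps 5, 6, 9, and 10) and nothing afterwards.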


In [2]:
print(prof.key_averages(group_by_stack_n=5).table())

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  Source Location                                                              
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ---------------------------------------------------------------------------  
     

Use the following cell magic to view the results if the trace was exported with the TensorBoard handler. Note that once the TensorBoard handler has been used to export the profiling results, the profiler cannot export any further traces.
    
```python
%load_ext tensorboard
%tensorboard --logdir /home/kwu/cupy-playground/intrasm_engine/benchmark/test_tensorboard
```
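
One possible alternative to the TensorBoard handler is a custom `on_trace_ready` callback that writes a Chrome trace for each profiling cycle. The sketch below assumes, as in the PyTorch profiler documentation, that the callback receives the profiler object and that `prof.step_num` and `prof.export_chrome_trace` can be used inside it. The helper `make_trace_handler`, the `traces` directory, and the file naming are hypothetical.

```python
import os


def make_trace_handler(out_dir):
    """Build an on_trace_ready callback that writes one Chrome trace per cycle.

    Hypothetical helper: out_dir and the file naming are illustrative only.
    """
    os.makedirs(out_dir, exist_ok=True)

    def handler(prof):
        # prof.step_num is the profiler step at which the active window ended.
        path = os.path.join(out_dir, f"trace_step{prof.step_num}.json")
        prof.export_chrome_trace(path)
        return path

    return handler


# Usage (not executed here): pass the handler to the profiler in place of
# tensorboard_trace_handler, e.g. on_trace_ready=make_trace_handler("traces").
```

The handler only touches the object it is given, so it can be tried out with a stand-in object before a real training run.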

In [10]:

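# Chrome trace: can be opened in chrome://tracing or Perfetto.
prof.export_chrome_trace("chrome_trace.json")
# Stacks weighted by self CUDA time. The output is plain text in the format
# flamegraph.pl expects, despite the .json extension used here.
prof.export_stacks("stacks.json", "self_cuda_time_total")
# Use Brendan Gregg's FlameGraph tool to generate a flame graph or flame chart: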
# git clone https://github.com/brendangregg/FlameGraph
# ../FlameGraph/flamegraph.pl --title "FlameGraph" --countname "us." stacks.json > perf_viz.svg
# ../FlameGraph/flamegraph.pl --title "FlameChart" --countname "us." --flamechart stacks.json > perf_chart.svg
!/home/kwu/FlameGraph/flamegraph.pl --title "FlameGraph" --countname "us." stacks.json > perf_viz.svg
!/home/kwu/FlameGraph/flamegraph.pl --title "FlameChart" --countname "us." --flamechart stacks.json > perf_chart.svg

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6013 (pid 2407179), started 0:02:11 ago. (Use '!kill 2407179' to kill it.)
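
The file written by `export_stacks` is fed to `flamegraph.pl`, which expects the folded-stack format: one line per stack, frames separated by `;` from outermost to innermost, followed by a space and an integer count. Assuming that format, a short script can list the hottest innermost frames without rendering an SVG. The helper name `top_leaf_frames` and the sample lines are made up for illustration and are not real profiler output.

```python
from collections import Counter


def top_leaf_frames(lines, n=5):
    """Sum the metric per innermost frame of folded-stack lines."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field on the line.
        stack, _, value = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        totals[leaf] += int(value)
    return totals.most_common(n)


# Made-up sample in folded-stack format (not real profiler output).
sample = [
    "train.py(10):_step;linear.py(114):_forward 120",
    "train.py(10):_step;activation.py(50):_gelu 30",
    "train.py(12):_backward;linear.py(114):_forward 80",
]
print(top_leaf_frames(sample))

# With the file exported above:
# with open("stacks.json") as f:
#     print(top_leaf_frames(f))
```

Grouping by the innermost frame is only one choice; grouping by a prefix of the stack would give a per-module view instead.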

In [None]:
# The following code produces a trace file that is an order of magnitude larger
# and takes much longer to run. Use with caution.
if False:
    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    with torch.autograd.profiler.profile(
        with_modules=True,
        # Needed so that the exported stacks are not empty; see
        # https://github.com/pytorch/pytorch/issues/100253#issuecomment-1579804477
        experimental_config=torch._C._profiler._ExperimentalConfig(verbose=True),
    ) as prof2:
        trainer.train()
    prof2.export_chrome_trace("chrome_trace2.json")

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
STAGE:2024-01-17 12:32:35 2364529:2364529 ActivityProfilerController.cpp:312] Completed Stage: Warm Up


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.496185,0.840686,0.891122


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
STAGE:2024-01-17 12:33:09 2364529:2364529 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-17 12:33:11 2364529:2364529 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
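
Because such a trace can be large, it may be easier to summarize it from Python than to open it in a trace viewer. The sketch below assumes the exported file follows the Chrome Trace Event format: a top-level `traceEvents` list in which complete events (`"ph": "X"`) carry a `name` and a duration `dur` in microseconds. The helper `summarize_trace` is hypothetical and the sample events are made up.

```python
import json
from collections import defaultdict


def summarize_trace(trace, n=5):
    """Return the n event names with the largest total duration (microseconds)."""
    totals = defaultdict(float)
    for ev in trace.get("traceEvents", []):
        # Only complete events ("X") carry a duration.
        if ev.get("ph") == "X":
            totals[ev.get("name", "?")] += ev.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up sample in Chrome Trace Event format (not real profiler output).
sample = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "aten::addmm", "ts": 0, "dur": 40},
  {"ph": "X", "name": "aten::gelu", "ts": 50, "dur": 10},
  {"ph": "X", "name": "aten::addmm", "ts": 70, "dur": 35},
  {"ph": "i", "name": "marker", "ts": 80}
]}
""")
print(summarize_trace(sample))

# With the file from the cell above:
# with open("chrome_trace2.json") as f:
#     print(summarize_trace(json.load(f)))
```

Nested events are counted once per level in this sketch, so the totals are inclusive times rather than self times.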
