# A Practical Guide to Improving Performance: Optimizing For Throughput

This notebook is a practical guide to tuning the performance of your model on Tenstorrent hardware by increasing the input batch size. It also shows how to benchmark models on AI hardware properly by measuring compilation time separately from run time.

The tutorial will walk through an example of running the [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) model on Tenstorrent AI accelerator hardware. The model weights will be directly downloaded from the [HuggingFace library](https://huggingface.co/docs/transformers/model_doc/bert) and executed through the PyBUDA SDK.

## Guide Overview

In this guide, we will walk through the steps for running the BERT model trained on the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset for the **Text Classification** task.

You will learn how to vary the input batch size of the model to achieve higher throughput performance. You will also learn how to configure a benchmark framework for evaluating the model performance.
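As a rough illustration of the benchmarking pattern used in this guide, the sketch below times a one-off setup step separately from the run loop and reports throughput for each batch size. `compile_fn`, `run_fn`, `fake_compile` and `fake_run` are hypothetical stand-ins introduced only for this example; they are not PyBUDA calls.

```python
import time
from typing import Callable, Dict, Iterable


def benchmark(
    compile_fn: Callable[[int], None],
    run_fn: Callable[[int], int],
    batch_sizes: Iterable[int],
) -> Dict[int, Dict[str, float]]:
    """Time compilation and execution separately for each batch size.

    compile_fn(batch_size) and run_fn(batch_size) are hypothetical callables;
    run_fn must return the number of samples it processed.
    """
    results = {}
    for batch_size in batch_sizes:
        # One-off cost: measured on its own so it does not distort throughput.
        start = time.perf_counter()
        compile_fn(batch_size)
        compile_s = time.perf_counter() - start

        # Run-loop cost: the only interval used for the throughput figure.
        start = time.perf_counter()
        samples = run_fn(batch_size)
        run_s = time.perf_counter() - start

        results[batch_size] = {
            "compile_s": compile_s,
            "run_s": run_s,
            "samples_per_s": samples / run_s if run_s > 0 else float("inf"),
        }
    return results


# Dummy workloads standing in for a real compile step and inference loop.
def fake_compile(batch_size: int) -> None:
    time.sleep(0.02)


def fake_run(batch_size: int) -> int:
    time.sleep(0.01)
    return batch_size * 4  # pretend four batches were processed


report = benchmark(fake_compile, fake_run, batch_sizes=[1, 64])
for batch_size, stats in report.items():
    print(batch_size, f"{stats['samples_per_s']:.1f} samples/s")
```

Keeping the two intervals apart means the throughput figure reflects only the work that repeats for every input, while the compile cost is reported once.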

## Step 1: Import libraries

Make sure that you have an active Python environment with the latest version of PyBUDA installed.

We start by installing the `evaluate` library with pip; it will be used to calculate the accuracy metric.

In [6]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install evaluate==0.4.0

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
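To confirm what is actually installed in the active environment, a small check like the one below can help. It is a minimal sketch using only the standard library, and it assumes the distribution names match the import names used in this guide, which may not hold for every package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumed to match the import names used in this guide.
for name in ("pybuda", "evaluate", "torch", "transformers", "datasets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```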


In [7]:
# import the pybuda library and additional libraries required for this tutorial
import time
from typing import Any, Dict, List, Tuple

import evaluate
import pybuda
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer

## Step 2: Create helper classes and functions

We will create some helper classes and functions to improve code reusability throughout this tutorial.

* `SST2Dataset` -- Python Class to hold a preprocessed version of the SST2 dataset used for evaluation
* `eval_fn` -- function to compute the evaluation score

In [8]:
# Create a Dataset Class to preprocess the data
class SST2Dataset(Dataset):
    """Configurable SST-2 Dataset."""

    def __init__(self, dataset: Any, tokenizer: Any, split: str, seq_len: int):
        """
        Init and preprocess SST-2 dataset.

        Parameters
        ----------
        dataset : Any
            SST-2 dataset
        tokenizer : Any
            tokenizer object from HuggingFace
        split : str
            Which split to use i.e. ["train", "validation", "test"]
        seq_len : int
            Sequence length
        """
        self.sst2 = dataset[split]
        self.data = [
            (
                tokenizer(
                    item["sentence"],
                    return_tensors="pt",
                    max_length=seq_len,
                    padding="max_length",
                    return_token_type_ids=False,
                    truncation=True,
                ),
                item["label"],
            )
            for item in self.sst2
        ]

        for data in self.data:
            tokenized = data[0]
            for item in tokenized:
                tokenized[item] = tokenized[item].squeeze()

    def __len__(self) -> int:
        """
        Return length of dataset.

        Returns
        -------
        int
            Length of dataset
        """
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[Dict[str, torch.Tensor], int]:
        """
        Return sample from dataset.

        Parameters
        ----------
        index : int
            Index of sample

        Returns
        -------
        Tuple
            Data sample in format of X, y
        """
        X, y = self.data[index]
        return X, y
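The tokenizer call in `SST2Dataset` pads or truncates every sentence to `seq_len` tokens, so all samples end up with the same shape. The pure-Python sketch below illustrates that idea only; it is not the HuggingFace implementation, `PAD_ID = 0` is an assumed padding id, and a real tokenizer also takes care of special tokens when truncating.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding token id, for illustration only


def pad_or_truncate(token_ids: List[int], seq_len: int) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly seq_len long."""
    kept = token_ids[:seq_len]       # truncation: drop tokens beyond seq_len
    n_pad = seq_len - len(kept)      # padding needed to reach seq_len
    input_ids = kept + [PAD_ID] * n_pad
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


print(pad_or_truncate([101, 2023, 102], 5))
print(pad_or_truncate([101, 1, 2, 3, 4, 5, 102], 5))
```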

In [9]:
# Define evaluation function
def eval_fn(outputs: List[torch.Tensor], labels: List[int], metric_type: str) -> float:
    """
    Evaluation function for measuring model accuracy.

    Parameters
    ----------
    outputs : List[torch.Tensor]
        Predicted outputs from model
    labels : List[int]
        List of true labels
    metric_type : str
        Type of metric to return i.e. accuracy, recall, precision, etc.

    Returns
    -------
    float
        Evaluation score.
    """

    # set evaluation metric for dataset
    accuracy_metric = evaluate.load(metric_type)

    # initialize lists to store predictions and labels
    pred_labels = []
    true_labels = []

    # store all predictions
    for output in outputs:
        pred_labels.extend(torch.argmax(output, dim=-1))

    # store all labels
    for label in labels:
        true_labels.extend(label)

    # compute the accuracy
    eval_score = accuracy_metric.compute(references=true_labels, predictions=pred_labels)

    return eval_score[metric_type]
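To make the logic of `eval_fn` concrete, here is a dependency-free sketch of the same idea: take the argmax of each row of logits, flatten the per-batch results, and compare them with the labels. The logits and labels are made up for illustration, and the helper names are hypothetical.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(
    batched_logits: List[List[Sequence[float]]],
    batched_labels: List[List[int]],
) -> float:
    """Flatten per-batch outputs and labels, then compare predictions to labels."""
    predictions = [argmax(row) for batch in batched_logits for row in batch]
    references = [label for batch in batched_labels for label in batch]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Two batches of two samples each, with made-up logits for a two-class task.
logits = [[[0.1, 0.9], [2.0, -1.0]], [[0.3, 0.2], [-0.5, 0.5]]]
labels = [[1, 0], [1, 1]]
print(accuracy(logits, labels))
```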

## Step 3: Download the model weights from HuggingFace

In [10]:
# Load BERT tokenizer and model from HuggingFace for text classification task
model_ckpt = "textattack/bert-base-uncased-SST-2"
tokenizer = BertTokenizer.from_pretrained(model_ckpt)
model = BertForSequenceClassification.from_pretrained(model_ckpt)

## Step 4: Set optimal configurations

For every model, you can adjust TT-BUDA configuration parameters to achieve optimized performance. Some key parameters include:

* Data format (e.g. BFP8, FP16_b, FP16)
* Math fidelity
* Balancer policy

For a full list of tuneable parameters, please refer to the TT-BUDA documentation: <https://docs.tenstorrent.com/tenstorrent/>
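To give a feel for what a reduced-precision data format trades away, the sketch below truncates a 32-bit float to its top 16 bits. This assumes that `FP16_b` (`Float16_b`) denotes a bfloat16-style layout with 8 exponent bits and 7 mantissa bits; the helper name is hypothetical, and real hardware may round rather than truncate.

```python
import struct


def to_bfloat16_truncated(x: float) -> float:
    """Keep only the top 16 bits of the float32 encoding of x.

    This uses simple truncation; hardware and libraries may round instead.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (value,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return value


for x in (1.0, 3.14159, 1e-3):
    print(x, "->", to_bfloat16_truncated(x))
```

The exponent range of float32 is preserved, so large and small magnitudes survive, but only about two to three significant decimal digits remain.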

In [11]:
# Set optimal configurations
compiler_cfg = pybuda.config._get_global_compiler_config()
compiler_cfg.default_df_override = pybuda._C.DataFormat.Float16_b
compiler_cfg.enable_auto_transposing_placement = True
compiler_cfg.balancer_policy = "Ribbon"

## Step 5: Instantiate Tenstorrent device

The first time we use PyBUDA, we must initialize a `TTDevice` object which serves as the abstraction over the target hardware.

In [12]:
tt0 = pybuda.TTDevice(
    name="tt_device_0",  # here we can give our device any name we wish, for tracking purposes
    arch=pybuda.BackendDevice.Grayskull  # we set the target device architecture to compile for
)

## Step 6: Create a PyBUDA module from PyTorch model

Next, we wrap the PyTorch model loaded from HuggingFace in a `pybuda.PyTorchModule` object. This tells the BUDA compiler which model architecture and AI framework it has to compile.

We then "place" this module onto the previously initialized `TTDevice`.

In [13]:
# Create module
pybuda_module = pybuda.PyTorchModule(
    name="pt_bert_text_classification",  # give the module a name, this will be used for tracking purposes
    module=model  # specify the model that is being targeted for compilation
)

# Place module on device
tt0.place_module(module=pybuda_module)

## Step 7: Load the SST2 dataset for evaluation

In [14]:
dataset = SST2Dataset(dataset=load_dataset("glue", "sst2"), tokenizer=tokenizer, split="validation", seq_len=128)

Downloading builder script: 28.8kB [00:00, 20.8MB/s]                   
Downloading metadata: 28.7kB [00:00, 5.09MB/s]                   


Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /home/jonathan/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data: 100%|██████████| 7.44M/7.44M [00:00<00:00, 57.2MB/s]
                                                                                       

Dataset glue downloaded and prepared to /home/jonathan/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 1209.31it/s]


## Step 8: Set the batch size, prep the dataset, and load a sample input

In [15]:
# set batch size
batch_size = 64

# prepare the dataset for specified batch size
generator = DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=True)

# get sample input
sample_input, _ = next(iter(generator))
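Because the `DataLoader` is created with `drop_last=True`, the final partial batch is discarded and every batch holds exactly `batch_size` samples. The arithmetic below is a small sketch of how many samples are actually evaluated; the helper name is hypothetical, and 872 is the size of the SST-2 validation split in GLUE.

```python
from typing import Tuple


def evaluated_samples(dataset_size: int, batch_size: int, drop_last: bool = True) -> Tuple[int, int]:
    """Return (number_of_batches, number_of_samples_seen) for one pass."""
    full_batches, remainder = divmod(dataset_size, batch_size)
    if drop_last or remainder == 0:
        return full_batches, full_batches * batch_size
    # Without drop_last, one extra, smaller batch covers the remainder.
    return full_batches + 1, dataset_size


print(evaluated_samples(872, 64))
print(evaluated_samples(872, 64, drop_last=False))
```

With a batch size of 64, 40 of the 872 validation samples are skipped, so accuracy figures for different batch sizes are computed over slightly different subsets.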

## Step 9: Compile the model with fixed batch size

In [16]:
start_compilation_time = time.time()
output_q = pybuda.initialize_pipeline(training=False, sample_inputs=list(sample_input.values()))
end_compilation_time = time.time()



2024-03-06 14:39:21.005 | INFO     | Always          - initialize_child_process called on pid 577263


  jax.tree_util.register_keypaths(data_clz, keypaths)
  jax.tree_util.register_keypaths(data_clz, keypaths)
2024-03-06 14:39:34.047 | INFO     | tvm.relay.op.contrib.buda.buda:visit_call:817 - Adding: embedding to fallback
2024-03-06 14:39:34.048 | INFO     | tvm.relay.op.contrib.buda.buda:visit_call:817 - Adding: embedding to fallback
2024-03-06 14:39:34.051 | INFO     | tvm.relay.op.contrib.buda.buda:visit_call:817 - Adding: embedding to fallback
2024-03-06 14:39:37.610 | INFO     | tvm.relay.op.contrib.buda.buda:_cpu_eval:562 - cast will be executed on CPU
2024-03-06 14:39:37.611 | INFO     | tvm.relay.op.contrib.buda.buda:_cpu_eval:562 - strided_slice will be executed on CPU
2024-03-06 14:39:37.612 | INFO     | tvm.relay.op.contrib.buda.buda:_cpu_eval:562 - broadcast_to will be executed on CPU
2024-03-06 14:39:37.612 | INFO     | tvm.relay.op.contrib.buda.buda:_cpu_eval:562 - cast will be executed on CPU
2024-03-06 14:39:37.613 | INFO     | tvm.relay.op.contrib.buda.buda:_cpu_eval:

2024-03-06 14:39:45.525 | INFO     | Always          - initialize_child_process called on pid 579546
2024-03-06 14:39:45.665 | INFO     | Always          - initialize_child_process called on pid 579556
2024-03-06 14:39:45.681 | INFO     | SiliconDriver   - Detected 1 PCI device
2024-03-06 14:39:45.687 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0xfaca revision: 0)


2024-03-06 14:39:45.731 | INFO     | pybuda.compile:pybuda_compile:220 - Device grid size: r = 10, c = 12
2024-03-06 14:39:45.732 | INFO     | pybuda.compile:pybuda_compile:230 - Using chips: [0]
2024-03-06 14:39:45.732 | INFO     | pybuda.compile:pybuda_compile:246 - Generating initial graph
2024-03-06 14:39:45.992 | INFO     | pybuda.compile:pybuda_compile:319 - Running post initial graph pass
2024-03-06 14:39:46.394 | INFO     | pybuda.compile:pybuda_compile:391 - Running post autograd graph pass
2024-03-06 14:39:46.571 | INFO     | pybuda.compile:pybuda_compile:424 - Lowering to Buda

2024-03-06 14:39:46.710 | INFO     | GraphCompiler   - Running with Automatic Mixed Precision Level = 0.
2024-03-06 14:39:46.766 | INFO     | Always          - Running Balancer with Policy: PolicyType::Ribbon
2024-03-06 14:39:47.510 | INFO     | Always          - Running Balancer with Policy: PolicyType::Ribbon
2024-03-06 14:39:48.831 | INFO     | Balancer        - Starting Ribbon balancing.
2024-03-06 14:39:49.117 | INFO     | Balancer        - Balancing 1% complete.
2024-03-06 14:39:49.366 | INFO     | Balancer        - Balancing 4% complete.
2024-03-06 14:39:49.713 | INFO     | Balancer        - Balancing 19% complete.
2024-03-06 14:39:50.099 |

2024-03-06 14:39:53.584 | INFO     | pybuda.compile:pybuda_compile:626 - Generating Netlist
2024-03-06 14:39:53.816 | INFO     | pybuda.ci:create_symlink:85 - Symlink created from /home/jonathan/Desktop/tenstorrent/tt-buda-demos/first_5_steps/pt_bert_text_classification_tt_1_netlist.yaml to /tmp/jonathan/8148d80c3efc/pt_bert_text_classification_tt_1_netlist.yaml
2024-03-06 14:39:55.800 | DEBUG    | pybuda.tensor:consteval_tensor:1177 - ConstEval graph: input_1_multiply_18
2024-03-06 14:39:55.801 | DEBUG    | pybuda.tensor:consteval_tensor:1177 - ConstEval graph: input_0_subtract_21
2024-03-06 14:39:55.801 | DEBUG    | pybuda.tensor:consteval_tensor:1177 - ConstEval graph: input_1_multiply_22
2024-03-06 14:39:55.801 | DEBUG    | pybuda.tensor:consteval_tensor:1177 - ConstEval graph: input_1_multiply_75
2024-03-06 14:39:55.802 | DEBUG    | pybuda.tensor:consteval_tensor:1177 - ConstEval graph: input_1_multiply_128
2024-03-06 14:39:55.802 | DEBUG    | pybuda.tensor:consteval_tensor:1177 -

2024-03-06 14:39:55.937 | INFO     | Always          - Running tt_runtime on host: 'benderv2'
2024-03-06 14:39:55.937 | INFO     | PerfInfra       - Backend profiler is disabled
2024-03-06 14:39:55.937 | INFO     | Netlist         - Parsing Netlist from file: /tmp/jonathan/8148d80c3efc/pt_bert_text_classification_tt_1_netlist.yaml
2024-03-06 14:39:56.151 | INFO     | SiliconDriver   - Detected 1 PCI device
2024-03-06 14:39:56.152 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0xfaca revision: 0)
2024-03-06 14:39:56.451 | INFO     | Always          - Using Default BRISC Bin
2024-03-06 14:39:56.451 | INFO     | Compil

2024-03-06 14:40:34.853 | INFO     | pybuda.backend:feeder_thread_main:120 - Feeder thread on <pybuda.backend.BackendAPI object at 0x7f1472c8bfa0> starting
2024-03-06 14:40:34.853 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant lc.input_tensor.layernorm_0.dc.reduce_sum.0.0
2024-03-06 14:40:34.854 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant dc.input_tensor.layernorm_0.1
2024-03-06 14:40:34.854 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant lc.input_tensor.layernorm_0.dc.reduce_sum.5.0
2024-03-06 14:40:34.854 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant dc.input_tensor.layernorm_0.6
2024-03-06 14:40:34.854 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant dc.input_tensor.layernorm_0.8
2024-03-06 14:40:34.854 | DEBUG    | pybuda.backend:push_constants_and_parameters:435 - Pushing to constant input_1_multip

2024-03-06 14:40:35.016 | INFO     | Netlist         - Parsing Netlist from file: /tmp/jonathan/8148d80c3efc/pt_bert_text_classification_tt_1_netlist.yaml
2024-03-06 14:40:35.206 | INFO     | SiliconDriver   - Detected 1 PCI device
2024-03-06 14:40:35.229 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0xfaca revision: 0)
2024-03-06 14:40:35.250 | INFO     | SiliconDriver   - Disable PCIE DMA
2024-03-06 14:40:35.251 | INFO     | Netlist         - Parsing Netlist from file: /tmp/jonathan/8148d80c3efc/pt_bert_text_classification_tt_1_netlist.yaml


## Step 10: Run benchmark on SST2 dataset with `batch_size==64`

In [17]:
# Run benchmark loop
store_outputs = []
store_labels = []
start_runtime_time = time.time()
for batch, labels in generator:
    # push input to Tenstorrent device
    tt0.push_to_inputs(batch)

    # run inference on Tenstorrent device
    pybuda.run_forward(input_count=1)
    output = output_q.get()  # inference will return a queue object, get last returned object

    # store outputs
    store_labels.append(labels)
    store_outputs.append(output[0].value())
end_runtime_time = time.time()

# Process output times
total_runtime_time = end_runtime_time - start_runtime_time
total_compilation_time = end_compilation_time - start_compilation_time
total_samples = len(generator) * batch_size
eval_score = eval_fn(store_outputs, store_labels, "accuracy")

2024-03-06 14:40:35.262 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.263 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.264 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'

2024-03-06 14:40:35.266 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:35.265 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt_device_0' / 579556
2024-03-06 14:40:35.265 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt_device_0'
2024-03-06 14:40:35.266 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-06 14:40:35.266 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue pt_bert_text_classification_tt_1.output_add_651

2024-03-06 14:40:35.452 | INFO     | SiliconDriver   - Detected 1 PCI device


2024-03-06 14:40:35.503 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:35.503 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.521 | DEBUG    | pybuda.cpudevice:forward_pt:265 - Ending forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.521 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-06 14:40:35.522 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue pybuda_6_i0
2024-03-06 14:40:35.528 | DEBUG    | pybuda.backend:push_to_queues:407 - Pushing to queue attention_mask_1
2024-03-06 14:40:35.638 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:35.638 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:35.640 | INFO     | pybuda.device:push_to_inputs:216 - push

2024-03-06 14:40:35.483 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0xfaca revision: 0)
2024-03-06 14:40:35.502 | INFO     | SiliconDriver   - Disable PCIE DMA
2024-03-06 14:40:35.642 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:35.642 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:35.642 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.642 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on TTDevice 'tt_device_0' / 579556
2024-03-06 14:40:35.642 | DEBUG    | pybuda.ttdevice:forward:862 - Starting forward on TTDevice 'tt_device_0'
2024-03-06 14:40:35.642 | DEBUG    | pybuda.backend:feeder_thread_main:142 - Run feeder thread cmd: fwd
2024-03-06 14:40:35.643 | DEBUG    | pybuda.backend:read_queues:316 - Reading output queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:35.657 | DEBUG    | pybuda.cpudevice:forward_pt:265 - Ending forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:35.657 | DEBUG    | pybuda.device_connector:pusher_thread_main:159 - Pusher thread pushing tensors
2024-03-06 14:40:35.

2024-03-06 14:40:35.778 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-06 14:40:35.907 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:36.033 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:36.034 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:36.035 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.035 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.036 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'
2024-03-06 14:40:36.037 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:36.037 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.037 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD com

2024-03-06 14:40:36.038 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-06 14:40:36.168 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:36.292 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:36.292 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:36.293 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.294 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.295 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'
2024-03-06 14:40:36.295 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:36.295 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.297 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD com

2024-03-06 14:40:36.297 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-06 14:40:36.425 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:36.549 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:36.550 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:36.551 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.552 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.553 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'
2024-03-06 14:40:36.554 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:36.554 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.554 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD com

2024-03-06 14:40:36.555 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-06 14:40:36.684 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:36.810 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:36.810 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:36.811 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.812 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.813 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'
2024-03-06 14:40:36.813 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:36.813 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:36.814 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD com

2024-03-06 14:40:36.814 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]
2024-03-06 14:40:36.945 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


2024-03-06 14:40:37.068 | DEBUG    | pybuda.backend:read_queues:376 - Done reading queues
2024-03-06 14:40:37.068 | DEBUG    | pybuda.backend:pop_queues:382 - Popping from queue pt_bert_text_classification_tt_1.output_add_651
2024-03-06 14:40:37.070 | INFO     | pybuda.device:push_to_inputs:216 - push_to_inputs redirected from TTDevice 'tt_device_0' to CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.070 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.071 | DEBUG    | pybuda.run.impl:_run_forward:641 - Running concurrent device forward: TTDevice 'tt_device_0'
2024-03-06 14:40:37.071 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD command on CPUDevice 'cpu0_fallback' / 579546
2024-03-06 14:40:37.072 | DEBUG    | pybuda.cpudevice:forward_pt:191 - Starting forward on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.072 | DEBUG    | pybuda.device:run_next_command:426 - Received RUN_FORWARD com

2024-03-06 14:40:37.073 | INFO     | Runtime         - Running program 'run_fwd_0' with params [("$p_loop_count", "1")]


Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 1.13MB/s]


In [18]:
# Display results
print("Benchmark Result")
print(f" Model compilation time: {total_compilation_time:.3f}s")
print(f" Total runtime time for {total_samples} inputs: {total_runtime_time:.3f}s")
print(f" Throughput: {(total_samples / total_runtime_time):.1f} samples/s")
print(f" Accuracy: {(eval_score * 100):.1f}%")

Benchmark Result
 Model compilation time: 74.247s
 Total runtime time for 832 inputs: 1.939s
 Throughput: 429.1 samples/s
 Accuracy: 92.1%
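The reported figures can be checked by hand. The sketch below plugs the numbers from the benchmark output into the throughput formula, and also shows what the figure would look like if compilation time were wrongly folded into the measurement. The per-batch time is a wall-clock mean over the whole loop, so it includes host-side overhead as well as device execution.

```python
total_samples = 832            # from the benchmark output
total_runtime_s = 1.939        # from the benchmark output
total_compilation_s = 74.247   # from the benchmark output
batch_size = 64                # value set in Step 8

num_batches = total_samples // batch_size
throughput = total_samples / total_runtime_s
mean_batch_ms = 1000 * total_runtime_s / num_batches

# Folding the one-off compile cost into the measurement hides the real rate.
naive_throughput = total_samples / (total_compilation_s + total_runtime_s)

print(f"Batches: {num_batches}")
print(f"Throughput: {throughput:.1f} samples/s")
print(f"Mean time per batch: {mean_batch_ms:.1f} ms")
print(f"Throughput if compile time were included: {naive_throughput:.1f} samples/s")
```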


## Step 11: Shut down PyBUDA

In [19]:
pybuda.shutdown()

2024-03-06 14:40:37.574 | DEBUG    | pybuda.run.impl:_shutdown:1262 - PyBuda shutdown
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on TTDevice 'tt_device_0'
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:416 - Received SHUTDOWN command on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on TTDevice 'tt_device_0'
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:419 - Waiting for barrier on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.575 | DEBUG    | pybuda.run.impl:_shutdown:1278 - Waiting until processes done
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on TTDevice 'tt_device_0'
2024-03-06 14:40:37.575 | DEBUG    | pybuda.device:run_next_command:421 - Shutting down on CPUDevice 'cpu0_fallback'
2024-03-06 14:40:37.583 | DEBUG    | pybuda.device:atexit_handler:919 - atexit handler

2024-03-06 14:40:37.575 | INFO     | Always          - finish_child_process called on pid 579546
2024-03-06 14:40:37.583 | INFO     | Always          - finish_child_process called on pid 579546
2024-03-06 14:40:37.603 | INFO     | Runtime         - Waiting for cluster completion
2024-03-06 14:40:37.604 | INFO     | PerfPostProcess - Writing the host postprocess report in /tmp/jonathan/8148d80c3efc/perf_results//host/device_alignment_th_1575208074_proc_579556.json
2024-03-06 14:40:37.654 | INFO     | Runtime         - Closed all devices successfully
2024-03-06 14:40:37.654 | INFO     | PerfCheck       - Starting performance check for host events
2024-03-06 14:40:37.654 | INFO     |