# Hugging Face 🤗 NLP Transformers pipelines with ONNX

![logo](assets/logo.png)

*This project is linked to the Medium blog post: [How to use Hugging Face 🤗 Transformers with ONNX in real world]() (incoming)*

## 📦 Working environment

First, install all the required dependencies. It is recommended to use an isolated environment to avoid conflicts.

You can use any package manager you want. I recommend [`conda`](https://conda.io/).

```bash
conda create -y -n hf-onnx python=3.8
```

The project requires Python 3.8 or higher.

All required dependencies are listed in the `requirements.txt` file. To install them, run the following command:


In [2]:
!pip install -r requirements.txt

Ignoring colorama: markers 'platform_system == "Windows" and python_full_version >= "3.6.0" and python_version >= "3.6"' don't match your environment
Ignoring pyreadline3: markers 'sys_platform == "win32" and python_version >= "3.8" and (python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0")' don't match your environment
Collecting charset-normalizer==2.0.12
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting flatbuffers==2.0
  Using cached flatbuffers-2.0-py2.py3-none-any.whl (26 kB)
Collecting joblib==1.1.0
  Using cached joblib-1.1.0-py2.py3-none-any.whl (306 kB)
Collecting onnxruntime==1.10.0
  Using cached onnxruntime-1.10.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB)
Collecting psutil==5.9.0
  Using cached psutil-5.9.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (283 kB)
Collecting requests==2.27.1
  Using cached requests-2.27.1-py2.py3-none

## 🍿 Export the model to ONNX

For this example, we can use any TokenClassification model from Hugging Face's library because the task we are trying to solve is `Named Entity Recognition` (NER). 

I chose the [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) model because it is a `base` model, which means moderate computation time on CPU. The BERT architecture is also a good fit for NER.

Hugging Face's `transformers` library provides a convenient way to export the model to ONNX format. You can refer to the [official documentation](https://huggingface.co/docs/transformers/serialization#exporting-transformers-models) for more details.

We use the `bert-base-NER` model mentioned above with `token-classification` as the feature, since that is the task we are trying to solve. You can see the list of available features by executing the following cell:

In [5]:
from transformers.onnx.features import FeaturesManager

bert_features = list(FeaturesManager.get_supported_features_for_model_type("bert").keys())
print(bert_features)

['default', 'masked-lm', 'causal-lm', 'sequence-classification', 'token-classification', 'question-answering']


When invoking the conversion script, you must specify the model, either as a local directory or as a model name on the Hugging Face Hub. You also need to specify the feature, as seen above. The last argument is the output directory.

We pass `onnx/` as the output directory. This is where the ONNX model will be saved.

We leave the `opset` parameter at its default, which is defined in the ONNX config for the model.

Finally, we leave the `atol` parameter at its default, 1e-05. This is the absolute tolerance used when comparing the outputs of the original PyTorch model with those of the ONNX model.

So here is the command to export the model to ONNX format:

In [4]:
!python -m transformers.onnx --model=dslim/bert-base-NER --feature=token-classification onnx/

Using framework PyTorch: 1.10.2+cu113
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'logits'})
	- Validating ONNX Model output "logits":
		-[✓] (2, 8, 9) matches (2, 8, 9)
		-[✓] all values close (atol: 1e-05)
All good, model saved at: onnx/model.onnx
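
To make the `atol` line in the validation output more concrete, here is a minimal sketch of an absolute-tolerance comparison. The `all_close` helper and the logit values are made up for illustration; the exporter's own check may differ in detail.

```python
def all_close(reference, candidate, atol=1e-5):
    """Return True when every pair of values differs by at most `atol`."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol for r, c in zip(reference, candidate))


# Made-up logits standing in for the PyTorch and ONNX outputs.
pytorch_logits = [0.12345, -1.50000, 3.20000]
onnx_logits = [0.123455, -1.500003, 3.199998]

print(all_close(pytorch_logits, onnx_logits))          # differences stay below 1e-05
print(all_close(pytorch_logits, [0.1235, -1.5, 3.2]))  # first value is off by 5e-05
```

A looser tolerance accepts larger numerical drift between the two models, while a tighter one makes the validation step stricter.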


## 💫 Use the ONNX model with Transformers pipeline

Now that we have exported the model to ONNX format, we can use it with the Transformers pipeline.

Let's first import the required packages:

In [6]:
import torch

from onnxruntime import (
    InferenceSession, SessionOptions, GraphOptimizationLevel
)
from transformers import (
    TokenClassificationPipeline, AutoTokenizer, AutoModelForTokenClassification
)

The process is simple:
- Create a session with the ONNX model, which loads the model and lets you run inference.
- Override the `_forward` and `preprocess` methods of the pipeline to use the ONNX model.
- Run the pipeline.

#### ⚙️ Create a session with the ONNX model

In [9]:
options = SessionOptions() # initialize session options
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL

session = InferenceSession(
    "onnx/model.onnx", sess_options=options, providers=["CPUExecutionProvider"]
)

session.disable_fallback() # disable the session.run() fallback mechanism, so a failed run does not reset the execution providers

Here we use only the `CPUExecutionProvider`, which is ONNX Runtime's default execution provider. You can pass one or more execution providers to the session. For example, you can use the `CUDAExecutionProvider` to run the model on GPU. Providers are tried in the order given, so the session uses the first one in the list that is available on the machine.

You can list the execution providers that ONNX Runtime knows about by executing the following cell. Note that `get_all_providers()` returns every known provider, whether or not your installed build supports it; `get_available_providers()` returns only the ones usable in your installation.

In [16]:
from onnxruntime import get_all_providers

get_all_providers()

['TensorrtExecutionProvider',
 'CUDAExecutionProvider',
 'MIGraphXExecutionProvider',
 'ROCMExecutionProvider',
 'OpenVINOExecutionProvider',
 'DnnlExecutionProvider',
 'NupharExecutionProvider',
 'VitisAIExecutionProvider',
 'NnapiExecutionProvider',
 'CoreMLExecutionProvider',
 'ArmNNExecutionProvider',
 'ACLExecutionProvider',
 'DmlExecutionProvider',
 'RknpuExecutionProvider',
 'CPUExecutionProvider']
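
Since the provider list acts as a priority order, one option is to filter a preference list against what the machine actually offers before creating the session. The `pick_providers` helper below is hypothetical and only illustrates the idea; in a real setup, `available` could come from `onnxruntime.get_available_providers()`.

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that are available, preserving priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise ValueError(f"None of {preferred} is available (got {available})")
    return chosen


# Hypothetical machine with a CPU-only build: only the CPU provider is usable.
available = ["CPUExecutionProvider"]
preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]

print(pick_providers(preferred, available))
```

The returned list can then be passed as the `providers` argument of `InferenceSession`.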

#### ⚒️ Create a pipeline with the ONNX model

Now that we have a session with the ONNX model ready to use, we can subclass the original `TokenClassificationPipeline` class to use the ONNX model.

To fully understand what is happening, you can refer to the source code of the [`TokenClassificationPipeline` class](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/pipelines/token_classification.py#L86). 

We will only override the `_forward` and `preprocess` methods, because the other methods do not depend on the model format.

In [23]:
class OnnxTokenClassificationPipeline(TokenClassificationPipeline):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
    
    def _forward(self, model_inputs):
        """
        Forward pass through the model. This method is not to be called by the user directly and is only used
        by the pipeline to perform the actual predictions.

        This is where we will define the actual process to do inference with the ONNX model and the session created
        before.
        """

        # This comes from the original implementation of the pipeline
        special_tokens_mask = model_inputs.pop("special_tokens_mask")
        offset_mapping = model_inputs.pop("offset_mapping", None)
        sentence = model_inputs.pop("sentence")

        inputs = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()} # dict of numpy arrays
        outputs_name = session.get_outputs()[0].name # get the name of the output tensor

        logits = session.run(output_names=[outputs_name], input_feed=inputs)[0] # run the session
        logits = torch.tensor(logits) # convert to torch tensor to be compatible with the original implementation

        return {
            "logits": logits,
            "special_tokens_mask": special_tokens_mask,
            "offset_mapping": offset_mapping,
            "sentence": sentence,
            **model_inputs,
        }

    # We need to override the preprocess method because the ONNX model expects the attention mask as an input
    # along with the token IDs.
    def preprocess(self, sentence, offset_mapping=None):
        truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
        model_inputs = self.tokenizer(
            sentence,
            return_attention_mask=True, # This is the only difference from the original implementation
            return_tensors=self.framework,
            truncation=truncation,
            return_special_tokens_mask=True,
            return_offsets_mapping=self.tokenizer.is_fast,
        )
        if offset_mapping:
            model_inputs["offset_mapping"] = offset_mapping

        model_inputs["sentence"] = sentence

        return model_inputs
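
The `_forward` method above pops the bookkeeping keys before building `inputs`, so that only the tensors the ONNX graph declares as inputs reach `session.run`. A more defensive variant could select the inputs by name instead. The sketch below is only an illustration: `select_graph_inputs` is a hypothetical helper, the input names are what I would expect for a BERT export, and in practice the names could be read with `[i.name for i in session.get_inputs()]`.

```python
def select_graph_inputs(model_inputs, expected_names):
    """Keep only the entries the ONNX graph declares as inputs."""
    missing = [name for name in expected_names if name not in model_inputs]
    if missing:
        raise KeyError(f"Missing inputs for the ONNX graph: {missing}")
    return {name: model_inputs[name] for name in expected_names}


# Plain lists stand in for the tokenizer's tensors in this sketch.
model_inputs = {
    "input_ids": [[101, 7592, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
    "special_tokens_mask": [[1, 0, 1]],  # pipeline bookkeeping, not a graph input
}
expected_names = ["input_ids", "attention_mask", "token_type_ids"]

print(sorted(select_graph_inputs(model_inputs, expected_names)))
```

Failing early on a missing input gives a clearer error than letting the runtime reject the feed.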


#### 🏃 Run the pipeline

We have everything set up, so we can run the pipeline.

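As usual, the pipeline needs a tokenizer, a model and a task. We will use the `ner` task. The PyTorch model is still passed so that the pipeline can read its configuration (for example the label mapping), while inference itself goes through the ONNX session.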

In [24]:
model_name_from_hub = "dslim/bert-base-NER"

tokenizer = AutoTokenizer.from_pretrained(model_name_from_hub)
model = AutoModelForTokenClassification.from_pretrained(model_name_from_hub)

onnx_pipeline = OnnxTokenClassificationPipeline(
    task="ner", 
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    aggregation_strategy="simple",
)

Let's see if we can run the pipeline:

In [25]:
sequence = "Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer"

onnx_pipeline(sequence)

[{'entity_group': 'ORG',
  'score': 0.9978969,
  'word': 'Apple',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.9981243,
  'word': 'Steve Jobs',
  'start': 29,
  'end': 39},
 {'entity_group': 'PER',
  'score': 0.9741297,
  'word': 'Steve Wozniak',
  'start': 41,
  'end': 54},
 {'entity_group': 'PER',
  'score': 0.99970996,
  'word': 'Ronald Wayne',
  'start': 59,
  'end': 71},
 {'entity_group': 'PER',
  'score': 0.86664414,
  'word': 'Wozniak',
  'start': 92,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.99852806,
  'word': 'Apple I',
  'start': 102,
  'end': 109}]

**There it is: the pipeline runs well with the ONNX model!** 🎉

## 🧪 Benchmarking a full pipeline (Optional)

We will benchmark by measuring the inference time of the pipeline with the ONNX model and with the PyTorch model.

The PyTorch model is already loaded, so we only need to create a standard pipeline with it.

In [26]:
pytorch_pipeline = TokenClassificationPipeline(
    task="ner", 
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    aggregation_strategy="simple",
)
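
Before timing anything, it can be worth confirming that both pipelines return the same entities. The sketch below is one way to do that, assuming the output format shown earlier; `same_entities` and its `score_tol` default are hypothetical choices, and the sample scores are made up.

```python
import math


def same_entities(reference, candidate, score_tol=1e-3):
    """Check that two pipeline outputs agree on spans, labels and (roughly) scores."""
    if len(reference) != len(candidate):
        return False
    for ref, cand in zip(reference, candidate):
        for key in ("entity_group", "word", "start", "end"):
            if ref[key] != cand[key]:
                return False
        if not math.isclose(ref["score"], cand["score"], abs_tol=score_tol):
            return False
    return True


# Made-up outputs standing in for pytorch_pipeline(...) and onnx_pipeline(...).
pytorch_out = [{"entity_group": "ORG", "score": 0.99790, "word": "Apple", "start": 0, "end": 5}]
onnx_out = [{"entity_group": "ORG", "score": 0.99789, "word": "Apple", "start": 0, "end": 5}]

print(same_entities(pytorch_out, onnx_out))
```

With the real pipelines, the call would look like `same_entities(pytorch_pipeline(text), onnx_pipeline(text))` for some input string `text`.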

We will test both pipelines with the same data and 3 different sequence lengths.

In [27]:
sequences = {
    "short_sequence": "Hello my name is Thomas and I love HuggingFace.",
    "medium_sequence": "Winston Churchill was born in 1874 in Stoke-on-Trent, England, to a German father, William and Elizabeth Churchill.",
    "long_sequence": """The first person to reach the summit of Everest was the South Nepalese Everest Gurun, 
                who was a member of the Royal Nepal Expedition, led by the Nepalese Mountaineer, Sir Edmund Hillary. 
                Hilary lived in the Himalayas for a time. He sadly died in 1953 at the age of 88."""
}

Let's measure the inference time of each pipeline on the 3 sequences. We run each pipeline 300 times per sequence to get a more stable measurement.

In [46]:
import timeit

results = [["Sequence Length", "PyTorch", "ONNX"]]
for k, v in sequences.items():
    results.append(
        [k, timeit.timeit(lambda: pytorch_pipeline(v), number=300), timeit.timeit(lambda: onnx_pipeline(v), number=300)]
    )

Let's put everything in a table and compare the results. Each value is the total time in seconds for 300 runs:

In [47]:
from tabulate import tabulate

print(tabulate(results, headers="firstrow"))

Sequence Length      PyTorch      ONNX
-----------------  ---------  --------
short_sequence       13.6261   5.91692
medium_sequence      16.245    7.07232
long_sequence        30.1183  10.7552


Wow that looks great! 🎉

Let's calculate the ratio of the PyTorch inference time to the ONNX inference time.

In [49]:
print(f"For a short sequence: ONNX is {results[1][1]/results[1][2]:.2f}x faster than PyTorch")
print(f"For a medium sequence: ONNX is {results[2][1]/results[2][2]:.2f}x faster than PyTorch")
print(f"For a long sequence: ONNX is {results[3][1]/results[3][2]:.2f}x faster than PyTorch")

For a short sequence: ONNX is 2.30x faster than PyTorch
For a medium sequence: ONNX is 2.30x faster than PyTorch
For a long sequence: ONNX is 2.80x faster than PyTorch


On the long sequence, we nearly achieved a 3x speedup! 🎉
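
The totals above cover 300 runs each. As a possible refinement, the sketch below reports a per-call latency instead, with a few untimed warm-up calls and several repeats. The `per_call_ms` helper and its defaults are assumptions for illustration, and the workload here is a stand-in rather than a real pipeline.

```python
import timeit


def per_call_ms(fn, number=300, repeat=3, warmup=5):
    """Return the best average latency of `fn` in milliseconds per call."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number * 1000


# Stand-in workload; with the pipelines it could be `lambda: onnx_pipeline(text)`.
latency = per_call_ms(lambda: sum(range(1000)), number=100)
print(f"{latency:.4f} ms per call")
```

Taking the minimum over repeats reduces the influence of background noise on the machine, which is the usual advice for `timeit`-style measurements.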