# HuggingFace Pretrained RoBERTa Inference on Trn1

## Introduction

This notebook demonstrates how to compile and run a HuggingFace 🤗 Transformers RoBERTa model for accelerated inference on Neuron. It uses the [`roberta-large`](https://huggingface.co/roberta-large) model, which is primarily used for masked language modeling, sequence classification, and question answering.

This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger).

## Install Dependencies
This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `transformers`

Most of these packages are installed when you configure your environment using the Trn1 setup guide. Install any remaining dependencies here:

In [None]:
# !pip install -U transformers
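
Before compiling, it can help to confirm which of the packages listed above are actually present. The following is a minimal sketch, not part of the original tutorial: it assumes the pip distribution names match the list above, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence

# Distribution names taken from the dependency list in this notebook.
REQUIRED = ("torch-neuronx", "neuronx-cc", "transformers")


def installed_versions(packages: Sequence[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the status of each required package.
for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing should be installed before running the cells that follow.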

## Compile the model into an AWS Neuron optimized TorchScript

In the following section, we load the model and tokenizer, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as `TorchScript`.

`torch_neuronx.trace()` expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenizer output. Additionally, the input shape used during compilation must match the input shape used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference.
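
To make the fixed-shape requirement concrete, here is a pure-Python sketch of what padding to a maximum length produces. It is an illustration only, not the tokenizer's implementation: the helper `pad_to_max_length` is hypothetical, and its truncation simply drops trailing tokens, whereas the real tokenizer also keeps the closing special token. The pad id of 1 matches `padding_idx=1` in the model printout.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 1  # RoBERTa padding id (see padding_idx=1 in the model printout)


def pad_to_max_length(
    token_ids: List[int], max_length: int, pad_id: int = PAD_TOKEN_ID
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token_ids to exactly max_length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding. Truncation here is naive (assumption): it drops
    trailing tokens without preserving special tokens.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Token ids of the example sentence, as produced by the tokenizer in this notebook.
ids = [0, 9064, 6406, 162, 30, 143, 2788, 47, 1017, 101, 4, 2]
input_ids, attention_mask = pad_to_max_length(ids, max_length=128)
print(len(input_ids), sum(attention_mask))
```

Every input padded this way has the same length, so a model traced with a 128-token example can be called with any sentence that fits within 128 tokens.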

In [1]:
import torch
import torch_neuronx
from transformers import RobertaTokenizer, RobertaModel

In [2]:
# Create the tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')
model.eval()

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 1024, padding_idx=1)
    (position_embeddings): Embedding(514, 1024, padding_idx=1)
    (token_type_embeddings): Embedding(1, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
            (d

In [11]:
# Get an example input
text = "Replace me by any text you'd like."

encoded_input = tokenizer(
    text,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

example = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

In [4]:
# Run inference on CPU
output_cpu = model(*example)

# Compile the model
model_neuron = torch_neuronx.trace(model, example)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

In [6]:
output_cpu

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1413, -0.1705,  0.0704,  ...,  0.1543,  0.0059,  0.0970],
         [ 0.0190, -0.2629, -0.4194,  ...,  0.0526, -0.0291,  0.4754],
         [ 0.0122, -0.1903, -0.1501,  ...,  0.2513, -0.1726,  0.2928],
         ...,
         [-0.2234,  0.1380, -0.3893,  ..., -0.3552, -0.0458,  0.2757],
         [ 0.0411,  0.0823,  0.0195,  ..., -0.0522,  0.1403,  0.0183],
         [ 0.0297,  0.0509,  0.0378,  ..., -0.0723,  0.0957, -0.0018]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.1152,  0.7719,  0.3737,  ...,  0.1628,  0.3116, -0.1919]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

## Run inference and compare results

In this section we load the compiled model, run inference on Neuron, and compare the CPU and Neuron outputs.

In [5]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

The following code runs inference with the loaded TorchScript model (`model_neuron`), passing `example` as input and storing the result in `output_neuron`. It then compares the CPU output (`output_cpu`) with the Neuron output (`output_neuron`) for two fields, `last_hidden_state` and `pooler_output`, printing the first 10 elements of each. This verifies that the model produces closely matching results on CPU and Neuron; small numerical differences between the two are expected.

# Run inference using the Neuron model
output_neuron = model_neuron(*example)

# Compare the results
print(f"CPU last_hidden_state:    {output_cpu['last_hidden_state'][0][0][:10]}")
print(f"Neuron last_hidden_state: {output_neuron['last_hidden_state'][0][0][:10]}")
print(f"CPU pooler_output:        {output_cpu['pooler_output'][0][:10]}")
print(f"Neuron pooler_output:     {output_neuron['pooler_output'][0][:10]}")

CPU last_hidden_state:    tensor([-0.1413, -0.1705,  0.0704,  0.1655,  0.2243, -0.0064, -0.0461, -0.0680,
         0.2412,  0.1152], grad_fn=<SliceBackward0>)
Neuron last_hidden_state: tensor([-0.1372, -0.1704,  0.0703,  0.1636,  0.2214, -0.0069, -0.0473, -0.0673,
         0.2391,  0.1157])
CPU pooler_output:        tensor([ 0.1152,  0.7719,  0.3737, -0.7019,  0.6962, -0.8989, -0.6623, -0.0373,
        -0.1810,  0.1844], grad_fn=<SliceBackward0>)
Neuron pooler_output:     tensor([ 0.1176,  0.7728,  0.3747, -0.7002,  0.6959, -0.8977, -0.6618, -0.0378,
        -0.1804,  0.1847])
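
Rather than comparing the printed values by eye, the check can be made numeric. The sketch below is a pure-Python illustration using the first 10 `last_hidden_state` values printed above; the helper `max_abs_diff` and the tolerance `ATOL` are assumptions chosen for illustration, not values from the Neuron documentation. With tensors, `torch.allclose` with a suitable `atol` serves the same purpose.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# First 10 last_hidden_state values, copied from the printed output above.
cpu_last_hidden = [-0.1413, -0.1705, 0.0704, 0.1655, 0.2243,
                   -0.0064, -0.0461, -0.0680, 0.2412, 0.1152]
neuron_last_hidden = [-0.1372, -0.1704, 0.0703, 0.1636, 0.2214,
                      -0.0069, -0.0473, -0.0673, 0.2391, 0.1157]

ATOL = 1e-2  # hypothetical tolerance; tune it for your accuracy requirements

diff = max_abs_diff(cpu_last_hidden, neuron_last_hidden)
print(f"max abs diff: {diff:.4f}, within tolerance: {diff <= ATOL}")
```

For these values the largest difference is about 0.004, comfortably inside the assumed tolerance.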


In [12]:
filename

'model.pt'

In [3]:
# Get an example input
text = "Replace me by any text you'd like."

encoded_input = tokenizer(
    text,
    # max_length=128,
    # padding='max_length',
    # truncation=True,
    return_tensors='pt'
)

In [5]:
encoded_input

{'input_ids': tensor([[   0, 9064, 6406,  162,   30,  143, 2788,   47, 1017,  101,    4,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [8]:
# Run inference on CPU
output_cpu = model(**encoded_input)
output_cpu

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1413, -0.1705,  0.0704,  ...,  0.1543,  0.0059,  0.0970],
         [ 0.0190, -0.2629, -0.4194,  ...,  0.0526, -0.0291,  0.4754],
         [ 0.0122, -0.1903, -0.1501,  ...,  0.2513, -0.1726,  0.2928],
         ...,
         [-0.2234,  0.1380, -0.3893,  ..., -0.3552, -0.0458,  0.2757],
         [ 0.0411,  0.0823,  0.0195,  ..., -0.0522,  0.1403,  0.0183],
         [ 0.0297,  0.0509,  0.0378,  ..., -0.0723,  0.0957, -0.0018]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.1152,  0.7719,  0.3737,  ...,  0.1628,  0.3116, -0.1919]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)

In [13]:
encoded_input

{'input_ids': tensor([[   0, 9064, 6406,  162,   30,  143, 2788,   47, 1017,  101,    4,    2,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1,    1,    1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0,

In [19]:
model_neuron = torch_neuronx.trace(model,
                                   (encoded_input['input_ids'],
                                    encoded_input['attention_mask']))

In [20]:
output_neuron = model_neuron(*(encoded_input['input_ids'], encoded_input['attention_mask']))

In [21]:
output_neuron

{'last_hidden_state': tensor([[[-0.1372, -0.1704,  0.0703,  ...,  0.1553,  0.0051,  0.0957],
          [ 0.0197, -0.2613, -0.4140,  ...,  0.0546, -0.0320,  0.4649],
          [ 0.0110, -0.1867, -0.1470,  ...,  0.2475, -0.1698,  0.2860],
          ...,
          [ 0.1602, -0.0143, -0.2360,  ...,  0.0536,  0.0268,  0.1542],
          [ 0.1602, -0.0143, -0.2360,  ...,  0.0536,  0.0268,  0.1542],
          [ 0.1602, -0.0143, -0.2360,  ...,  0.0536,  0.0268,  0.1542]]]),
 'pooler_output': tensor([[ 0.1176,  0.7728,  0.3747,  ...,  0.1608,  0.3138, -0.1932]])}

In [14]:
# Compile the model
model_neuron = torch_neuronx.trace(
    model,
    (encoded_input['input_ids'], encoded_input['attention_mask']),
)

# Save the TorchScript for inference deployment
filename = 'model_ori.pt'
torch.jit.save(model_neuron, filename)

In [15]:
model_neuron

NeuronModule(original_name=NeuronModule)

In [18]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*(encoded_input['input_ids'], encoded_input['attention_mask']))

# Compare the results
print(f"CPU last_hidden_state:    {output_cpu['last_hidden_state'][0][0][:10]}")
print(f"Neuron last_hidden_state: {output_neuron['last_hidden_state'][0][0][:10]}")
print(f"CPU pooler_output:        {output_cpu['pooler_output'][0][:10]}")
print(f"Neuron pooler_output:     {output_neuron['pooler_output'][0][:10]}")

CPU last_hidden_state:    tensor([-0.1413, -0.1705,  0.0704,  0.1655,  0.2243, -0.0064, -0.0461, -0.0680,
         0.2412,  0.1152], grad_fn=<SliceBackward0>)
Neuron last_hidden_state: tensor([-0.1372, -0.1704,  0.0703,  0.1636,  0.2214, -0.0069, -0.0473, -0.0673,
         0.2391,  0.1157])
CPU pooler_output:        tensor([ 0.1152,  0.7719,  0.3737, -0.7019,  0.6962, -0.8989, -0.6623, -0.0373,
        -0.1810,  0.1844], grad_fn=<SliceBackward0>)
Neuron pooler_output:     tensor([ 0.1176,  0.7728,  0.3747, -0.7002,  0.6959, -0.8977, -0.6618, -0.0378,
        -0.1804,  0.1847])
