# HuggingFace Pretrained GPT2 Feature Extraction on Trn1

## Introduction

This notebook demonstrates how to compile and run a HuggingFace 🤗 Transformers GPT2 model for accelerated feature extraction on Neuron. This notebook will use the [`gpt2`](https://huggingface.co/gpt2) model, which is primarily used for text generation and feature extraction. 

This Jupyter notebook should be run on a Trn1 instance (`trn1.2xlarge` or larger).

## Install Dependencies
This tutorial requires the following pip packages:

- `torch-neuronx`
- `neuronx-cc`
- `transformers`

Most of these packages will be installed when configuring your environment using the Trn1 setup guide. The additional dependencies must be installed here:

In [1]:
# !pip install "transformers < 4.21.0"
# !pip install git+https://github.com/aws-neuron/transformers-neuronx.git transformers -U
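
Before compiling, it can help to confirm that the required packages are visible to the notebook kernel. The following is a minimal sketch, not part of the original notebook. The helper `find_missing` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions derived from the pip package names listed above (`neuronxcc` for `neuronx-cc` in particular).

```python
import importlib.util

# Import names (not pip names) assumed for the packages listed above.
# These are assumptions for illustration; adjust them to your environment.
REQUIRED_MODULES = ["torch_neuronx", "neuronxcc", "transformers"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

`importlib.util.find_spec` only locates a module without importing it, so this check is cheap and does not initialize the Neuron runtime.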

## Compile the model into an AWS Neuron optimized TorchScript

In the following section, we load the model and tokenizer, get a sample input, run inference on CPU, compile the model for Neuron using `torch_neuronx.trace()`, and save the optimized model as `TorchScript`.

`torch_neuronx.trace()` expects a tensor or tuple of tensor inputs to use for tracing, so we unpack the tokenizer output. Additionally, the input shape that's used during compilation must match the input shape that's used during inference. To handle this, we pad the inputs to the maximum size that we will see during inference. We also define a basic wrapper that ensures the `input_ids` and `attention_mask` kwargs are passed into the GPT2 model in the correct positions without requiring a dictionary.
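
To make the fixed-shape requirement concrete, here is a plain-Python sketch of what padding to a maximum length produces. The function `pad_to_max_length` is a hypothetical helper written for illustration, not the tokenizer's implementation. It assumes right-side padding and uses `50256` as the pad id, which is the GPT2 end-of-text token id that this notebook assigns as the padding token.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Truncate or right-pad `token_ids` to exactly `max_length` entries.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding positions.
    """
    kept = list(token_ids[:max_length])   # truncate if too long
    n_pad = max_length - len(kept)        # number of pad slots to fill
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Every input ends up with the same length, whatever the text length was.
ids, mask = pad_to_max_length([3041, 5372, 502], max_length=8, pad_id=50256)
print(ids)
print(mask)
```

Because every padded input has the same length, the shape seen at inference matches the shape used for tracing, and the attention mask marks which positions hold real tokens.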

In [1]:
from transformers_neuronx.gpt2.model import GPT2ForHuggingFaceSampling


In [1]:
import torch
import torch_neuronx
from transformers import GPT2Tokenizer, GPT2Model


# Create a wrapper to correctly order the inputs
class GPT2Neuron(torch.nn.Module):
    """
    Ensures that `input_ids` and `attention_mask` are passed into the GPT2
    model in the correct positions without requiring a dictionary.
    """

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask)


# Create the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Define the padding token value
gpt2 = GPT2Model.from_pretrained('gpt2', torchscript=True)
model = GPT2Neuron(gpt2)

model.eval()

# Get an example input
text = "Replace me by any text you'd like."

encoded_input = tokenizer(
    text,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

example = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Run inference on CPU
output_cpu = model(*example)

# Compile the model using the wrapper
model_neuron = torch_neuronx.trace(model, example)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

2023-05-21 06:37:48.000807: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/USER_neuroncc-2.6.0.19+3d819e565/MODULE_6212939987563883973/MODULE_0_SyncTensorsGraph.3_6212939987563883973_ip-10-0-12-88-19a174f5-101886-5fbe29143eebc/1244f319-60d9-4315-b893-dffcbe2ad49a/MODULE_0_SyncTensorsGraph.3_6212939987563883973_ip-10-0-12-88-19a174f5-101886-5fbe29143eebc.neff. Exiting with a successfully compiled graph
2023-05-21 06:37:49.000211: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/USER_neuroncc-2.6.0.19+3d819e565/MODULE_5991913339558389757/MODULE_1_SyncTensorsGraph.3_5991913339558389757_ip-10-0-12-88-4b94cd12-101886-5fbe2915c7752/35cffdbf-ee1b-439e-b1e0-b9a629dc18c2/MODULE_1_SyncTensorsGraph.3_5991913339558389757_ip-10-0-12-88-4b94cd12-101886-5fbe2915c7752.neff. Exiting with a successfully compiled graph
2023-05-21 06:37:49.000277: INFO ||NCC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/USER_neuroncc-2.6.0.19+3d819e56

## Run inference and compare results

In this section we load the compiled model, run feature extraction inference on Neuron, and compare the CPU and Neuron outputs.
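
Printing the first few values is a quick visual check, but a numeric summary makes mismatches such as NaN outputs easier to spot. The sketch below is one possible approach using only the standard library. The helper `compare_outputs` and the tolerance `atol=1e-2` are assumptions chosen for illustration, not values from this notebook. It operates on plain lists of floats, which could be obtained from a tensor slice with something like `.detach().tolist()`.

```python
import math


def compare_outputs(reference, candidate, atol=1e-2):
    """Summarize how closely `candidate` matches `reference`.

    Both arguments are equal-length sequences of floats. The default
    tolerance is an illustrative assumption, not a recommended value.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    nan_count = sum(1 for value in candidate if math.isnan(value))
    # Only compare positions where both values are real numbers.
    diffs = [
        abs(r - c)
        for r, c in zip(reference, candidate)
        if not (math.isnan(r) or math.isnan(c))
    ]
    max_abs_diff = max(diffs) if diffs else float("nan")
    matches = nan_count == 0 and max_abs_diff <= atol
    return {"nan_count": nan_count, "max_abs_diff": max_abs_diff, "matches": matches}


cpu_values = [0.1629, -0.2166, -0.1410]
nan_values = [float("nan")] * 3
print(compare_outputs(cpu_values, nan_values))
```

When working directly with tensors, `torch.isnan(t).any()` and `torch.allclose(a, b, atol=...)` cover the same checks without converting to lists.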

In [3]:
filename

'model.pt'

In [4]:
example

(tensor([[ 3041,  5372,   502,   416,   597,  2420,   345,  1549,   588,    13,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
          50256, 50256, 50256, 50256, 50

In [5]:
transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

NeuronModule(original_name=NeuronModule)

In [6]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*example)

# Compare the results
print(f"CPU outputs:    {output_cpu[0][0][0][:10]}")
print(f"Neuron outputs: {output_neuron[0][0][0][:10]}")

CPU outputs:    tensor([ 0.1629, -0.2166, -0.1410,  0.0061, -0.0623, -0.2181, -0.8142, -0.0920,
        -0.3586,  0.0676], grad_fn=<SliceBackward0>)
Neuron outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])


In [15]:
import torch
import torch_neuronx
import transformers
from transformers import GPT2Tokenizer, GPT2Model

In [42]:
gpt2 = GPT2Model.from_pretrained('gpt2', torchscript=True)

In [45]:
import torch
from transformers_neuronx.module import save_pretrained_split

class GPT2Neuron(torch.nn.Module):
    """
    Ensures that `input_ids` and `attention_mask` are passed into the GPT2
    model in the correct positions without requiring a dictionary.
    """

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask)


# Create the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Define the padding token value
gpt2 = GPT2Model.from_pretrained('gpt2', torchscript=True)
# model = GPT2Neuron(gpt2)
save_pretrained_split(gpt2, './gpt2-split')

In [50]:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gpt2.model import GPT2ForHuggingFaceSampling



# Load the split GPT2 checkpoint onto NeuronCores with 2-way tensor parallelism
# and keep float32 precision (amp='f32')
neuron_model = GPT2ForHuggingFaceSampling.from_pretrained('./gpt2-split', batch_size=1, tp_degree=2, amp='f32')
neuron_model.to_neuron()

In [16]:
class GPT2Neuron(torch.nn.Module):
    """
    Ensures that `input_ids` and `attention_mask` are passed into the GPT2
    model in the correct positions without requiring a dictionary.
    """

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids, attention_mask=attention_mask)


# Create the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token # Define the padding token value
gpt2 = GPT2Model.from_pretrained('gpt2', torchscript=True)
model = GPT2Neuron(gpt2)

model.eval()

GPT2Neuron(
  (model): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1,

In [22]:
# for t in model.parameters():
#     print(t.dtype)

In [23]:
# model = model.to(torch.bfloat16)
# transformers.modeling_utils.get_parameter_dtype(model)

In [24]:
# transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

In [26]:
# Cast the wrapped model's weights to bfloat16 before tracing
model.to(torch.bfloat16)
# Get an example input
text = "Replace me by any text you'd like."

encoded_input = tokenizer(
    text,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

example = (
    encoded_input['input_ids'],
    encoded_input['attention_mask'],
)

# Run inference on CPU
# output_cpu = model(*example)

# Compile the model using the wrapper
model_neuron = torch_neuronx.trace(model, example)

# Save the TorchScript for inference deployment
filename = 'model.pt'
torch.jit.save(model_neuron, filename)

In [40]:
# Load the TorchScript compiled model
model_neuron = torch.jit.load(filename)

# Run inference using the Neuron model
output_neuron = model_neuron(*example)

# Compare the results
# print(f"CPU outputs:    {output_cpu[0][0][0][:10]}")
print(f"Neuron outputs: {output_neuron[0][0][0][:10]}")

Neuron outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       dtype=torch.bfloat16)
