# Run Hugging Face `facebook/opt-13b` autoregressive sampling on Inf2 & Trn1

In this example we compile and deploy the Hugging Face [facebook/opt-13b](https://huggingface.co/facebook/opt-13b) model for tensor parallel inference on Neuron using the `transformers-neuronx` package.

The example has the following main sections:
1. Set up the Jupyter Notebook
1. Install dependencies
1. Download and construct the model
1. Split the model `state_dict` into multiple files
1. Perform autoregressive sampling using tensor parallelism

This Jupyter Notebook should be run on an Inf2 instance (`inf2.8xlarge` or larger) or a Trn1 instance (`trn1.32xlarge`).

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance using
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`facebook-opt-13b-sampling.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install dependencies

This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `transformers`
 - `transformers-neuronx`

Most of these packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/setup-inference.html). The additional dependencies must be installed here:

In [None]:
!pip install git+https://github.com/aws-neuron/transformers-neuronx.git transformers -U

## Download and construct the model

We download and construct the `facebook/opt-13b` model using the Hugging Face `from_pretrained` method.

In [1]:
from transformers.models.opt import OPTForCausalLM

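# Full-size checkpoint named in this tutorial's title:
# hf_model = OPTForCausalLM.from_pretrained('facebook/opt-13b', low_cpu_mem_usage=True)
# The smaller facebook/opt-6.7b checkpoint is what this cell actually loads.
hf_model = OPTForCausalLM.from_pretrained('facebook/opt-6.7b', low_cpu_mem_usage=True)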


## Split the model `state_dict` into multiple files

To reduce host memory usage, we recommend saving the model `state_dict` as
multiple files rather than as the single monolithic file produced by `torch.save`. This "split-format"
`state_dict` can be created with the `save_pretrained_split` function. With this checkpoint format,
the Neuron model loader can load parameters directly into the Neuron device high-bandwidth memory (HBM)
while keeping at most one layer of model parameters in CPU main memory.
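
The idea behind the split format can be illustrated without any Neuron-specific code. The sketch below is a conceptual stand-in, not the on-disk format that `save_pretrained_split` actually writes: it uses a plain dict and `pickle`, and the helper names `save_split` and `iter_split` are hypothetical. It writes one file per entry and then reads the entries back one at a time, so only a single entry needs to be resident in host memory while loading.

```python
import os
import pickle
import tempfile


def save_split(state, directory):
    """Write each entry of `state` to its own file (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    for index, (name, value) in enumerate(state.items()):
        path = os.path.join(directory, f"{index:05d}.pkl")
        with open(path, "wb") as f:
            pickle.dump((name, value), f)


def iter_split(directory):
    """Yield (name, value) pairs one file at a time, in the order they were saved."""
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename), "rb") as f:
            yield pickle.load(f)


# Toy "state_dict": lists of floats stand in for weight tensors.
toy_state = {
    "layers.0.fc1.weight": [0.1, 0.2, 0.3],
    "layers.0.fc2.weight": [0.4, 0.5],
    "layers.1.fc1.weight": [0.6],
}

with tempfile.TemporaryDirectory() as tmp:
    save_split(toy_state, tmp)
    num_files = len(os.listdir(tmp))
    # Stream the entries back; a real loader would copy each one to the device here.
    restored = dict(iter_split(tmp))

print(num_files, restored == toy_state)
```

A monolithic file would instead have to be deserialized in full before any parameter could be moved to the device, which is the host memory cost the split format avoids.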

To reduce memory usage during compilation and deployment, we cast the attention and MLP layers (and the language-model head) to `float16` precision before saving them. We keep the layernorms in `float32`. To do this, we implement a callback function that casts the relevant submodules of each decoder layer.

In [2]:
import torch
from transformers_neuronx.module import save_pretrained_split

def amp_callback(model, dtype):
    # cast attention and mlp to low precision only; layernorms stay as f32
    for block in model.model.decoder.layers:
        block.self_attn.to(dtype)
        block.fc1.to(dtype)
        block.fc2.to(dtype)
    model.lm_head.to(dtype)

amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './opt-6.7b-split')

## Perform autoregressive sampling using tensor parallelism

Now we have all of the necessary files for running `facebook/opt-13b` autoregressive sampling. 

To get a large language model working on Inf2 & Trn1, tensor parallelism is used to split weights and data across multiple NeuronCores. Each NeuronCore has 16 GB of memory. As a rule of thumb, a model cast to `float16` needs at least `2 * number of model parameters` bytes in total for its weights, and that total is divided across the NeuronCores used. In practice the requirement is often greater because of the key-value cache, which grows with sequence length. This memory usage determines the minimum viable instance size, since the amount of memory allocated on one NeuronCore is inversely proportional to the parallelism degree (`tp_degree`), which is bounded by the number of physical NeuronCores per instance. The parallelism degree must be chosen so that the memory usage per NeuronCore stays below the physical 16 GB limit. While this determines the minimum instance size, further decreasing the memory usage per NeuronCore by using a larger instance and a higher `tp_degree` should result in a faster model.
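
As a worked instance of this rule of thumb, the sketch below estimates the `float16` weight footprint and the smallest `tp_degree` that keeps the weights alone under 16 GB per NeuronCore. The helper names, the nominal parameter count of 13 billion, and the use of decimal gigabytes are assumptions made for illustration. The estimate ignores the key-value cache and runtime scratch space, so treat the result as a lower bound.

```python
import math

BYTES_PER_FLOAT16 = 2
NEURONCORE_MEMORY_BYTES = 16e9  # 16 GB per NeuronCore (decimal GB, an assumption)


def weight_bytes(num_parameters):
    """Total bytes needed to hold the weights in float16."""
    return num_parameters * BYTES_PER_FLOAT16


def min_tp_degree(num_parameters):
    """Smallest tp_degree whose per-core weight shard fits in one NeuronCore."""
    return math.ceil(weight_bytes(num_parameters) / NEURONCORE_MEMORY_BYTES)


num_parameters = 13e9  # nominal size of facebook/opt-13b (assumption)
total_gb = weight_bytes(num_parameters) / 1e9
for tp_degree in (1, 2, 8):
    print(f"tp_degree={tp_degree}: {total_gb / tp_degree:.2f} GB of weights per NeuronCore")
print("minimum tp_degree for weights alone:", min_tp_degree(num_parameters))
```

The degree that is actually usable may be higher than this bound, because the key-value cache adds to the per-core footprint and not every degree is supported for a given model.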

We will use the Neuron `OPTForSampling` class to implement tensor parallelism. The default model config supports sampling up to sequence length 2048, and we set the batch size to 2. Tensor parallelism is enabled through the argument `tp_degree=2`. Internally, the Neuron tensor manipulator will shard and duplicate tensors to multiple NeuronCores (2 in this case) to enable tensor-parallel computations on multiple NeuronCores. The model computational graph is compiled by `neuronx-cc` for optimized inference on Neuron.
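
To make the sharding idea concrete, here is a minimal pure-Python sketch of one common tensor-parallel pattern: splitting a linear layer's weight matrix by output columns. It is a conceptual illustration only and does not reflect how `transformers-neuronx` lays out tensors internally; `matmul`, `shard_columns`, and the toy matrices are hypothetical. Each shard computes its slice of the output from the same (duplicated) input, and concatenating the slices reproduces the unsharded result.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def shard_columns(w, tp_degree):
    """Split w into tp_degree matrices, each holding a contiguous block of columns."""
    num_columns = len(w[0])
    assert num_columns % tp_degree == 0, "columns must divide evenly across shards"
    width = num_columns // tp_degree
    return [[row[k * width:(k + 1) * width] for row in w] for k in range(tp_degree)]


x = [1.0, 2.0]                      # input activations, duplicated on every shard
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]          # full weight matrix (2 inputs, 4 outputs)

full_output = matmul(x, w)

# Each "core" holds one column shard and computes part of the output.
partial_outputs = [matmul(x, shard) for shard in shard_columns(w, tp_degree=2)]
# Gathering (concatenating) the partial outputs recovers the full result.
gathered = [value for part in partial_outputs for value in part]

print(full_output, gathered)
```

Each shard stores only half of the weight matrix here, which is why a higher degree lowers the per-core memory footprint.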

In [5]:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.opt.model import OPTForSampling

# load the split checkpoint saved above onto the NeuronCores with 2-way tensor parallelism
# amp='f16' runs the model in float16
neuron_model = OPTForSampling.from_pretrained('./opt-6.7b-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.to_neuron()



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
.....
Compiler status PASS


2023-May-20 11:58:07.0279 164578:165382 ERROR   ENC:enc_init_global_comm                    [nec_dev 0] global nec_comm is already init'd, g_device_id = 0, g_device_cnt = 2
2023-May-20 11:58:07.0279 164578:165382 ERROR   NRT:nrt_load_collectives                    failed to create global communicator, global_device_id=0, global_device_count=1, ROOT_COMM_ID=localhost:43223)


In [4]:
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-6.7b')
batch_prompts = [
    "Hello, I'm a language model,",
    "Welcome to Amazon Elastic Compute Cloud,",
]
# torch.as_tensor needs a rectangular batch, so every prompt must encode to the same number of tokens
input_ids = torch.as_tensor([tokenizer.encode(text) for text in batch_prompts])

with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

generated sequences ["</s>Hello, I'm a language model, and this post is gay.\nAre you by any chance the language model from the Matrix?\nYou are the first to point it out!\nHow come you never made your own username? Did you want to seem mysterious?  Now there's no need.\nMystery is always an attractive quality. I did have one before..\nI'm going to pretend we were close acquaintances, then.\nClose enough for your AMA.\nI am a newbie. Do AMA? I have no idea how the site functions.   And my work doesn't allow me to use computers.\nAMA means Ask Me Anything. I asked you questions, and you answered.\nThank you. I appreciate the help.\nHow can people respond faster than I do without computers?\nThis is a good question. I have no idea.\nI'd also like to know your answer.\nA computer is a computer, isn't it?\nIf you don't know your response, maybe it's time to go back to asking the question. Or, y'know, just use a computer, you idiot.\nIt is late, and I need to sleep. You people are rude.\nI 

Larger batch sizes won't fit into an `inf2.8xlarge` instance. This instance size has 32 GB of HBM, and `facebook/opt-13b` has ~26 GB of model parameters in `float16`. With batch size 3, after storing the model parameters and key-value caches, less than 1 GB of HBM is left, which is not enough to store code and the temporary data generated during the sampling computation.
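
The arithmetic behind this estimate can be sketched as follows. The architecture values used here (40 decoder layers and a hidden size of 5120) come from the published `facebook/opt-13b` configuration, while the nominal 13 billion parameter count, the decimal-gigabyte convention, and the function names are assumptions made for illustration. The key-value cache stores one key vector and one value vector of `HIDDEN_SIZE` elements per layer, per token, per sequence.

```python
BYTES_PER_FLOAT16 = 2
HBM_BYTES = 32e9          # 32 GB of HBM (decimal GB, an assumption)
NUM_PARAMETERS = 13e9     # nominal parameter count (assumption)
NUM_LAYERS = 40           # facebook/opt-13b decoder layers
HIDDEN_SIZE = 5120        # facebook/opt-13b hidden size
SEQUENCE_LENGTH = 2048


def kv_cache_bytes(batch_size):
    """Bytes for float16 keys and values across all layers at full sequence length."""
    keys_and_values = 2
    return (keys_and_values * NUM_LAYERS * batch_size * SEQUENCE_LENGTH
            * HIDDEN_SIZE * BYTES_PER_FLOAT16)


def remaining_gb(batch_size):
    """HBM left (in GB) after storing weights and the key-value cache."""
    weights = NUM_PARAMETERS * BYTES_PER_FLOAT16
    return (HBM_BYTES - weights - kv_cache_bytes(batch_size)) / 1e9


for batch_size in (1, 2, 3):
    print(f"batch_size={batch_size}: about {remaining_gb(batch_size):.2f} GB of HBM left")
```

Under these assumptions each sequence adds roughly 1.68 GB of key-value cache, so batch size 3 leaves a little under 1 GB.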

To use larger batch sizes, consider an `inf2.48xlarge` or a `trn1.32xlarge`. On either instance you can also try a larger tensor parallelism degree, such as 8. `facebook/opt-13b` has 40 attention heads, so the tensor parallelism degree must be a divisor of 40 and must be supported on the Inf2 or Trn1 instance.
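
A quick way to enumerate candidate degrees is to list the divisors of the head count that do not exceed the number of NeuronCores on the instance. In this sketch the helper name is hypothetical, and the NeuronCore counts (2 for `inf2.8xlarge`, 24 for `inf2.48xlarge`, 32 for `trn1.32xlarge`) are assumptions to check against the instance documentation. Dividing the head count evenly is necessary but may not be sufficient, since the runtime can restrict which degrees it supports.

```python
def candidate_tp_degrees(num_attention_heads, num_neuron_cores):
    """Divisors of the head count that fit within the available NeuronCores."""
    return [
        degree
        for degree in range(1, num_neuron_cores + 1)
        if num_attention_heads % degree == 0
    ]


NUM_ATTENTION_HEADS = 40  # facebook/opt-13b attention heads

# NeuronCore counts per instance type are assumptions for illustration.
for instance, cores in (("inf2.8xlarge", 2), ("inf2.48xlarge", 24), ("trn1.32xlarge", 32)):
    print(instance, candidate_tp_degrees(NUM_ATTENTION_HEADS, cores))
```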