# Run Hugging Face `EleutherAI/gpt-j-6B` autoregressive sampling on Inf2 & Trn1 with Data Parallel

To get the most out of this tutorial and use 24 NeuronCores across three processes, use an Inf2.48xlarge or Trn1.32xlarge instance.
If you are using an Inf2.24xlarge, modify the last section to run only two processes (16 cores).

Verify that this Jupyter notebook is running the Python kernel environment that was set up according to the [PyTorch Installation Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). You can select the kernel from the 'Kernel -> Change Kernel' option at the top of this Jupyter notebook page.

## Set up the Jupyter Notebook

The following steps set up Jupyter Notebook and launch this tutorial:
1. Clone the [AWS Neuron Samples](https://github.com/aws-neuron/aws-neuron-samples) repo to your instance:
```
git clone https://github.com/aws-neuron/aws-neuron-samples.git
```
2. Navigate to the `transformers-neuronx` inference samples folder:
```
cd aws-neuron-samples/torch-neuronx/transformers-neuronx/inference
```
3. Follow the instructions in [Jupyter Notebook QuickStart](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html) to run Jupyter Notebook on your instance.
4. Locate this tutorial in your Jupyter Notebook session (`gptj-6b-sampling.ipynb`) and launch it. Follow the rest of the instructions in this tutorial. 

## Install Dependencies

This tutorial requires the following pip packages:

 - `torch-neuronx`
 - `neuronx-cc`
 - `transformers`
 - `transformers-neuronx`

Most of these packages will be installed when configuring your environment using the [torch-neuronx inference setup guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html#setup-torch-neuronx). The additional dependencies must be installed here:

In [None]:
!pip install transformers-neuronx transformers
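
A quick way to confirm that the packages listed above are visible to the active kernel is to query their installed versions. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any Neuron package.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the dependency list in this tutorial.
REQUIRED = ["torch-neuronx", "neuronx-cc", "transformers", "transformers-neuronx"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any entry reports `NOT INSTALLED`, check that the notebook is using the kernel created during the Neuron setup rather than the system Python.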

## Download and construct the model

We download and construct the `EleutherAI/gpt-j-6B` model using the Hugging Face `from_pretrained` method.

In [None]:
from transformers.models.auto import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-j-6B', low_cpu_mem_usage=True)

## Split the model state_dict into multiple files

To reduce host memory usage, it is recommended to save the model `state_dict` as
multiple files rather than as the single monolithic file produced by `torch.save`. This "split-format"
`state_dict` can be created using the `save_pretrained_split` function. With this checkpoint format,
the Neuron model loader can load parameters directly into the Neuron device high-bandwidth memory (HBM)
while keeping at most one layer of model parameters in CPU main memory.

To reduce memory usage during compilation and deployment, we cast the attention and MLP blocks, along with the language-model head, to `float16` precision before saving them. We keep the layernorms in `float32`. To do this, we implement a callback function that casts each layer in the model.

In [None]:
import torch
from transformers_neuronx.module import save_pretrained_split

def amp_callback(model, dtype):
    # cast attention, MLP, and the LM head to low precision; layernorms stay in float32
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)

amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './gptj-6b-split')
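
To see why the cast matters, the weight footprint can be estimated from the parameter count and the bytes per element. The sketch below assumes a round figure of 6 billion parameters for GPT-J 6B and ignores the small share of layernorm weights that stay in `float32`; the helper `weight_footprint_gib` is hypothetical and for illustration only.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_footprint_gib(num_params, dtype):
    """Approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Assumption: a round parameter count for GPT-J 6B, not an exact figure.
NUM_PARAMS = 6_000_000_000

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_footprint_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, casting to `float16` roughly halves the memory needed to hold the weights, both on the host while saving and on the device after loading.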

More cores can be used by running multiple processes (data parallel).

## Data Parallel Optimization for Throughput

This example shows that the same program can run in multiple processes. For example, running three processes with 8 cores each uses 24 cores, whereas two processes use only 16. Running more processes increases throughput. The code below uses a batch size of 64 in each process.

In [None]:
import time

import torch
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling

def load_model_infer():
    # load the model onto 8 NeuronCores (8-way tensor parallel); each process runs one copy
    load_compile_start = time.time()
    neuron_model = GPTJForSampling.from_pretrained('./gptj-6b-split', n_positions=1024, batch_size=64, tp_degree=8, amp='f16')
    neuron_model.to_neuron()
    load_compile_elapsed = time.time() - load_compile_start
    print(f'Model load & compile time in a single process: {load_compile_elapsed} seconds')

    # construct a tokenizer and encode prompt text
    tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6B')

    batch_prompts = [
        "I am specialized at sentence generation language models,", 
    ]
    batch_prompts = batch_prompts * 64

    input_ids = torch.as_tensor([tokenizer.encode(text) for text in batch_prompts])


    with torch.inference_mode():
        # warmup
        generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
        
        start = time.time()
        for i in range(2):
            generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
        elapsed = (time.time() - start) / 2

        generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
    print(f'Average latency for one inference: {elapsed} seconds')
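
The average latency that `load_model_infer` prints can be converted into a throughput figure. The sketch below is illustrative only: the helper `throughput` is hypothetical, it assumes every process sees the same latency, and the number of newly generated tokens per sequence is a value the reader must supply (it depends on the prompt length and on how `sample` interprets `sequence_length`).

```python
def throughput(batch_size, latency_s, num_processes=1, new_tokens_per_sequence=None):
    """Return (sequences per second, tokens per second or None).

    Assumes all processes run concurrently with the same per-batch latency.
    """
    seq_per_s = num_processes * batch_size / latency_s
    tok_per_s = None
    if new_tokens_per_sequence is not None:
        tok_per_s = seq_per_s * new_tokens_per_sequence
    return seq_per_s, tok_per_s

# Hypothetical numbers for illustration; these are not measured results.
seqs, toks = throughput(batch_size=64, latency_s=32.0, num_processes=3,
                        new_tokens_per_sequence=1000)
print(f"{seqs:.1f} sequences/s, {toks:.0f} tokens/s")
```

Measuring steady-state latency inside each process, as the function above does after its warmup call, keeps compilation time out of the throughput estimate.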


In [None]:
# If the Neuron runtime is busy, shut down any other running notebooks and retry.
import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling
from multiprocessing import Process
if __name__ == '__main__':
    # each child process inherits this setting and claims 8 NeuronCores
    os.environ['NEURON_RT_NUM_CORES'] = '8'
    total_start = time.time()
    p1 = Process(target=load_model_infer)
    p2 = Process(target=load_model_infer)
    p3 = Process(target=load_model_infer)
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()
    total_elapsed = time.time() - total_start
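    print(f'Total time for all processes, including compilation: {total_elapsed} seconds')
    # total_elapsed includes model load, compilation, and warmup,
    # so this figure understates steady-state throughput
    print(f'TPS {(30/total_elapsed)*64}')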
    p1.terminate()
    p2.terminate()
    p3.terminate()
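
The three hand-written `Process` objects above can be generalized so that the process count follows from the number of available cores, which makes it easier to switch between the two-process and three-process setups. This is a minimal sketch using only the standard library; `num_processes_for` and `launch_workers` are hypothetical helper names, and the core counts are the ones quoted in this tutorial.

```python
from multiprocessing import Process

def num_processes_for(total_cores, cores_per_process):
    """How many data-parallel processes fit on the available cores."""
    if cores_per_process <= 0:
        raise ValueError("cores_per_process must be positive")
    return total_cores // cores_per_process

def launch_workers(target, num_processes):
    """Start num_processes copies of target, wait for all, return exit codes."""
    procs = [Process(target=target) for _ in range(num_processes)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]

# 24 cores with 8 cores per process gives three processes; 16 cores gives two.
print(num_processes_for(24, 8), num_processes_for(16, 8))
```

In the notebook, a call such as `launch_workers(load_model_infer, num_processes_for(24, 8))` would stand in for the explicit `p1`, `p2`, `p3` code, with `NEURON_RT_NUM_CORES` still set before the processes start.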