# P-tuning High-Performance Inference Deployment with FasterTransformer and Triton Inference Server

This notebook walks you through the process of deploying a p-tuned model trained by NeMo using FasterTransformer and the Triton inference server for high-performance inference.


## Prerequisites

1. A p-tuning model trained with NeMo. Refer to [Multitask_Prompt_and_PTuning.ipynb](Multitask_Prompt_and_PTuning.ipynb) and our blog [How to Create a Custom Language Model](https://developer.nvidia.com/blog/how-to-create-a-custom-language-model/) for examples and walk-through guidance.
2. Access to a compatible NeMo GPT3 checkpoint. There are currently several public NeMo GPT3 checkpoints on HuggingFace:

- 5B: https://huggingface.co/nvidia/nemo-megatron-gpt-5B
- 20B: https://huggingface.co/nvidia/nemo-megatron-gpt-20B

You should ensure the same model is used both for training and inference. 

3. You will need 2x NVIDIA Volta, Ampere, or Hopper GPUs to work with the 5B model, and 4x NVIDIA Volta, Ampere, or Hopper GPUs to work with the 20B model.
4. This notebook was tested with the NeMo 23.02 container, but you can also try later releases should they become available. Download and run this container with:

```
docker run --gpus=all -u $(id -u ${USER}):$(id -g ${USER}) --rm -it --net=host nvcr.io/nvidia/nemo:23.02 bash
```

Then from within the container interactive bash environment, start Jupyter lab:
```
cd /myworkspace
jupyter lab --ip 0.0.0.0 --allow-root --port=8888
```

From within the Jupyter lab environment, you can upload this notebook.

## P-tuning customization recap

P-tuning is a parameter-efficient customization technique in which a small auxiliary model (usually an LSTM or MLP), called the prompt encoder, learns a set of “soft tokens”. These soft tokens (also known as virtual tokens) are usually prepended to the user’s prompt to form the complete input to a base LLM. During training, the base LLM weights are frozen and only the prompt encoder weights are learned. After training, the prompt encoder is discarded and only the final soft prompt tokens are retained for deployment. Think of the soft prompt tokens as akin to an instruction, such as “Summarize the following article”, except that they are not expressed in natural language but live in a continuous embedding space, where they are learned and optimized for the task at hand.


The nature of the p-tuning technique implies that the same base model can serve both customized and non-customized use cases. It handles the two types of requests in the same fashion:
- Client 1 sends a regular request with a tokenized prompt, in the form of integer indices. The model uses these indices and its embedding table to map tokens to a continuous input.
- Client 2 sends a request for a p-tuned model. The p-tuned model is in fact not deployed on the server side. Instead, the client sends a prompt that contains the prepended virtual token IDs, together with a mini embedding table containing the learned embeddings of those soft tokens.

With this paradigm, the p-tuning inference logic is handled primarily at the client side, or a middleman server. At the backend, the base model mostly just does “business as usual”, serving both customized and non-customized requests in an almost identical manner. 
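The two request types above can be illustrated with a toy embedding lookup. This is a conceptual sketch, not FasterTransformer's actual implementation: it assumes virtual tokens are given IDs just past the end of the base vocabulary (which is what happens when the tokenizer is patched in section 3.2), and the table sizes and the `embed` helper are hypothetical.

```python
import numpy as np

VOCAB_SIZE = 8  # toy vocabulary size (hypothetical; the real vocabulary is far larger)
HIDDEN = 4      # toy hidden size (hypothetical)

rng = np.random.default_rng(0)
# Server-side embedding table of the base model.
base_table = rng.normal(size=(VOCAB_SIZE, HIDDEN)).astype(np.float32)
# Client-supplied mini table holding three learned soft-token embeddings.
prompt_table = rng.normal(size=(3, HIDDEN)).astype(np.float32)


def embed(token_ids, base_table, prompt_table=None):
    """Look up regular IDs in the base table and virtual IDs in the request's mini table."""
    vocab_size = base_table.shape[0]
    token_ids = np.asarray(token_ids)
    out = np.empty((len(token_ids), base_table.shape[1]), dtype=base_table.dtype)
    is_virtual = token_ids >= vocab_size
    out[~is_virtual] = base_table[token_ids[~is_virtual]]
    if is_virtual.any():
        if prompt_table is None:
            raise ValueError("virtual token IDs require a prompt table in the request")
        # Virtual ID k maps to row (k - vocab_size) of the mini table.
        out[is_virtual] = prompt_table[token_ids[is_virtual] - vocab_size]
    return out


# Client 1: regular request, vocabulary IDs only.
regular = embed([1, 5, 2], base_table)

# Client 2: p-tuned request, three virtual IDs followed by the same prompt.
ptuned = embed([8, 9, 10, 1, 5, 2], base_table, prompt_table)

print(regular.shape, ptuned.shape)
```

In both cases the base model receives a plain sequence of embedding vectors, which is why the server can treat the two requests almost identically.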


## 1. Download public NeMo-Megatron 5B GPT model from HuggingFace

The code below downloads the 5B model.

In [None]:
%%bash
apt update && apt install -y git-lfs
git lfs install
rm -rf nemo-megatron-gpt-5B
git clone https://huggingface.co/nvidia/nemo-megatron-gpt-5B

## 2. Deploy foundation model with Triton

The first step is to build a Triton container with the FasterTransformer backend. 

### 2.1 Build Triton FT backend

Follow the instructions at `https://github.com/triton-inference-server/fastertransformer_backend` to build a Triton container with the FasterTransformer backend.

The commands below should be executed from the base OS environment, not from within a Docker container.

```
git clone https://github.com/triton-inference-server/fastertransformer_backend

export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

cd fastertransformer_backend
python3 docker/create_dockerfile_and_build.py --triton-version 22.12
```

Upon success, a docker image named `tritonserver_with_ft:latest` will be created. 

Once ready, execute the following commands, also from the base OS environment, to convert the .nemo checkpoint to a FasterTransformer-compatible format.

We first clone the FasterTransformer library.
```
git clone https://github.com/NVIDIA/FasterTransformer
```

Next, use FasterTransformer code to convert the model.

```
mkdir -p <Path_to_model_repository>

docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v <Path_to_FasterTransformer>:/FasterTransformer \
    -v <Path_to_nemo-megatron-gpt>:/checkpoints \
    -v <Path_to_model_repository>:/model_repository \
    tritonserver_with_ft:latest \
    bash -c "export PYTHONPATH=/FasterTransformer:${PYTHONPATH} && pip install nemo_toolkit['all'] && wget https://raw.githubusercontent.com/triton-inference-server/fastertransformer_backend/main/all_models/gpt/fastertransformer/config.pbtxt && \
    python3 /FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt5B_bf16_tp2.nemo \
        --infer-gpu-num 2 \
        --saved-dir /model_repository/gpt3_5b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0"

```

Replace the following placeholders with absolute paths:
- `<Path_to_FasterTransformer>`: absolute path to the FasterTransformer directory
- `<Path_to_nemo-megatron-gpt>`: absolute path to the NeMo Megatron GPT model
- `<Path_to_model_repository>`: absolute path to an empty directory that stores the output of the conversion. Note: ensure that the container can write to this directory.


In addition, if you are using the GPT3 20B model, set `--infer-gpu-num` to the tensor-parallel (TP) degree (typically TP=4 for the 20B model).

### 2.2 Fix configuration

For the 5B model, overwrite the model configuration with the following config file. Note that `<Path_to_model_repository>` is the absolute path to the directory that stores the output of the conversion.

In [None]:
%%writefile <Path_to_model_repository>/gpt3_5b/config.pbtxt
name: "gpt3_5b"
max_batch_size: 256
input {
  name: "input_ids"
  data_type: TYPE_UINT32
  dims: -1
}
input {
  name: "input_lengths"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
}
input {
  name: "request_output_len"
  data_type: TYPE_UINT32
  dims: -1
}
input {
  name: "runtime_top_k"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "runtime_top_p"
  data_type: TYPE_FP32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "beam_search_diversity_rate"
  data_type: TYPE_FP32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "temperature"
  data_type: TYPE_FP32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "len_penalty"
  data_type: TYPE_FP32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "repetition_penalty"
  data_type: TYPE_FP32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "random_seed"
  data_type: TYPE_UINT64
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "is_return_log_probs"
  data_type: TYPE_BOOL
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "is_return_context_embeddings"
  data_type: TYPE_BOOL
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "beam_width"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "start_id"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "end_id"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "stop_words_list"
  data_type: TYPE_INT32
  dims: 2
  dims: -1
  optional: true
}
input {
  name: "bad_words_list"
  data_type: TYPE_INT32
  dims: 2
  dims: -1
  optional: true
}
input {
  name: "prompt_learning_task_name_ids"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "request_prompt_embedding"
  data_type: TYPE_FP16
  dims: -1
  dims: -1
  optional: true
}
input {
  name: "request_prompt_lengths"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
input {
  name: "request_prompt_type"
  data_type: TYPE_UINT32
  dims: 1
  reshape {
  }
  optional: true
}
output {
  name: "output_ids"
  data_type: TYPE_UINT32
  dims: -1
  dims: -1
}
output {
  name: "sequence_length"
  data_type: TYPE_UINT32
  dims: -1
}
output {
  name: "cum_log_probs"
  data_type: TYPE_FP32
  dims: -1
}
output {
  name: "output_log_probs"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
output {
  name: "context_embeddings"
  data_type: TYPE_FP32
  dims: -1
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_CPU
}
default_model_filename: "2-gpu"
parameters {
  key: "data_type"
  value {
    string_value: "fp16"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value {
    string_value: "0"
  }
}
parameters {
  key: "int8_mode"
  value {
    string_value: "0"
  }
}
parameters {
  key: "model_checkpoint_path"
  value {
    string_value: "/model_repository/gpt3_5b/2-gpu"
  }
}
parameters {
  key: "model_type"
  value {
    string_value: "GPT"
  }
}
parameters {
  key: "pipeline_para_size"
  value {
    string_value: "1"
  }
}
parameters {
  key: "tensor_para_size"
  value {
    string_value: "2"
  }
}
backend: "fastertransformer"
model_transaction_policy {
}
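
Before deploying, it can help to confirm that the parallelism settings in this config agree with each other. The sketch below parses the `parameters` entries with a regular expression and reports the GPU count they imply. The helper names are hypothetical, and two points are assumptions: that the number of GPUs needed equals `tensor_para_size * pipeline_para_size`, and that the converter's output directory is named `<TP>-gpu`, as in the 5B example above.

```python
import re


def read_parameter(config_text, key):
    """Return the string_value of a `parameters { key: ... }` entry in config.pbtxt text."""
    pattern = r'key:\s*"' + re.escape(key) + r'"\s*value\s*\{\s*string_value:\s*"([^"]*)"'
    match = re.search(pattern, config_text)
    if match is None:
        raise KeyError(f"parameter {key!r} not found")
    return match.group(1)


def check_parallel_config(config_text):
    """Collect the parallelism settings and flag a mismatch with the checkpoint path."""
    tp = int(read_parameter(config_text, "tensor_para_size"))
    pp = int(read_parameter(config_text, "pipeline_para_size"))
    ckpt = read_parameter(config_text, "model_checkpoint_path")
    return {
        "tensor_para_size": tp,
        "pipeline_para_size": pp,
        # Assumption: total GPUs = tensor parallel size * pipeline parallel size.
        "gpus_needed": tp * pp,
        # Assumption: the checkpoint directory is named "<TP>-gpu", as in the 5B example.
        "path_matches_tp": ckpt.rstrip("/").endswith(f"/{tp}-gpu"),
    }


# Excerpt of the config above, used here as sample input.
sample = '''
parameters {
  key: "model_checkpoint_path"
  value {
    string_value: "/model_repository/gpt3_5b/2-gpu"
  }
}
parameters {
  key: "pipeline_para_size"
  value {
    string_value: "1"
  }
}
parameters {
  key: "tensor_para_size"
  value {
    string_value: "2"
  }
}
'''
report = check_parallel_config(sample)
print(report)
```

For the 20B model, the same check would be expected to report four GPUs once `tensor_para_size` and the checkpoint path are updated accordingly.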


### 2.3 Deploy model

Once converted, run the following command to deploy the model from the OS environment.

```
docker run --rm \
    --name triton-inference-server \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v <Path_to_model_repository>:/model_repository \
    tritonserver_with_ft:latest \
    bash -c 'export CUDA_VISIBLE_DEVICES=0,1 && \
    tritonserver --model-repository /model_repository'
```

Replace the following placeholder with an absolute path:
- `<Path_to_model_repository>`: absolute path to the model repository


Upon successful deployment, you should see that the model is loaded and ready to serve:

```
I0309 06:56:20.186185 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0309 06:56:20.186468 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0309 06:56:20.227948 1 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
```
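
Besides reading the log, readiness can be checked programmatically. The sketch below polls the health endpoints of Triton's HTTP API (`/v2/health/ready` and `/v2/models/<name>/ready`) using only the Python standard library. The `triton_ready` helper is hypothetical, and the default URL assumes the port mapping used in the `docker run` command above.

```python
import urllib.error
import urllib.request


def triton_ready(base_url="http://localhost:8000", model_name=None, timeout=2.0):
    """Return True if the server (or a specific model) reports ready, else False."""
    if model_name is None:
        path = "/v2/health/ready"
    else:
        path = f"/v2/models/{model_name}/ready"
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


# Example usage once the container is running:
#   triton_ready()                       # server-level readiness
#   triton_ready(model_name="gpt3_5b")   # readiness of the deployed model
```

A `False` result right after startup is not necessarily an error, since loading the model weights can take a while.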

## 3. Preprocessing at client side

Now that the Triton inference server is up and ready to serve the base model, it's time to set up the p-tuning logic on the client side. By and large, for each p-tuning task, we want to read the virtual tokens and prepend them to each request.

### 3.1 Read the virtual prompt config file and weights

First, we extract the NeMo model file and read the config and the weights. After unpacking the .nemo file, you should see its two main components:
- model_config.yaml: contains the model configuration
- model_weights.ckpt: contains the model weights, in a subdirectory named `mp_rank_xx`.

Here, we assume a `squad.nemo` model was trained according to [Multitask_Prompt_and_PTuning.ipynb](Multitask_Prompt_and_PTuning.ipynb).
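Since a .nemo file is a tar archive, its contents can be listed before extraction to confirm that both components are present. The following is a minimal sketch: the helper names are hypothetical, and a toy archive with placeholder files stands in for `squad.nemo` so that the example runs on its own.

```python
import os
import tarfile
import tempfile


def list_archive_members(nemo_path):
    """Return the member names of a .nemo archive (a tar file) without extracting it."""
    with tarfile.open(nemo_path, "r:*") as archive:
        return archive.getnames()


def members_named(names, filename):
    """Return the members whose last path component equals `filename`."""
    return [name for name in names if name.split("/")[-1] == filename]


# Build a toy archive that stands in for squad.nemo (file contents are placeholders).
with tempfile.TemporaryDirectory() as workdir:
    for relative in ("model_config.yaml", "mp_rank_00/model_weights.ckpt"):
        path = os.path.join(workdir, relative)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write("placeholder")
    toy_nemo = os.path.join(workdir, "toy.nemo")
    with tarfile.open(toy_nemo, "w") as archive:
        archive.add(os.path.join(workdir, "model_config.yaml"), arcname="model_config.yaml")
        archive.add(os.path.join(workdir, "mp_rank_00"), arcname="mp_rank_00")
    names = list_archive_members(toy_nemo)

print(members_named(names, "model_config.yaml"))
print(members_named(names, "model_weights.ckpt"))
```

With the real file, calling `list_archive_members("squad.nemo")` should reveal which `mp_rank_xx` subdirectory holds the weights.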

In [None]:
!mkdir squad-ptune-model
!tar -xvf 'squad.nemo' -C squad-ptune-model

In [None]:
!pip install omegaconf torch tritonclient[http] transformers

In [4]:
from omegaconf import OmegaConf
config = OmegaConf.load("squad-ptune-model/model_config.yaml")


The NeMo framework supports multi-task p-tuning, hence the configuration file can contain multiple tasks. We are only interested in the "squad" task here.

In [5]:
taskname = "squad"

for t in config['task_templates']:
    if t['taskname'] == taskname:
        template = t

In [6]:
print(template)

{'taskname': 'squad', 'prompt_template': '<|VIRTUAL_PROMPT_0|> Context: {context}\n\nQuestion: {question}\n\nAnswer:{answer}', 'total_virtual_tokens': 10, 'virtual_token_splits': [10], 'truncate_field': None, 'answer_only_loss': False, 'answer_field': 'answer'}
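
The lookup loop above leaves `template` undefined if no task matches. A slightly more defensive variant is sketched below; the `find_task_template` helper is hypothetical, and the toy list (including the second task name) merely stands in for `config['task_templates']`.

```python
def find_task_template(task_templates, taskname):
    """Return the template whose 'taskname' matches, or raise a descriptive error."""
    for entry in task_templates:
        if entry["taskname"] == taskname:
            return entry
    known = [entry["taskname"] for entry in task_templates]
    raise KeyError(f"task {taskname!r} not found; available tasks: {known}")


# Toy stand-in for config['task_templates']; the "sentiment" task is hypothetical.
toy_templates = [
    {"taskname": "sentiment", "total_virtual_tokens": 5},
    {"taskname": "squad", "total_virtual_tokens": 10},
]

toy_template = find_task_template(toy_templates, "squad")
print(toy_template)
```

In the notebook, the equivalent call would be `find_task_template(config['task_templates'], taskname)`.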


#### Read the virtual embedding table

In [8]:
import numpy as np
import torch

prompt_table = torch.load("squad-ptune-model/mp_rank_00/model_weights.ckpt", map_location=torch.device('cpu'))['prompt_table']
fp16_dtype = prompt_table[f'prompt_table.{taskname}.prompt_embeddings.weight'].to(torch.float16)
prompt_table[f'prompt_table.{taskname}.prompt_embeddings.weight'] = fp16_dtype


In [9]:
prompt_embedding = prompt_table[f'prompt_table.{taskname}.prompt_embeddings.weight']
prompt_length = prompt_embedding.shape[0]


In [10]:
prompt_embedding.shape

torch.Size([10, 4096])

In [11]:
PROMPT_TYPE = 3
request_prompt_lengths = prompt_length * np.ones([1, 1]).astype(np.uint32)
request_prompt_embedding = np.expand_dims(prompt_embedding, axis=0)
request_prompt_type = PROMPT_TYPE * np.ones([1, 1]).astype(np.uint32)
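
The three arrays above describe a batch of one. The sketch below generalizes the same layout to a larger batch, under the assumption (not confirmed by this notebook) that every row of the batch simply repeats the layout shown for a single request. The `build_prompt_inputs` helper is hypothetical, and a zero-filled array stands in for the learned embeddings.

```python
import numpy as np


def build_prompt_inputs(prompt_embedding, batch_size, prompt_type=3):
    """Build the three prompt-related request arrays for `batch_size` identical prompts."""
    embedding = np.asarray(prompt_embedding, dtype=np.float16)  # [prompt_len, hidden]
    prompt_len = embedding.shape[0]
    # [batch, prompt_len, hidden]: one copy of the soft-prompt table per request row.
    request_prompt_embedding = np.repeat(embedding[np.newaxis, ...], batch_size, axis=0)
    # [batch, 1] arrays, matching the dtypes declared in config.pbtxt (TYPE_UINT32).
    request_prompt_lengths = np.full((batch_size, 1), prompt_len, dtype=np.uint32)
    request_prompt_type = np.full((batch_size, 1), prompt_type, dtype=np.uint32)
    return request_prompt_embedding, request_prompt_lengths, request_prompt_type


# Zero-filled stand-in with the same shape as the squad prompt table, [10, 4096].
toy_embedding = np.zeros((10, 4096), dtype=np.float16)
emb, lengths, kinds = build_prompt_inputs(toy_embedding, batch_size=4)
print(emb.shape, lengths.shape, kinds.shape)
```

With `batch_size=1` this reproduces the shapes built in the cell above.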


### 3.2 Patch tokenizer

We take the GPT2 tokenizer, which is used in the NeMo Megatron model. We then patch this tokenizer with additional special tokens.

In [12]:
from transformers import GPT2Tokenizer

pseudo_tokens = [
    f'<prompt_{str(num)}>' for num in range(prompt_length)
]

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'additional_special_tokens': pseudo_tokens})

10

In [35]:
example = {'taskname': 'squad',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answer': 'Saint Bernadette Soubirous'}

In [20]:

prompt_template = template['prompt_template']
prompt = prompt_template.replace('<|VIRTUAL_PROMPT_0|>', ''.join(pseudo_tokens))

prompt = prompt.replace("{context}", example['context'])
prompt = prompt.replace("{question}", example['question'])
prompt = prompt.replace("{answer}", "")

print(prompt)
input_tokens = tokenizer.tokenize(prompt)
input_ids = tokenizer.convert_tokens_to_ids(input_tokens)

<prompt_0><prompt_1><prompt_2><prompt_3><prompt_4><prompt_5><prompt_6><prompt_7><prompt_8><prompt_9> Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Answer:


In [None]:
input_ids

As we can see, the extra virtual tokens are prepended at the beginning of the prompt.
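
The template-filling steps can be wrapped into one reusable function so that each incoming request is formatted the same way. This is a minimal sketch: the `build_prompt` helper is hypothetical, and a shortened example with only two pseudo tokens is used to keep the output readable.

```python
def build_prompt(prompt_template, pseudo_tokens, example, answer=""):
    """Fill a NeMo-style prompt template with pseudo tokens and the example's fields."""
    # Replace the virtual-prompt placeholder with the concatenated pseudo tokens.
    prompt = prompt_template.replace("<|VIRTUAL_PROMPT_0|>", "".join(pseudo_tokens))
    for field in ("context", "question"):
        prompt = prompt.replace("{" + field + "}", example[field])
    # At inference time the answer slot is left empty for the model to complete.
    return prompt.replace("{answer}", answer)


# The squad template from the model config, with a toy example.
squad_template = (
    "<|VIRTUAL_PROMPT_0|> Context: {context}\n\nQuestion: {question}\n\nAnswer:{answer}"
)
toy_pseudo_tokens = ["<prompt_0>", "<prompt_1>"]
toy_example = {"context": "The sky is blue.", "question": "What color is the sky?"}

toy_prompt = build_prompt(squad_template, toy_pseudo_tokens, toy_example)
print(toy_prompt)
```

The resulting string can then be tokenized with the patched tokenizer exactly as in the cell above.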

### 3.3 Send to Triton

Finally, with everything ready, we can put the prompt and the other hyperparameters into a proper Triton request and send it to the Triton inference server.

In [27]:
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

In [28]:
def fill_input(input_name: str, data) -> httpclient.InferInput:
    """
    Converts an input from a numpy array to a Triton compatible data type.

    Args:
        input_name: The name of the input parameter to send to Triton.
        data: The full data field as a numpy array.

    Returns:
        Returns the converted field as a Triton input.
    """
    infer_input = httpclient.InferInput(
        input_name,
        data.shape,
        np_to_triton_dtype(data.dtype)
    )
    infer_input.set_data_from_numpy(data)
    return infer_input


In [29]:
# LLM hyper parameters

topk=1
topp=0.9
temperature = 1.0
len_penalty = 1.0
repetition_penalty = 1.0
random_seed = 0
beam_width = 1
max_output_len = 32

In [30]:
input_start_ids = np.array([input_ids]).astype(np.uint32)
input_length = np.array([[len(input_start_ids[0])]]).astype(np.uint32)
output_len = np.ones_like(input_length).astype(np.uint32) * max_output_len

runtime_top_k = np.array([[topk]]).astype(np.uint32)
runtime_top_p = np.array([[topp]]).astype(np.float32)
beam_search_diversity_rate = np.array([[0.0]]).astype(np.float32)
temperature = np.array([[temperature]]).astype(np.float32)
len_penalty = np.array([[len_penalty]]).astype(np.float32)
repetition_penalty = np.array([[repetition_penalty]]).astype(np.float32)
random_seed = np.array([[random_seed]]).astype(np.uint64)
is_return_log_probs = np.array([[True]]).astype(bool)
beam_width = np.array([[beam_width]]).astype(np.uint32)
start_ids = np.array([[50256]]).astype(np.uint32)
end_ids = np.array([[50256]]).astype(np.uint32)
bad_words_list = np.concatenate([np.zeros([input_start_ids.shape[0], 1, 1]).astype(
    np.int32), (-1 * np.ones([input_start_ids.shape[0], 1, 1])).astype(np.int32)], axis=1)
stop_word_list = np.concatenate([np.zeros([input_start_ids.shape[0], 1, 1]).astype(
    np.int32), (-1 * np.ones([input_start_ids.shape[0], 1, 1])).astype(np.int32)], axis=1)

inputs = [
    fill_input("input_ids", input_start_ids),
    fill_input("input_lengths", input_length),
    fill_input("request_output_len", output_len),
    fill_input("runtime_top_k", runtime_top_k),
    fill_input("runtime_top_p", runtime_top_p),
    fill_input("beam_search_diversity_rate", beam_search_diversity_rate),
    fill_input("temperature", temperature),
    fill_input("len_penalty", len_penalty),
    fill_input("repetition_penalty", repetition_penalty),
    fill_input("random_seed", random_seed),
    fill_input("is_return_log_probs", is_return_log_probs),
    fill_input("beam_width", beam_width),
    fill_input("start_id", start_ids),
    fill_input("end_id", end_ids),
    fill_input("bad_words_list", bad_words_list),
    fill_input("stop_words_list", stop_word_list),
    fill_input("request_prompt_embedding", request_prompt_embedding),
    fill_input("request_prompt_lengths", request_prompt_lengths),
    fill_input("request_prompt_type", request_prompt_type)
]
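
The scalar inputs above all follow one pattern: a `[batch, 1]` array whose dtype matches the declaration in `config.pbtxt`. The sketch below captures that pattern in a lookup table, which also avoids rebinding the scalar hyperparameter names (re-running the cell above would wrap the arrays a second time). The `scalar_inputs` helper is hypothetical; the dtypes are copied from the config in section 2.2.

```python
import numpy as np

# numpy dtypes mirroring the input declarations in config.pbtxt.
INPUT_DTYPES = {
    "runtime_top_k": np.uint32,
    "runtime_top_p": np.float32,
    "beam_search_diversity_rate": np.float32,
    "temperature": np.float32,
    "len_penalty": np.float32,
    "repetition_penalty": np.float32,
    "random_seed": np.uint64,
    "is_return_log_probs": np.bool_,
    "beam_width": np.uint32,
    "start_id": np.uint32,
    "end_id": np.uint32,
}


def scalar_inputs(values, batch_size=1):
    """Turn {input_name: scalar} into {input_name: [batch, 1] array} with the right dtype."""
    arrays = {}
    for name, value in values.items():
        if name not in INPUT_DTYPES:
            raise KeyError(f"no dtype known for input {name!r}")
        arrays[name] = np.full((batch_size, 1), value, dtype=INPUT_DTYPES[name])
    return arrays


arrays = scalar_inputs({
    "runtime_top_k": 1,
    "runtime_top_p": 0.9,
    "temperature": 1.0,
    "random_seed": 0,
    "is_return_log_probs": True,
    "beam_width": 1,
    "start_id": 50256,
    "end_id": 50256,
})
for name, array in arrays.items():
    print(name, array.dtype, array.shape)
```

Each resulting array could then be passed to `fill_input(name, array)` in place of the hand-built ones.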


In [31]:
with httpclient.InferenceServerClient("localhost:8000") as client:
    result = client.infer("gpt3_5b", inputs)
    output = result.as_numpy('output_ids').squeeze()

In [32]:
output

array([21947,    25, 17340, 20221,    11,   262,  1524,   468,   257,
        7835,  2095,    13,  1629,   404,   262,  8774, 11819,   338,
        3869, 29500,   318,   257, 10861, 15207,   286,   262,  5283,
        5335,    13, 34528,   287,  2166,   286,   262,  8774, 11819,
         290,  6476,   340,    11,   318,   257, 15317, 15207,   286,
        1951,   351,  5101,   510, 49309,   351,   262,  8177,   366,
       37522,   578,  1215,  2185, 16543,  2516,  1911,  7406,   284,
         262,  8774, 11819,   318,   262, 32520,  3970,   286,   262,
       17380,  8894,    13, 34528,  2157,   262, 37792,  3970,   318,
         262, 10299, 33955,    11,   257, 37919,  1295,   286, 11443,
         290, 14580,    13,   632,   318,   257, 30069,   286,   262,
        7128, 33955,   379,   406,   454,  8906,    11,  4881,   810,
         262,  5283,  5335,  1128,  7241,   306,  4120,   284,  9281,
        6206,   324,  5857,   311, 12944,   343,   516,   287,  1248,
        3365,    13,

Lastly, we use the tokenizer to decode the Triton output.

In [33]:
response = tokenizer.decode(output)
response.replace("<|endoftext|>","")

'Context: Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.\n\nQuestion: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?\n\nAnswer:Saint Bernadette Soubirous'
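
The decoded response contains the original prompt followed by the generated answer. To keep only the answer, one option is to split on the `Answer:` marker from the prompt template. This is a sketch under the assumptions that the marker does not occur earlier in the context or question and that generation ends at the first `<|endoftext|>`; the `extract_answer` helper is hypothetical.

```python
def extract_answer(response, marker="Answer:", eos="<|endoftext|>"):
    """Return the text generated after the first `marker`, cut at the first end-of-text token."""
    _, found, tail = response.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in response")
    return tail.split(eos, 1)[0].strip()


toy_response = (
    "Context: The sky is blue.\n\n"
    "Question: What color is the sky?\n\n"
    "Answer:blue<|endoftext|><|endoftext|>"
)
toy_answer = extract_answer(toy_response)
print(toy_answer)
```

Applied to the response above, the same call would be expected to yield the text that follows `Answer:`.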