Commit d4e836e

Merge branch 'main' into adithyare/unfused_lora

arendu committed Apr 23, 2024
2 parents df45322 + d9f2077
Showing 38 changed files with 2,782 additions and 268 deletions.
3 changes: 2 additions & 1 deletion Dockerfile
@@ -67,6 +67,7 @@ WORKDIR /workspace/
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout 36e9b6bf3d8034b10c9bbd9fc357c2df2bd1515c && \
+git cherry-pick -n e69187bc3679ea5841030a165d587bb48b56ee77 && \
pip install .

# Performance optimizations for distributed optimizer: https://github.com/NVIDIA/apex/pull/1771
@@ -133,7 +134,7 @@ RUN pip install flash-attn
# install numba for latest containers
RUN pip install numba>=0.57.1
# install ammo
-RUN pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir
+RUN pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir

# copy nemo source into a scratch image
FROM scratch as nemo-src
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -97,7 +97,7 @@ pipeline {

stage('AMMO installation') {
steps {
-      sh 'pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir'
+      sh 'pip install nvidia-ammo~=0.9.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir'
}
}

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -60,6 +60,7 @@ For more information, browse the developer docs for your area of interest in the
nlp/models
nlp/machine_translation/machine_translation
nlp/megatron_onnx_export
+nlp/quantization
nlp/api


46 changes: 31 additions & 15 deletions docs/source/nlp/quantization.rst
@@ -3,35 +3,35 @@
Quantization
==========================

-Post Training Quantization (PTQ)
+Post-Training Quantization (PTQ)
--------------------------------

-PTQ enables deploying a model in a low-precision format -- FP8, INT4 or INT8 -- for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant and INT4 AWQ.
+PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 -- for efficient serving. Available quantization methods include FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.

Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.

In NeMo, quantization is enabled by the Nvidia AMMO library -- a unified algorithmic model optimization & deployment toolkit.

The quantization process consists of the following steps:

-1. Loading a model checkpoint using appropriate parallelism strategy for evaluation
+1. Loading a model checkpoint using an appropriate parallelism strategy
2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
-3. Producing output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
+3. Producing an output directory or .qnemo tarball with the model config (json), quantized weights (safetensors), and tokenizer config (yaml).

-Loading models requires using AMMO spec defined in `megatron.core.deploy.gpt.model_specs module <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/deploy/gpt/model_specs.py>`_. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also soon to be the part of NeMo project and ``nemo.deploy`` and ``nemo.export`` modules, see https://github.com/NVIDIA/NeMo/pull/8690.
+Loading models requires using an AMMO spec defined in the `megatron.core.inference.gpt.model_specs <https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/inference/gpt/model_specs.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in the NeMo project via the ``nemo.deploy`` and ``nemo.export`` modules.

The quantization algorithm can also be set to ``"null"`` to perform only the weights export step using the default precision for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
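
For instance, a weights-only baseline export can be requested by overriding the algorithm on the command line (a minimal sketch reusing the script and checkpoint from the example below; the ``model_save`` name is illustrative):

.. code-block:: bash

    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_llama_quantization.py \
        model_file=llama2-70b-base-bf16.nemo \
        tensor_model_parallel_size=8 \
        quantization.algorithm=null \
        model_save=llama2-70b-base-baseline.qnemo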


Example
^^^^^^^
-The example below shows how to quantize the Llama2 70b model into FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving using 2 GPUs specified with ``export.inference_tensor_parallel`` parameter.
+The example below shows how to quantize the Llama2 70b model into FP8 precision, using tensor parallelism of 8 on a single DGX H100 node. The quantized model is intended for serving on 2 GPUs, as specified with the ``export.inference_tensor_parallel`` parameter.

-The script should be launched correctly with the number of processes equal to tensor parallelism. This is achieved with the ``mpirun`` command below.
+The script must be launched with the number of processes equal to the tensor parallelism. This is achieved with the ``torchrun`` command below:

.. code-block:: bash

-    mpirun -n 8 python examples/nlp/language_modeling/megatron_llama_quantization.py \
+    torchrun --nproc-per-node 8 examples/nlp/language_modeling/megatron_llama_quantization.py \
         model_file=llama2-70b-base-bf16.nemo \
         tensor_model_parallel_size=8 \
         pipeline_model_parallel_size=1 \
@@ -57,31 +57,47 @@ The output directory stores the following files:
└── tokenizer_config.yaml
-The TensorRT-LLM engine can be build with ``trtllm-build`` command, see `TensorRT-LLM documentation <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#fp8-post-training-quantization>`_.
+The TensorRT-LLM engine can be conveniently built and run using the ``TensorRTLLM`` class available in the ``nemo.export`` submodule:

+.. code-block:: python
+
+    from nemo.export import TensorRTLLM
+
+    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
+    trt_llm_exporter.export(
+        nemo_checkpoint_path="llama2-70b-base-fp8-qnemo",
+        model_type="llama",
+    )
+    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
+Alternatively, the engine can be built directly with the ``trtllm-build`` command; see the `TensorRT-LLM documentation <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#fp8-post-training-quantization>`_:

.. code-block:: bash

     trtllm-build \
         --checkpoint_dir llama2-70b-base-fp8-qnemo \
-        --output_dir engine_dir \
+        --output_dir /path/to/trt_llm_engine_folder \
         --max_batch_size 8 \
         --max_input_len 2048 \
-        --max_output_len 512
+        --max_output_len 512 \
+        --strongly_typed
Known issues
^^^^^^^^^^^^
-* Currently in NeMo quantizing and building TensorRT-LLM engines is limited to single-node use cases.
-* Supported and tested model family is Llama2. Quantizing other model types is experimental and may not be fully supported.
-* For INT8 SmoothQuant ``quantization.algorithm=int8_sq``, the TensorRT-LLM engine cannot be build with CLI ``trtllm-build`` command -- Python API and ``tensorrt_llm.builder`` should be used instead.
+* Currently in NeMo, quantizing and building TensorRT-LLM engines is limited to single-node use cases.
+* The supported and tested model family is Llama2. Quantizing other model types is experimental and may not be fully supported.


Please refer to the following papers for more details on quantization techniques.

References
----------

`Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation, 2020 <https://arxiv.org/abs/2004.09602>`_

`FP8 Formats for Deep Learning, 2022 <https://arxiv.org/abs/2209.05433>`_

`SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022 <https://arxiv.org/abs/2211.10438>`_
examples/nlp/language_modeling/conf/megatron_llama_quantization.yaml
@@ -25,12 +25,13 @@ quantization:
   algorithm: fp8 # int8_sq, fp8, int8, int4_awq, null
   calib_dataset: cnn_dailymail # wikitext, cnn_dailymail, or a local dataset
   num_calib_size: 512 # number of samples used for calibration
+  awq_block_size: 128 # block size for scaling factors in AWQ algorithm

 export:
   decoder_type: llama # gptnext, gpt2, llama
   inference_tensor_parallel: 1 # Default using 1 TP for inference
   inference_pipeline_parallel: 1 # Default using 1 PP for inference
   dtype: 16 # Default precision data type
+  export_tensorrt_llm_config: true # export config to build TRT-LLM engine directly

 model_file: llama2-7b-fp16.nemo # Nemo file path
 model_save: llama2-7b-fp8.qnemo # Path where the quantized model will be saved
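
Any field in this config can be overridden from the command line with Hydra-style dot notation when launching the quantization script documented above; a sketch (the INT4 AWQ ``model_save`` name here is illustrative, not from this commit):

.. code-block:: bash

    python examples/nlp/language_modeling/megatron_llama_quantization.py \
        model_file=llama2-7b-fp16.nemo \
        model_save=llama2-7b-int4-awq.qnemo \
        quantization.algorithm=int4_awq \
        quantization.awq_block_size=128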
44 changes: 21 additions & 23 deletions examples/nlp/language_modeling/conf/megatron_retro_inference.yaml
@@ -3,42 +3,40 @@ inference:
   top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
   top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
   temperature: 1.0 # sampling temperature
-  add_BOS: True # add the bos token at the begining of the prompt
+  add_BOS: False # add the bos token at the beginning of the prompt
   tokens_to_generate: 30 # The maximum number of tokens to generate.
   all_probs: False # whether to return the log prob for all the tokens in vocab
   repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
   min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
   compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+  end_strings: ["<|endoftext|>"] # generation will stop when one of these tokens is generated
+  # RETRO-specific arguments
+  retro_inference:
+    retro_gpt_retrieved_length: 128
+    retro_num_neighbors: 2
+    ft_neighbours: 0
+    reuse_top: False

 trainer:
   devices: 1
   num_nodes: 1
   accelerator: gpu
   logger: False # logger provided by exp_manager
-  precision: 16 # 16, 32, or bf16
-
-inference_batch_size: 2
+  precision: 32 # 16, 32, or bf16
+  use_distributed_sampler: False

 tensor_model_parallel_size: -1
 pipeline_model_parallel_size: -1
 pipeline_model_parallel_split_rank: -1 # used for encoder and decoder models (0 for others)
-retro_model_file: null # RETRO nemo file path
+megatron_amp_O2: False # Enable O2-level automatic mixed precision to save memory

-use_predict_method: False # whether to use the predict method
+retro_model_file: null # Retro nemo file path
+checkpoint_dir: null # checkpoint file dir; used to load the PTL checkpoint generated during Retro training
+checkpoint_name: null # PTL checkpoint file name, only used for PTL checkpoint loading
+hparams_file: null # model configuration file, only used for PTL checkpoint loading

-prompts: # prompts for RETRO model inference
-  - "hello,"
-  - "good morning,"
-  - "good afternoon,"
-  - "good evening,"
-
-########### Faiss service parameters ########
-retrieval_service:
-  strategy: RetroModelTextGenerationStrategy # choose customized inference strategy
-  neighbors: 4
-  frequent_query: False # for the current token generation, frequently update the retrieval context. If false, update it every 64 tokens
-  pad_tokens: True # pad the tokens at the beginning to make it a minimum of 64 tokens for retrieving at least once
-  store_retrieved: False # whether to store the retrieved documents, so they can be checked
-  combo_service:
-    service_ip: '0.0.0.0'
-    service_port: 17181
+# RETRO inference
+prompt: "sample prompt"
+neighbors:
+  - "neighbor text 1"
+  - "neighbor text 2"
46 changes: 46 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_retro_inference_legacy.yaml
@@ -0,0 +1,46 @@
# (This inference config for native NeMo RETRO will soon be deprecated. For the new mcore RETRO inference config, see ./megatron_retro_inference.yaml)

inference:
  greedy: False # Whether or not to use sampling; use greedy decoding otherwise
  top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
  top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
  temperature: 1.0 # sampling temperature
  add_BOS: True # add the bos token at the beginning of the prompt
  tokens_to_generate: 30 # The maximum number of tokens to generate.
  all_probs: False # whether to return the log prob for all the tokens in vocab
  repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
  min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
  compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False


trainer:
  devices: 1
  num_nodes: 1
  accelerator: gpu
  logger: False # logger provided by exp_manager
  precision: 16 # 16, 32, or bf16

inference_batch_size: 2
tensor_model_parallel_size: -1
pipeline_model_parallel_size: -1
pipeline_model_parallel_split_rank: -1 # used for encoder and decoder models (0 for others)
retro_model_file: null # RETRO nemo file path

use_predict_method: False # whether to use the predict method

prompts: # prompts for RETRO model inference
  - "hello,"
  - "good morning,"
  - "good afternoon,"
  - "good evening,"

########### Faiss service parameters ########
retrieval_service:
  strategy: RetroModelTextGenerationStrategy # choose customized inference strategy
  neighbors: 4
  frequent_query: False # for the current token generation, frequently update the retrieval context. If false, update it every 64 tokens
  pad_tokens: True # pad the tokens at the beginning to make it a minimum of 64 tokens for retrieving at least once
  store_retrieved: False # whether to store the retrieved documents, so they can be checked
  combo_service:
    service_ip: '0.0.0.0'
    service_port: 17181
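
The legacy flow additionally expects the Faiss-backed retrieval service above to be running and reachable at the configured ``combo_service`` address; a launch might then look like this (a sketch; the legacy script name is an assumption):

.. code-block:: bash

    python examples/nlp/language_modeling/megatron_retro_eval_legacy.py \
        retro_model_file=/path/to/retro_model.nemo \
        retrieval_service.neighbors=4 \
        'prompts=["hello,"]'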
40 changes: 40 additions & 0 deletions examples/nlp/language_modeling/conf/megatron_retro_qatask.yaml
@@ -0,0 +1,40 @@
inference:
  greedy: False # Whether or not to use sampling; use greedy decoding otherwise
  top_k: 0 # The number of highest probability vocabulary tokens to keep for top-k-filtering.
  top_p: 0.9 # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
  temperature: 1.0 # sampling temperature
  add_BOS: False # add the bos token at the beginning of the prompt
  tokens_to_generate: 30 # The maximum number of tokens to generate.
  all_probs: False # whether to return the log prob for all the tokens in vocab
  repetition_penalty: 1.2 # The parameter for repetition penalty. 1.0 means no penalty.
  min_tokens_to_generate: 0 # The minimum length of the sequence to be generated.
  compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
  end_strings: ["<|endoftext|>"] # generation will stop when one of these tokens is generated
  # RETRO-specific arguments
  retro_inference:
    retro_gpt_retrieved_length: 128
    retro_num_neighbors: 2
    ft_neighbours: 0
    reuse_top: False

trainer:
  devices: 1
  num_nodes: 1
  accelerator: gpu
  logger: False # logger provided by exp_manager
  precision: 32 # 16, 32, or bf16
  use_distributed_sampler: False

tensor_model_parallel_size: -1
pipeline_model_parallel_size: -1
pipeline_model_parallel_split_rank: -1 # used for encoder and decoder models (0 for others)
megatron_amp_O2: False # Enable O2-level automatic mixed precision to save memory

retro_model_file: null # Retro nemo file path
checkpoint_dir: null # checkpoint file dir; used to load the PTL checkpoint generated during Retro training
checkpoint_name: null # PTL checkpoint file name, only used for PTL checkpoint loading
hparams_file: null # model configuration file, only used for PTL checkpoint loading

# QA tasks
qa_file_path: null
pred_file_path: null
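
The two ``QA tasks`` paths are the only task-specific inputs: a question file to read and a prediction file to write. A launch might look like the following (a sketch; the script name is an assumption based on the config name, and the file paths are illustrative):

.. code-block:: bash

    python examples/nlp/language_modeling/megatron_retro_qatask_eval.py \
        retro_model_file=/path/to/retro_model.nemo \
        qa_file_path=/path/to/questions.jsonl \
        pred_file_path=/path/to/predictions.jsonl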
