Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


### Running Hugging Face with FP8 on Intel® Gaudi® - Text Generation

This example shows how to quantize a Hugging Face model from fp32 to fp8 with Intel Gaudi and the Optimum for Intel Gaudi (aka Optimum Habana) library.

Llama2-70b, Llama2-7b, Llama3-70b, Llama3-8b, Mixtral-8x7B, Falcon-7B, Falcon-40B, Falcon-180B, phi-2 and Llama3-405B are enabled in FP8 using the [Intel Neural Compressor (INC)](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html), which provides model measurement and quantization capabilities in PyTorch. Starting with the SynapseAI 1.17 / optimum-habana 1.13 release, INC is used by default for measurement and quantization. The Habana Quantization Toolkit (HQT), which was used previously, will be removed in a future release. To use HQT, disable INC by setting the environment variable `USE_INC=0`.
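As a minimal sketch of how `USE_INC=0` could be applied to a single launch without changing the notebook's own environment, the snippet below passes the variable only to a child process. The helper name `run_with_hqt` is hypothetical, and the child command here is a harmless stand-in that just echoes the variable; in practice it would be replaced by the measurement or generation command.

```python
import os
import subprocess
import sys


def run_with_hqt(cmd):
    """Run `cmd` with USE_INC=0 set in the child's environment only.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    env = dict(os.environ, USE_INC="0")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in child process that prints the variable it received.
result = run_with_hqt(
    [sys.executable, "-c", "import os; print(os.environ['USE_INC'])"]
)
print(result.stdout.strip())
```

Scoping the variable to one process keeps later cells on the default INC path.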

More information on enabling fp8 in SynapseAI is available here:
https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html



#### Install the Hugging Face Optimum Habana Library

In [None]:
#%cd ~/Gaudi-tutorials/PyTorch/Hugging_Face_pipelines/Benchmarking_on_Optimum-habana_with_fp8
%pip install optimum-habana==1.15.0

#### Clone the Hugging Face Optimum Habana Repository

In [None]:
!git clone -b v1.15.0 https://github.com/huggingface/optimum-habana.git
%cd optimum-habana/examples/text-generation

#### Install the Required Packages

In [None]:
!pip install -r requirements.txt
!pip install -r requirements_lm_eval.txt
!pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0

#### Measure the tensor quantization statistics 
Here is an example that measures the tensor quantization statistics for Llama 3.1 8B Instruct on 1 card:  
Change `model_name_or_path` to measure a different Llama model.  
Change `world_size` to run the measurement on multiple Gaudi cards. 
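To make the two adjustable settings explicit, here is a small sketch that assembles the measurement command from a model name and a card count. The function name `build_measure_command` and the default output file name are hypothetical; the script paths and flags are the ones used in the cell that follows.

```python
import shlex

QUANT_CONFIG = "./quantization_config/maxabs_measure_include_outputs.json"


def build_measure_command(model_name_or_path, world_size=1,
                          output_file="measure_results.txt"):
    """Return the measurement command as an argument list.

    Hypothetical helper; `output_file` is an illustrative default name.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return [
        "python", "../gaudi_spawn.py",
        "--use_deepspeed", "--world_size", str(world_size),
        "run_lm_eval.py",
        "-o", output_file,
        "--model_name_or_path", model_name_or_path,
        "--use_hpu_graphs",
        "--use_kv_cache",
        "--trim_logits",
        "--batch_size", "1",
        "--bf16",
        "--reuse_cache",
        "--use_flash_attention",
        "--flash_attention_recompute",
        "--flash_attention_causal_mask",
    ]


cmd = build_measure_command("meta-llama/Llama-3.1-8B-Instruct", world_size=8)
# Print a shell line that can be pasted after "!" in a notebook cell.
print(f"QUANT_CONFIG={QUANT_CONFIG} {shlex.join(cmd)}")
```

Building the argument list in one place makes it harder for the model name and card count to drift apart between the measurement and quantization runs.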

In [2]:
!QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python ../gaudi_spawn.py \
--use_deepspeed --world_size 1 run_lm_eval.py \
-o acc_llama3_8b_bs1_measure.txt \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--use_hpu_graphs \
--use_kv_cache \
--trim_logits \
--batch_size 1 \
--bf16 \
--reuse_cache \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask
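Conceptually, a max-abs measurement pass records, for each observed tensor, the largest absolute value seen across the calibration inputs. The sketch below illustrates that idea in plain Python. It is not INC's implementation; the class name `MaxAbsObserver` and the tensor names are hypothetical.

```python
from collections import defaultdict


class MaxAbsObserver:
    """Track the running maximum absolute value per named tensor.

    Illustrative only; a real measurement pass hooks into model layers.
    """

    def __init__(self):
        self.maxabs = defaultdict(float)

    def observe(self, name, values):
        # Largest magnitude in this batch (0.0 for an empty batch).
        batch_max = max((abs(v) for v in values), default=0.0)
        # Keep the largest magnitude seen so far for this tensor.
        self.maxabs[name] = max(self.maxabs[name], batch_max)


observer = MaxAbsObserver()
observer.observe("layer0.input", [0.5, -3.0, 1.25])
observer.observe("layer0.input", [2.0, -0.75])
observer.observe("layer0.output", [-8.0, 4.0])
print(dict(observer.maxabs))  # {'layer0.input': 3.0, 'layer0.output': 8.0}
```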



#### Quantize and run the fp8 model
Here is an example that quantizes the Llama 3.1 8B model based on the previous measurements and runs text generation:  
Change `model_name_or_path` to use a different Llama model, provided it has been measured first.  
Change `world_size` to run on multiple Gaudi cards. 

In [3]:
!QUANT_CONFIG=./quantization_config/maxabs_quant.json python ../gaudi_spawn.py \
--use_deepspeed --world_size 1 run_generation.py \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--use_hpu_graphs \
--use_kv_cache \
--limit_hpu_graphs \
--max_input_tokens 2048 \
--max_new_tokens 2048 \
--batch_size 2 \
--bf16 \
--reuse_cache \
--trim_logits \
--use_flash_attention \
--flash_attention_recompute \
--flash_attention_causal_mask
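To give a rough picture of how a measured max-abs value can be turned into a scale for FP8, the sketch below maps a tensor's measured range onto the FP8 range and clamps outliers. It is a simplified illustration, not the code path INC runs. It assumes the OCP E4M3 maximum of 448 (the range used on a given device may differ), uses hypothetical function names, and omits rounding to the FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # assumption: largest finite value of OCP FP8 E4M3


def maxabs_scale(maxabs, fp8_max=FP8_E4M3_MAX):
    """Scale that maps [-maxabs, maxabs] onto [-fp8_max, fp8_max]."""
    return maxabs / fp8_max if maxabs > 0 else 1.0


def scale_and_clamp(values, scale, fp8_max=FP8_E4M3_MAX):
    """Divide by the scale and clamp to the FP8 range (no rounding here)."""
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


scale = maxabs_scale(7.0)  # 7.0 stands in for a measured max-abs value
scaled = scale_and_clamp([-7.0, 3.5, 14.0], scale)
print(scale)   # 0.015625
print(scaled)  # [-448.0, 224.0, 448.0]
```

The value 14.0 exceeds the measured range and is clamped, which is why the measurement inputs should be representative of the data seen at inference time.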