# LLM optimization examples with NNCF and OpenVINO

>**Note:** Make sure that you have enough disk space, as the environment and the model can take several gigabytes.

## Install pre-requisites

In [1]:
# Uncomment for a new environment

#%pip install diffusers
#%pip install openvino nncf
#%pip install git+https://github.com/huggingface/optimum.git
#%pip install git+https://github.com/huggingface/optimum-intel.git
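
Before running the cells below, it can help to confirm that the packages are visible to the current kernel. The following is a minimal sketch that uses only the standard library; the top-level import names in `REQUIRED` are assumptions about how the packages installed above are exposed, and `missing_packages` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages installed above.
REQUIRED = ("openvino", "nncf", "optimum", "transformers")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing packages:", missing_packages() or "none")
```

If the list is not empty, uncomment the matching `%pip install` line and restart the kernel.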

## Floating-point model inference

In [2]:
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM
import openvino as ov
import time

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
LOCAL_PATH = "Phi-3-mini-4k-instruct"

model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True, load_in_8bit=False, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

template = "<|user|>\n{}<|end|>\n<|assistant|>"
question = "Hey, model! Tell me about LLM?"
prompt = template.format(question)
inputs = tokenizer(prompt, return_tensors="pt")

start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
print("Average latency per token: ", (time.time() - start) / output.shape[1])

answer = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("Answer: ", answer)

model.save_pretrained(LOCAL_PATH)
tokenizer.save_pretrained(LOCAL_PATH)

Framework not specified. Using pt to export the model.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.0+cu121
The model type phi3 is not yet supported to be used with BetterTransformer. Feel free to open an issue at https://github.com/huggingface/optimum/issues if you would like this model type to be supported. Currently supported models are: dict_keys(['albert', 'bark', 'bart', 'bert', 'bert-generation', 'blenderbot', 'bloom', 'camembert', 'blip-2', 'clip', 'codegen', 'data2vec-text', 'deit', 'distilbert', 'electra', 'ernie', 'fsmt', 'gpt2', 'gptj', 'gpt_neo', 'gpt_neox', 'hubert', 'layoutlm',

Average latency per token:  0.46480048232608373
Answer:  Hey, model! Tell me about LLM? An LLM, or a Large Language Model, is a type of artificial intelligence (AI) model designed to understand, generate, and interact with human language. These models are based on deep learning algorithms, particularly a variant of the transformer architecture, which allows them to process and analyze vast amounts of text data.

Large Language Models are trained on diverse and extensive text corpora, such as books, articles, and websites, enabling them to learn patterns, grammar, and context in human language. Some popular examples of LLMs include GPT-3 (developed by Microsoft), BERT (developed by Google), and T5 (also developed by Google).

These models can perform a wide range of language-related tasks, such as:

1. Text generation: Creating coherent and contextually relevant text based on a given prompt or topic.
2. Text completion: Filling in missing words or phrases in a sentence or paragraph.
3. 

('Phi-3-mini-4k-instruct/tokenizer_config.json',
 'Phi-3-mini-4k-instruct/special_tokens_map.json',
 'Phi-3-mini-4k-instruct/tokenizer.model',
 'Phi-3-mini-4k-instruct/added_tokens.json',
 'Phi-3-mini-4k-instruct/tokenizer.json')
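
The timing in the cell above divides the elapsed time by `output.shape[1]`, which counts the prompt tokens as well as the generated ones, so the reported figure is somewhat lower than the time per generated token. A possible alternative is sketched below; `latency_per_new_token` is a hypothetical helper and the numbers in the example call are made up for illustration.

```python
def latency_per_new_token(elapsed_s, total_tokens, prompt_tokens):
    """Seconds per generated token, excluding the prompt tokens."""
    new_tokens = total_tokens - prompt_tokens
    if new_tokens <= 0:
        raise ValueError("no tokens were generated")
    return elapsed_s / new_tokens


# Illustrative numbers only: 16 prompt tokens, 256 generated tokens, 10 s elapsed.
print(latency_per_new_token(10.0, total_tokens=272, prompt_tokens=16))
```

In the notebook this could be called as `latency_per_new_token(time.time() - start, output.shape[1], inputs["input_ids"].shape[1])`. The latency values recorded in this notebook were computed with the original formula.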

## 8-bit weight quantization

In [3]:
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM
import openvino as ov
import time

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

model = OVModelForCausalLM.from_pretrained(LOCAL_PATH, quantization_config=dict(bits=8), compile=False, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_PATH)

template = "<|user|>\n{}<|end|>\n<|assistant|>"
question = "Hey, model! Tell me about LLM?"
prompt = template.format(question)
inputs = tokenizer(prompt, return_tensors="pt")

start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
print("Average latency per token: ", (time.time() - start) / output.shape[1])

answer = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("Answer: ", answer)

ValueError: Loading Phi-3-mini-4k-instruct/ requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error.
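
To give an intuition for what 8-bit weight quantization does to the numbers, here is a toy sketch in plain Python. It maps one row of floating-point weights to unsigned 8-bit integers with a scale and a zero point, then maps them back. This is an illustration of the general idea only and is not NNCF's implementation; the function names and the sample row are made up.

```python
def quantize_u8(weights):
    """Asymmetric 8-bit quantization of one row of weights (toy version)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant row
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_u8(q, scale, zero_point):
    """Map 8-bit codes back to approximate floating-point weights."""
    return [(v - zero_point) * scale for v in q]


row = [-0.42, -0.1, 0.0, 0.27, 0.63]  # made-up weights
q, scale, zp = quantize_u8(row)
restored = dequantize_u8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print("codes:", q)
print("max error within half a step:", max_err <= scale / 2 + 1e-12)
```

Each weight is stored in one byte instead of two or four, at the cost of a rounding error of roughly half a quantization step.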

## 4-bit data-aware weight quantization

In [4]:
import time

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"
QUANTIZED_PATH = "Phi-3-mini-4k-instruct-quantized"

model = OVModelForCausalLM.from_pretrained(
    LOCAL_PATH,
    compile=False,
    trust_remote_code=True,
    quantization_config=dict(bits=4, sym=True, ratio=0.8, dataset="ptb", awq=True, scale_estimation=True)
)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_PATH)

template = "<|user|>\n{}<|end|>\n<|assistant|>"
question = "Hey, model! Tell me about LLM?"
prompt = template.format(question)
inputs = tokenizer(prompt, return_tensors="pt")

start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
print("Average latency per token: ", (time.time() - start) / output.shape[1])

answer = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("Answer: ", answer)

model.save_pretrained(QUANTIZED_PATH)

The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (1195302 > 4096). Running this sequence through the model will result in indexing errors


Output()
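
As a rough guide to what `bits=4` with `ratio=0.8` means for model size, the sketch below estimates the weight footprint. It assumes that `ratio` is the share of weights stored in 4 bits with the remainder kept in 8 bits, that the model has about 3.8 billion parameters, and it ignores the extra storage for scales and zero points. Treat the results as back-of-the-envelope figures, not measurements.

```python
def estimated_weight_gb(num_params, ratio, low_bits=4, backup_bits=8):
    """Rough weight size in GB (10^9 bytes) for mixed-precision compression."""
    avg_bits = ratio * low_bits + (1 - ratio) * backup_bits
    return num_params * avg_bits / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9  # assumed parameter count

print(f"fp16: {estimated_weight_gb(PHI3_MINI_PARAMS, ratio=0.0, backup_bits=16):.2f} GB")
print(f"int8: {estimated_weight_gb(PHI3_MINI_PARAMS, ratio=0.0):.2f} GB")
print(f"int4, ratio=0.8: {estimated_weight_gb(PHI3_MINI_PARAMS, ratio=0.8):.2f} GB")
```

Under these assumptions the mixed 4-bit/8-bit model averages about 4.8 bits per weight.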

## Dynamic quantization and KV-cache quantization

In [None]:
import time

from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

model = OVModelForCausalLM.from_pretrained(
    QUANTIZED_PATH,
    trust_remote_code=True,
    ov_config={"KV_CACHE_PRECISION": "u8", "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32", "PERFORMANCE_HINT": "LATENCY"},
)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_PATH)

template = "<|user|>\n{}<|end|>\n<|assistant|>"
question = "Hey, model! Tell me about LLM?"
prompt = template.format(question)
inputs = tokenizer(prompt, return_tensors="pt")

start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
print("Average latency per token: ", (time.time() - start) / output.shape[1])

answer = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print("Answer: ", answer)

Compiling the model to CPU ...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Average latency per token:  0.053900335889413536
Answer:  Hey, model! Tell me about LLM? Large Language Models (LLMs) are a type of artificial intelligence designed to understand, generate, and sometimes translate human language. They are based on deep learning algorithms, particularly neural networks, that have been trained on vast amounts of text data.


LLMs like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) have significantly impacted natural language processing (NLP). They can perform a variety of tasks such as text classification, question answering, and language translation.


These models work by predicting the likelihood of a sequence of words, which allows them to generate coherent and contextually relevant text. The "transformer" architecture, which LLMs often use, is particularly effective because it processes words in relation to all other words in a sentence, rather than one at a time, enabling a deeper underst
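
To see why `KV_CACHE_PRECISION` matters, the sketch below estimates the KV-cache size at two element widths. The architecture values (32 layers, hidden size 3072, no grouped-query attention) are assumptions about Phi-3-mini that the notebook does not state, the f16 baseline is also an assumption, and the estimate ignores any scales or zero points stored alongside the `u8` cache.

```python
# Assumed Phi-3-mini architecture values (not stated in the notebook).
NUM_LAYERS = 32
HIDDEN_SIZE = 3072  # num_heads * head_dim, assuming no grouped-query attention


def kv_cache_bytes(num_tokens, bytes_per_element):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    return 2 * NUM_LAYERS * HIDDEN_SIZE * num_tokens * bytes_per_element


for label, width in (("f16", 2), ("u8", 1)):
    size_mib = kv_cache_bytes(4096, width) / 2**20
    print(f"{label}: {size_mib:.0f} MiB for a 4096-token context")
```

Under these assumptions, storing the cache in 8 bits halves its footprint for a full 4096-token context.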