# Running quantized models with MLC-LLM

Acknowledgement: This file is adapted from the official MLC-LLM tutorial [https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb](https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb).

This file contains three sections:
- [Environment setup and file download](#environment-setup-and-file-download)
- [Let's Chat!](#lets-chat)
- [Compile your own quantized models with MLC-LLM](#compile-your-own-quantized-models-with-mlc-llm)

## Environment setup and file download

First, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one that matches your hardware and OS.

In [None]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Next, let's download the prebuilt model libraries from HuggingFace. In order to download the large weights, we'll have to use `git lfs`.

In [None]:
!conda install -y git git-lfs
!git lfs install

Download the prebuilt quantized model weights:

In [1]:
!mkdir -p dist/
# download llama-2-7b-chat with w3a16g128 quantization
!git clone https://huggingface.co/ChenMnZ/Llama-2-7b-chat-omniquant-w3a16g128asym ./dist/Llama-2-7b-chat-omniquant-w3a16g128asym
# optional, you can also download llama-2-13b-chat with w3a16g128 quantization
#!git clone https://huggingface.co/ChenMnZ/Llama-2-13b-chat-omniquant-w3a16g128asym ./dist/Llama-2-13b-chat-omniquant-w3a16g128asym



## Let's Chat!

Before we can chat with the model, we must first import a library and instantiate a `ChatModule` instance. The `ChatModule` must be initialized with the appropriate model name.

In [2]:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params", lib_path="dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so")
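
If either path passed to `ChatModule` is wrong, loading fails, so it can help to check both before constructing the module. The following is a minimal sketch and not part of the MLC-LLM API: the helper names `artifact_paths` and `missing_artifacts` are hypothetical, and the `dist/<name>/params` and `dist/<name>/<name>-cuda.so` layout is an assumption taken from the call above.

```python
from pathlib import Path
from typing import List, Tuple


def artifact_paths(
    model_name: str, dist_dir: str = "dist", target: str = "cuda"
) -> Tuple[Path, Path]:
    """Return (params directory, compiled library) for a model.

    Assumes the layout used in the ChatModule call above; other
    downloads or builds may place files differently.
    """
    root = Path(dist_dir) / model_name
    return root / "params", root / f"{model_name}-{target}.so"


def missing_artifacts(
    model_name: str, dist_dir: str = "dist", target: str = "cuda"
) -> List[Path]:
    """List the expected paths that do not exist on disk."""
    return [p for p in artifact_paths(model_name, dist_dir, target) if not p.exists()]


if __name__ == "__main__":
    name = "Llama-2-7b-chat-omniquant-w3a16g128asym"
    for path in missing_artifacts(name):
        print(f"missing: {path}")
```

An empty list means both paths are present and can be passed to `ChatModule` as `model` and `lib_path`.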

That is all that is needed to set up the `ChatModule`. You can now chat with the model by entering any prompt you'd like. Try it out below!

In [3]:
prompt = "What is the meaning of life?"
print(f"Prompt:{prompt}\n")
output = cm.generate(
    prompt=prompt,
    progress_callback=StreamToStdout(callback_interval=3),
)

Prompt:What is the meaning of life?

Ah, a question that has puzzled philosophers and theologians for centuries! The meaning of life is a complex and multi-faceted topic, and there is no one definitive answer. However, here are some possible perspectives on the meaning of life:

1. Religious or Spiritual Perspective: Many people believe that the meaning of life is to fulfill a divine or spiritual purpose. According to this view, the meaning of life is to follow the will of a higher power or to achieve spiritual enlightenment.
2. Personal Fulfillment Perspective: Others believe that the meaning of life is to achieve personal fulfillment and happiness. According to this view, the meaning of life is to pursue one's passions and interests, and to live a life that is fulfilling and satisfying.
3. Social Perspective: Some people believe that the meaning of life is to contribute to the greater good of society. According to this view, the meaning of life is to make a positive impact on the wor

In [4]:
prompt = "How many points did you list out?"
print(f"Prompt:{prompt}\n")
output = cm.generate(
    prompt=prompt,
    progress_callback=StreamToStdout(callback_interval=3),
)

Prompt:How many points did you list out?

I listed out 8 possible perspectives on the meaning of life in my previous response:
1. Religious or Spiritual Perspective
2. Personal Fulfillment Perspective
3. Social Perspective
4. Existentialist Perspective
5. Happiness Perspective
6. Purpose-Driven Perspective
7. Cultural Perspective
8. Personal Identity Perspective

I hope this helps! Let me know if you have any other questions.


You can also run the code block below repeatedly to interact with the model over multiple rounds in a chat style.

In [5]:
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=3))

Of course, here are more details on the third perspective on the meaning of life:
The third perspective on the meaning of life is that it is to contribute to the greater good of society. According to this view, the meaning of life is to make a positive impact on the world and to leave a lasting legacy. This perspective is often associated with altruism and a sense of social responsibility.
Some of the key aspects of this perspective include:
1. Social Justice: The belief that everyone should have equal access to opportunities and resources, and that it is important to work towards creating a more just and equitable society.
2. Community Involvement: The belief that individuals have a responsibility to contribute to the well-being of their community, whether through volunteering, activism, or other forms of service.
3. Personal Responsibility: The belief that individuals have a responsibility to take care of themselves and their loved ones, and to work towards creating a better world fo
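
As an alternative to rerunning the cell by hand, the rounds can be wrapped in a small loop. This is only a sketch: `chat_rounds` is a hypothetical helper, not an MLC-LLM function, and it takes any callable that maps a prompt to a reply so that it can be tried without a loaded model.

```python
from typing import Callable, Iterable, List, Tuple


def chat_rounds(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    stop_words: Tuple[str, ...] = ("exit", "quit"),
) -> List[Tuple[str, str]]:
    """Send prompts to `generate` one at a time until a stop word appears.

    Returns the (prompt, reply) pairs collected so far.
    """
    transcript = []
    for prompt in prompts:
        if prompt.strip().lower() in stop_words:
            break
        transcript.append((prompt, generate(prompt)))
    return transcript


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a model.
    echo = lambda p: f"(reply to: {p})"
    for prompt, reply in chat_rounds(echo, ["Hello", "Tell me more", "exit"]):
        print(f"Prompt: {prompt}\n{reply}\n")
```

With the `cm` instance from above, the callable could be `lambda p: cm.generate(prompt=p, progress_callback=StreamToStdout(callback_interval=3))` and the prompts could come from `iter(lambda: input("Prompt: "), "")`, which stops on an empty line.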

To check the generation speed of the chat bot, you can print the statistics.

In [6]:
print(f"\nStatistics: {cm.stats()}\n")


Statistics: prefill: 240.0 tok/s, decode: 83.6 tok/s
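
To compare runs numerically, the statistics text can be turned into numbers. The sketch below assumes the text has the `name: value tok/s` shape shown in the output above; `parse_stats` is a hypothetical helper, and other MLC-LLM versions may format the statistics differently.

```python
import re
from typing import Dict

# Matches fragments such as "prefill: 240.0 tok/s" (format assumed from the output above).
_STAT_RE = re.compile(r"(\w+):\s*([0-9]*\.?[0-9]+)\s*tok/s")


def parse_stats(stats_text: str) -> Dict[str, float]:
    """Extract throughput figures (tokens per second) from a statistics string."""
    return {name: float(value) for name, value in _STAT_RE.findall(stats_text)}


if __name__ == "__main__":
    print(parse_stats("prefill: 240.0 tok/s, decode: 83.6 tok/s"))
    # {'prefill': 240.0, 'decode': 83.6}
```

With a loaded model, `parse_stats(cm.stats())` would give the same kind of dictionary.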



By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [None]:
cm.reset_chat()

### Benchmark Performance

To benchmark performance, use the `benchmark_generate` method of `ChatModule`. It takes an input prompt and the number of tokens to generate, ignores the system prompt and the model's stop criterion, and generates tokens as a plain language model until it has produced the requested number of tokens. After calling `benchmark_generate`, use `stats` to check the performance.

In [None]:
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()

## Compile your own quantized models with MLC-LLM

First, save the fake-quantized model:

In [None]:
!CUDA_VISIBLE_DEVICES=0 python main.py \
--epochs 0 --output_dir ./log/temp \
--wbits 3 --abits 16 --group_size 128 --lwc \
--model /PATH/TO/LLaMA/Llama-2-7b-chat \
--save_dir /PATH/TO/SAVE/Llama-2-7b-chat-omniquant-w3a16g128 \
--resume /PATH/TO/CHECKPOINTS/Llama-2-7b-chat-w3a16g128.pth
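
For readers new to the term, a fake-quantized (pseudo-quantized) model keeps its weights in floating point, but every weight has been rounded to one of the few values a low-bit code can represent. The sketch below illustrates generic asymmetric group-wise rounding with the `--wbits 3` and `--group_size 128` defaults; it is not the code in `main.py`, it omits the learned clipping enabled by `--lwc`, and `fake_quantize` is a hypothetical name.

```python
from typing import List


def fake_quantize(weights: List[float], n_bits: int = 3, group_size: int = 128) -> List[float]:
    """Round each group of weights to 2**n_bits evenly spaced float values."""
    levels = 2 ** n_bits - 1  # highest integer code, e.g. 7 for 3 bits
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # avoid division by zero for constant groups
        for w in group:
            code = min(levels, max(0, round((w - lo) / scale)))  # integer in [0, levels]
            out.append(code * scale + lo)  # map the code back to floating point
    return out


if __name__ == "__main__":
    print(fake_quantize([0.0, 0.1, 0.6, 1.0], n_bits=1, group_size=4))
    # [0.0, 0.0, 1.0, 1.0]
```

Each group ends up with at most `2**n_bits` distinct values, which is what lets a compiler later store the weights as small integer codes plus per-group parameters.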

To compile the pseudo-quantized model, follow [https://mlc.ai/mlc-llm/docs/compilation/compile_models.html](https://mlc.ai/mlc-llm/docs/compilation/compile_models.html). Note that MLC-LLM defines its own quantization schemes in [mlc_llm/quantization/__init__.py](https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/quantization/__init__.py), so you need to add the W3A16g128 scheme to that file:

In [None]:
    "w3a16g128asym": QuantizationScheme(
        name="w3a16g128asym",
        linear_weight=GroupQuantizationSpec(
            dtype="float16",
            mode="int3",
            sym=False,
            storage_nbit=16,
            group_size=128,
            transpose=False,
        ),
        embedding_table=None,
        final_fc_weight=None,
    ),
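
To get a feel for what these fields imply for memory, here is a rough worked calculation. It rests on assumptions that are not confirmed by this document: that whole 3-bit codes are packed into each 16-bit storage word (so `16 // 3 = 5` codes per word, with one bit unused) and that each group of 128 weights carries one float16 scale and one float16 offset because `sym=False`. The actual MLC-LLM layout may differ, and `bits_per_weight` is a hypothetical helper.

```python
import math


def bits_per_weight(
    weight_bits: int = 3,
    storage_nbit: int = 16,
    group_size: int = 128,
    meta_bits: int = 16,
    asymmetric: bool = True,
) -> float:
    """Estimate the average storage cost per weight under the assumed packing."""
    codes_per_word = storage_nbit // weight_bits  # 16 // 3 = 5
    words_per_group = math.ceil(group_size / codes_per_word)  # ceil(128 / 5) = 26
    payload_bits = words_per_group * storage_nbit  # 26 * 16 = 416
    n_meta = 2 if asymmetric else 1  # scale plus offset, or scale only
    total_bits = payload_bits + n_meta * meta_bits  # 416 + 32 = 448
    return total_bits / group_size


if __name__ == "__main__":
    print(bits_per_weight())  # 3.5
```

Under these assumptions the effective cost is about 3.5 bits per weight rather than exactly 3, since packing leaves one bit per word unused and each group stores its own parameters.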

Other quantization schemes follow the same pattern. Once the scheme is registered, you can compile the model with MLC-LLM, which should be quick:

In [None]:
!python build.py --target cuda --quantization w3a16g128asym --model /PATH/TO/SAVE/Llama-2-7b-chat-omniquant-w3a16g128 --use-cache=0

You will now have a new `dist/` folder containing the compiled model. Refer to the official [MLC-LLM docs](https://mlc.ai/mlc-llm/docs/) for further ways to use quantized models.