
# Getting Started with MLC-LLM using the Llama 2 Model

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Llama2](https://ai.meta.com/llama/) model. For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up your environment so that you can run the `ChatModule`. First, create the Conda environment in which this notebook will run (not required if running in Google Colab).

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, be sure to change your runtime to GPU by going to Runtime > Change runtime type and setting the Hardware accelerator to be "GPU". Select "Connect" on the top right to instantiate your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [1]:
!nvidia-smi

Fri Sep 15 23:35:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
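
The same check can be scripted from Python, which is handy outside a notebook. This is a minimal sketch: `query_gpus` is a hypothetical helper name, and it assumes only that `nvidia-smi` is on the `PATH` whenever an NVIDIA driver is installed.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, driver, memory' string per GPU, or [] if none is found."""
    # No nvidia-smi binary usually means no NVIDIA driver on this machine.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code means the driver is present but not usable.
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected")
```

An empty list means the CUDA wheels installed in the next step are unlikely to work on this machine.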

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one appropriate for your hardware and OS.

In [2]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1574-cp310-cp310-manylinux_2_28_x86_64.whl (98.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.5/98.5 MB 6.7 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev423-cp310-cp310-manylinux_2_28_x86_64.whl (21.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.5/21.5 MB 15.1 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu118)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.2/61.2 kB 1.8 MB/s eta 0:00:00
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Downloading cloudpickle-2.2.1-py3-none-any.whl (2

**Google Colab:** If in Google Colab, you may see a message warning you to restart the runtime. Simply run the following code in a new code cell to restart the runtime.

```python
import os
os.kill(os.getpid(), 9)
```
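
Once the runtime is back, it can be worth confirming that the packages are importable before going further. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name, and `mlc_chat` is the module imported later in this tutorial.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `mlc_chat` is the module this tutorial imports in the chat section.
missing = missing_modules(["mlc_chat"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("mlc_chat is installed")
```

`find_spec` locates the module without importing it, so the check has no side effects.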

Next, let's download the model weights for the Llama2 model and the prebuilt model libraries from GitHub. Because the weights are large, we need `git lfs` to download them.

Note: If you are NOT running in **Google Colab**, you may first need to run `!conda install git git-lfs` to install `git` and `git-lfs`. The following cell then completes the `git lfs` setup.

In [3]:
!git lfs install

Git LFS initialized.


The following commands download many prebuilt libraries, as well as the chat configuration for Llama-2-7b that `mlc_chat` needs. This may take a long time. In **Google Colab**, you can verify that the files are being downloaded by clicking the folder icon on the left and navigating to the `dist` folder and then the `prebuilt` folder, which should update as files arrive.

In [4]:
!mkdir -p dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

Cloning into 'dist/prebuilt/lib'...
remote: Enumerating objects: 290, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 290 (delta 61), reused 49 (delta 47), pack-reused 217
Receiving objects: 100% (290/290), 67.14 MiB | 13.93 MiB/s, done.
Resolving deltas: 100% (210/210), done.
Updating files: 100% (92/92), done.


In [5]:
!cd dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

Cloning into 'mlc-chat-Llama-2-7b-chat-hf-q4f16_1'...
remote: Enumerating objects: 129, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 129 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (129/129), 500.53 KiB | 19.25 MiB/s, done.
Filtering content: 100% (116/116), 3.53 GiB | 68.98 MiB/s, done.
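
If `git lfs` was not initialized before cloning, the weight files are small text stubs (LFS pointers) rather than the real data. The sketch below looks for such stubs. `find_lfs_pointers` is a hypothetical helper name, and the check relies on Git LFS pointer files starting with a fixed `version` line.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(root):
    """Return files under `root` that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


model_dir = Path("dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1")
if model_dir.is_dir():
    stubs = find_lfs_pointers(model_dir)
    print(f"{len(stubs)} file(s) still look like LFS pointers")
else:
    print(f"{model_dir} not found; run the clone cell first")
```

If any stubs are reported, running `git lfs pull` inside the model directory should fetch the real files.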


## Let's Chat!

Before we can chat with the model, we must first import a library and instantiate a `ChatModule` instance. The `ChatModule` must be initialized with the appropriate model name.

In [2]:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

Note that the above invocation abstracts away the logic for finding the relevant model directory and prebuilt library paths. To specify these manually, you could run the following instead (which would be equivalent to the above).

```python
cm = ChatModule(model="dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1", lib_path="dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so")
```
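
When passing paths manually, a quick existence check can save a confusing load error. This is a sketch only: `prebuilt_paths` is a hypothetical helper that mirrors the directory layout used in this tutorial, and the `backend` suffix is assumed to follow the `-cuda.so` pattern shown above.

```python
from pathlib import Path


def prebuilt_paths(model_name, backend="cuda", root="dist/prebuilt"):
    """Build the model directory and library path used in this tutorial's layout."""
    root = Path(root)
    model_dir = root / f"mlc-chat-{model_name}"
    lib_path = root / "lib" / f"{model_name}-{backend}.so"
    return model_dir, lib_path


model_dir, lib_path = prebuilt_paths("Llama-2-7b-chat-hf-q4f16_1")
for path in (model_dir, lib_path):
    print(path, "OK" if path.exists() else "MISSING")
```

If both paths are reported as `OK`, they can be passed to `ChatModule` as `model=str(model_dir)` and `lib_path=str(lib_path)`.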

That is all that is needed to set up the `ChatModule`. You can now chat with the model by entering any prompt you'd like. Try it out below!

In [3]:
output = cm.generate(
    prompt="When was Python released?",
    progress_callback=StreamToStdout(callback_interval=2),
)

Great, thank you for asking! Python was first released in 1991 by Guido van Rossum. He started working on Python in December 1989 while he was still working at the National Research Institute for Mathematics and Computer Science in the Netherlands. The first publicly available version of Python, version 0.9.1, was released on February 20, 1991. Since then, Python has grown to become one of the most popular programming languages in the world, known for its simplicity, readability, and versatility.


You can also rerun the code block below for multiple rounds to interact with the model in a chat style.

In [4]:
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Prompt: I am a novelist and I am writing a deep expensive series of books that will include some very immersive world building. This is intended to be a continuation of the works in the Lovecraft mythos and partially set within the dreamland. I need you to act as a researcher and assistant. Tell me everything you can about The council of elders
Of course, I'm happy to help! The Council of Elders is a fascinating aspect of the Lovecraftian mythos, and I'll do my best to provide you with as much information as possible.

The Council of Elders is a group of ancient, powerful beings who are said to reside in the Dreamland, a realm that exists parallel to our own. According to Lovecraftian lore, the Council is composed of various elder gods, including Azathoth, Yog-Sothoth, and Nyarlathotep, among others. These deities are believed to have created the Dreamland as a refuge for themselves, a place where they could escape the mundane world and indulge in their dark, twisted desires.

The Coun

In [5]:
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Of course! Here's a summary of my response:

The Council of Elders is a group of ancient, powerful beings in the Lovecraftian mythos who reside in the Dreamland. They are believed to manipulate reality, possess vast knowledge, and possess incredible magical abilities. Despite their immense power, the Council is not above criticism, and some have accused them of being manipulative, cruel, and even tyrannical.
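
Outside a notebook, the multi-round pattern above can be written as a loop instead of rerunning a cell. The sketch below is one way to do it. `chat_loop` and its parameters are hypothetical names; the generation call is passed in as a function, so the loop itself does not depend on `mlc_chat`.

```python
def chat_loop(generate, read_prompt=input, quit_words=("exit", "quit")):
    """Call generate(prompt) once per prompt until a quit word or EOF.

    Returns the number of rounds that were sent to the model.
    """
    rounds = 0
    while True:
        try:
            prompt = read_prompt("Prompt: ").strip()
        except EOFError:
            break
        if prompt.lower() in quit_words:
            break
        if not prompt:
            # Ignore empty input rather than sending it to the model.
            continue
        generate(prompt)
        rounds += 1
    return rounds


# Usage with the objects created earlier in this tutorial:
# chat_loop(
#     lambda p: cm.generate(
#         prompt=p, progress_callback=StreamToStdout(callback_interval=2)
#     )
# )
```

Because the `ChatModule` keeps the chat history between calls, each round in the loop sees the earlier ones.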


To check the generation speed of the chat bot, you can print the statistics.

In [19]:
print(cm.stats())

prefill: 173.9 tok/s, decode: 31.0 tok/s
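
To log or compare these numbers, the string can be turned into floats. The sketch below assumes the `prefill: ... tok/s, decode: ... tok/s` format shown in the output above; `parse_stats` is a hypothetical helper name.

```python
import re


def parse_stats(text):
    """Extract prefill and decode speeds (tokens per second) from a stats string."""
    pattern = r"prefill:\s*([\d.]+)\s*tok/s,\s*decode:\s*([\d.]+)\s*tok/s"
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"Unrecognized stats format: {text!r}")
    return {"prefill": float(match.group(1)), "decode": float(match.group(2))}


print(parse_stats("prefill: 173.9 tok/s, decode: 31.0 tok/s"))
# {'prefill': 173.9, 'decode': 31.0}
```

With a live module, the same function would be called as `parse_stats(cm.stats())`.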


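By default, the `ChatModule` keeps a history of your chat. You can reset the history by calling `cm.reset_chat()`. In the cell below that call is commented out, so the history is kept and the previous reply (`output`) is sent back to the model as a new prompt.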

In [14]:
#cm.reset_chat()
print(cm.generate(output))

Great! Here's a summary of the Council of Elders in three sentences:

The Council of Elders is a group of powerful beings in the Lovecraftian mythos who reside in the Dreamland. They manipulate reality, possess vast knowledge, and possess incredible magical abilities. Despite their immense power, the Council is not above criticism, with some accusing them of being manipulative, cruel, and even tyrannical.


### Benchmark Performance

To benchmark performance, we can use the `benchmark_generate` method of `ChatModule`. It takes an input prompt and the number of tokens to generate, ignores the system prompt and the model's stop criterion, and generates tokens as a plain language model would until the requested number of tokens has been produced. After calling `benchmark_generate`, we can call `stats` to check the performance.

In [21]:
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()

 What is the difference between a benchmark and a reference?  What is the difference between a benchmark and a standard?  What is the difference between a benchmark and a yardstick?  What is the difference between a benchmark and a touchstone?  What is the difference between a benchmark and a marker?  What is the difference between a benchmark and a point of reference?  What is the difference between a benchmark and a reference point?  What is the difference between a benchmark and a point of comparison?  What is the difference between a benchmark and a point of reference?  What is the difference between a benchmark and a reference point?  What is the difference between a benchmark and a point of comparison?  What is the difference between a benchmark and a reference point?  What is the difference between a benchmark and a point of reference?  What is the difference between a benchmark and a reference point?  What is the difference between a benchmark and a point of comparison?  What i

'prefill: 43.1 tok/s, decode: 40.0 tok/s'
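
The two rates give a rough estimate of wall-clock time for a request. The sketch below is back-of-the-envelope arithmetic, not a measurement. `estimate_seconds` is a hypothetical helper, and it assumes prefill and decode run one after the other at the reported rates.

```python
def estimate_seconds(prompt_tokens, generate_length, prefill_tps, decode_tps):
    """Estimate total time as prompt prefill time plus token decode time."""
    return prompt_tokens / prefill_tps + generate_length / decode_tps


# Decode time alone for the benchmark above: 512 tokens at 40.0 tok/s.
print(estimate_seconds(0, 512, prefill_tps=43.1, decode_tps=40.0))
# 12.8
```

The prompt token count is not reported by `stats`, so the example passes `0` to isolate the decode time. A real prompt adds `prompt_tokens / prefill_tps` seconds on top.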