
# Getting Started with MLC-LLM using the Llama 2 Model

Here's a quick overview of how to get started with the MLC-LLM `ChatModule` in Python. In this tutorial, we will chat with the [Llama2](https://ai.meta.com/llama/) model. For the easiest setup, we recommend trying this out in a Google Colab notebook. Click the button below to get started!

<a target="_blank" href="https://colab.research.google.com/github/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Environment Setup

Let's set up your environment so that you can run the `ChatModule`. First, create and activate the Conda environment in which this notebook will run (not required if running in Google Colab).

```bash
conda create --name mlc-llm python=3.10
conda activate mlc-llm
```

**Google Colab:** If you are running this in a Google Colab notebook, be sure to change your runtime to GPU by going to Runtime > Change runtime type and setting the Hardware accelerator to be "GPU". Select "Connect" on the top right to instantiate your GPU session.

If you are using CUDA, you can run the following command to confirm that CUDA is set up correctly, and check the version number.

In [2]:
!nvidia-smi

Tue Oct 17 04:14:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
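
If you would rather check this from Python, the CUDA version can be read out of the `nvidia-smi` banner. The sketch below is only an illustration: the helper names `parse_cuda_version` and `detect_cuda_version` are made up for this tutorial, and it assumes the banner contains a `CUDA Version: X.Y` field as in the output above.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output):
    """Return the 'CUDA Version' field from nvidia-smi output, or None if absent."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None


def detect_cuda_version():
    """Run nvidia-smi if it is on PATH and return the reported CUDA version."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools found on this machine
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_cuda_version(result.stdout)


# Prints a version string such as "12.0", or None when no GPU tooling is found.
print(detect_cuda_version())
```

Note that this is the highest CUDA version the driver supports, which may be newer than the CUDA build of the package you install.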

Next, let's install the MLC-AI and MLC-Chat nightly build packages. Go to https://mlc.ai/package/ and replace the command below with the one appropriate for your hardware and OS.

In [3]:
!pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels

Looking in links: https://mlc.ai/wheels
Collecting mlc-ai-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly_cu118-0.12.dev1689-cp310-cp310-manylinux_2_28_x86_64.whl (510.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 510.6/510.6 MB 2.5 MB/s eta 0:00:00
Collecting mlc-chat-nightly-cu118
  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_chat_nightly_cu118-0.1.dev502-cp310-cp310-manylinux_2_28_x86_64.whl (47.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.6/47.6 MB 13.8 MB/s eta 0:00:00
Collecting attrs (from mlc-ai-nightly-cu118)
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting cloudpickle (from mlc-ai-nightly-cu118)
  Using cached cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Collecting decorator (from mlc-ai-nightly-cu118)
  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)
Collectin

**Google Colab:** If in Google Colab, you may see a message warning you to restart the runtime. Simply run the following code in a new code cell to restart the runtime.

```python
import os
os.kill(os.getpid(), 9)
```
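
Once the runtime is back up, a quick import check can confirm that the packages landed in the active environment. This is a minimal sketch: `missing_modules` is a hypothetical helper, and it assumes the `mlc-ai-nightly` wheel exposes a `tvm` module alongside the `mlc_chat` module imported later in this tutorial.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "tvm" is an assumption about what mlc-ai-nightly installs; "mlc_chat" is used below.
missing = missing_modules(["tvm", "mlc_chat"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All MLC packages are importable.")
```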

Next, let's download the Llama2 model weights from Hugging Face and the prebuilt model libraries from GitHub. To download the large weight files, we need `git lfs`.

Note: If you are NOT running in **Google Colab**, you may need to run `!conda install git git-lfs` first to install `git` and `git-lfs`. The following cell then completes the `git lfs` setup.

In [4]:
!git lfs install

Git LFS initialized.


These commands download many prebuilt libraries as well as the chat configuration for Llama-2-7b that `mlc_chat` needs, which may take a long time. In **Google Colab**, you can verify that the download is progressing by clicking the folder icon on the left and navigating to `dist` and then `prebuilt`; the folder contents should update as files arrive.

In [5]:
!mkdir -p dist/prebuilt
!git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

fatal: destination path 'dist/prebuilt/lib' already exists and is not an empty directory.


In [6]:
!cd dist/prebuilt && git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

fatal: destination path 'mlc-chat-Llama-2-7b-chat-hf-q4f16_1' already exists and is not an empty directory.
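
The `fatal: destination path ... already exists` messages above simply mean that the repositories were already cloned by an earlier run. If you re-run the notebook often, a small guard can skip the clone when the target folder is already populated. The helper `clone_if_missing` below is hypothetical and only a sketch; it does not check that an existing folder holds a complete download.

```python
import subprocess
from pathlib import Path


def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` is already a non-empty directory.

    Returns True if a clone was started, False if it was skipped.
    """
    dest = Path(dest)
    if dest.is_dir() and any(dest.iterdir()):
        return False  # already populated, leave it alone
    dest.parent.mkdir(parents=True, exist_ok=True)
    run(["git", "clone", url, str(dest)], check=True)
    return True


# Example calls, commented out so the sketch does not hit the network:
# clone_if_missing("https://github.com/mlc-ai/binary-mlc-llm-libs.git",
#                  "dist/prebuilt/lib")
# clone_if_missing("https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
#                  "dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1")
```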


## Let's Chat!

Before we can chat with the model, we must first import a library and instantiate a `ChatModule` instance. The `ChatModule` must be initialized with the appropriate model name.

In [7]:
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

Note that the above invocation abstracts away the logic for finding the relevant model directory and prebuilt library paths. To specify these manually, you could run the following instead (which would be equivalent to the above).

```python
cm = ChatModule(model="dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1", lib_path="dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so")
```
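
If the manual form fails, it is usually worth confirming that both paths exist before constructing the `ChatModule`. The sketch below rebuilds the two paths from the model name using the layout shown above. `prebuilt_paths` is a hypothetical helper, and the `backend` suffix is an assumption generalized from the single `-cuda.so` file name in this tutorial.

```python
from pathlib import Path


def prebuilt_paths(model_name, root="dist/prebuilt", backend="cuda"):
    """Return (model_dir, lib_path) following the layout used in this tutorial."""
    root = Path(root)
    model_dir = root / f"mlc-chat-{model_name}"
    lib_path = root / "lib" / f"{model_name}-{backend}.so"
    return model_dir, lib_path


model_dir, lib_path = prebuilt_paths("Llama-2-7b-chat-hf-q4f16_1")
for path in (model_dir, lib_path):
    status = "found" if path.exists() else "MISSING"
    print(f"{status}: {path}")
```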

That is all that is needed to set up the `ChatModule`. You can now chat with the model by entering any prompt you'd like. Try it out below!

In [8]:
output = cm.generate(
    prompt="When was Python released?",
    progress_callback=StreamToStdout(callback_interval=2),
)

Great, thank you for asking! Python was first released in 1991 by Guido van Rossum. The first version, Python 0.9.1, was released on February 20, 1991. Since then, Python has undergone many updates and improvements, with the latest version being Python 3.10.3, which was released on August 26, 2022.


You can also run the code block below repeatedly to hold a multi-round, chat-style conversation with the model.

In [9]:
# @title Default title text
prompt = input("Prompt: ")
output = cm.generate(prompt=prompt, progress_callback=StreamToStdout(callback_interval=2))

Prompt: Cognitive comlexity
Great, thank you for asking! Cognitive complexity refers to the degree of difficulty or challenge involved in understanding or solving a problem. It is a measure of how complex or intricate a problem is, and how much mental effort or cognitive resources are required to solve it.
Cognitive complexity can be thought of as a multidimensional construct, with different dimensions representing different aspects of complexity. For example, there may be different dimensions for:
1. Conceptual complexity: This refers to the number and complexity of concepts or ideas involved in a problem. For example, a problem that involves multiple variables, relationships, or concepts may have high conceptual complexity.
2. Logical complexity: This refers to the number and complexity of logical operations or reasoning required to solve a problem. For example, a problem that involves multiple logical operations, such as AND, OR, and NOT, may have high logical complexity.
3. Tempora

In [11]:
output = cm.generate(
    prompt="Please summarize your response in three sentences.",
    progress_callback=StreamToStdout(callback_interval=2),
)

Sure, here's a summary of my response:
Cognitive complexity refers to the difficulty or challenge involved in understanding or solving a problem, which can be measured across different dimensions such as conceptual, logical, temporal, spatial, syntactic, semantic, pragmatic, cognitive load, and knowledge complexity. Understanding and managing cognitive complexity is crucial for problem-solving, learning, and decision-making in various domains. By recognizing and categorizing cognitive complexity, individuals and organizations can better navigate complex problems and make informed decisions.
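
Instead of re-running a cell for every turn, the rounds above can be folded into a loop. This is a minimal sketch, not part of `mlc_chat`: `chat_loop` and its stop words are hypothetical, and the generation call is passed in as a function so the loop can be tried without a GPU.

```python
def chat_loop(generate, read_prompt=input, stop_words=("exit", "quit")):
    """Prompt repeatedly until the user enters a stop word or an empty line.

    `generate` is any callable that takes the prompt text.
    Returns the number of completed rounds.
    """
    rounds = 0
    while True:
        prompt = read_prompt("Prompt: ").strip()
        if not prompt or prompt.lower() in stop_words:
            return rounds
        generate(prompt)
        rounds += 1


# Usage with the ChatModule from this tutorial (commented out: needs the model loaded):
# chat_loop(lambda p: cm.generate(
#     prompt=p, progress_callback=StreamToStdout(callback_interval=2)))
```

Because the `ChatModule` keeps the chat history, each round in the loop sees the previous turns, just as the summary request above did.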


To check the generation speed of the chatbot, you can print the statistics.

In [12]:
print(cm.stats())

prefill: 103.5 tok/s, decode: 27.0 tok/s
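
The statistics come back as plain text, so comparing runs is easier after turning them into numbers. The sketch below assumes the string keeps the `name: value tok/s` shape shown above, which may differ in other versions; `parse_stats` is a hypothetical helper.

```python
import re


def parse_stats(stats_text):
    """Turn 'prefill: 103.5 tok/s, decode: 27.0 tok/s' into a dict of floats."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)\s*tok/s", stats_text)
    return {name: float(value) for name, value in pairs}


print(parse_stats("prefill: 103.5 tok/s, decode: 27.0 tok/s"))
# → {'prefill': 103.5, 'decode': 27.0}
```

With the module loaded, `parse_stats(cm.stats())` would apply the same parsing to a live run.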


By default, the `ChatModule` will keep a history of your chat. You can reset the chat history by running the following.

In [None]:
cm.reset_chat()

### Benchmark Performance

To benchmark the performance, we can use the `benchmark_generate` method of `ChatModule`. It takes an input prompt and the number of tokens to generate. It ignores the system prompt and the model's stop criterion, generates tokens as a plain language model would, and stops only once the requested number of tokens has been generated. After calling `benchmark_generate`, we can use `stats` to check the performance.

In [13]:
print(cm.benchmark_generate(prompt="What is benchmark?", generate_length=512))
cm.stats()


 everybody uses benchmark to compare the performance of different systems or algorithms, but what is it really?

Benchmarking is the process of measuring the performance of a system, application, or algorithm by running it through a series of predefined tests or tasks. The results of these tests are then analyzed to determine the performance of the system, application, or algorithm in various areas, such as speed, efficiency, scalability, and reliability.
The goal of benchmarking is to provide a fair and consistent comparison of different systems or algorithms, allowing users to make informed decisions about which one is best suited for their specific needs. Benchmarking can be used in a variety of fields, including computer science, engineering, finance, and medicine, and is an essential tool for evaluating the performance of complex systems and algorithms.
There are several types of benchmarking, including:
1. Absolute benchmarking: This involves measuring the performance of a syste

'prefill: 40.6 tok/s, decode: 30.3 tok/s'
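
As a rough sanity check, the decode rate translates directly into wall-clock time for the generated tokens. The helper `estimate_decode_seconds` below is hypothetical; it ignores prefill time and assumes the decode rate stays constant over the whole run.

```python
def estimate_decode_seconds(num_tokens, decode_tok_per_s):
    """Estimate the time spent decoding `num_tokens` at a steady decode rate."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return num_tokens / decode_tok_per_s


# 512 tokens at the 30.3 tok/s decode rate reported above.
print(f"{estimate_decode_seconds(512, 30.3):.1f} s")  # → 16.9 s
```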