# Notebook 4.1: Run Transformer Models

BigDL-LLM supports the optimization of any Hugging Face *transformers* model, allowing for efficient inference with significantly reduced latency. With the help of BigDL-LLM, PyTorch models (in FP16/BF16/FP32) from Hugging Face can be loaded with implicit quantization, so that compute-heavy operations in the Transformer can be sped up through low precision (such as INT4, INT5 or INT8).

In this tutorial, we will dive into the main usage of the BigDL-LLM Transformers-style API for low-precision optimizations. Based on that, we will build a chatbot application.

## 4.1.1 Install BigDL-LLM

Follow the instructions in [Chapter 2](../ch_2_Environment_Setup/) to set up your environment if you haven't done so. Then install `bigdl-llm`:

In [None]:
!pip install BigDL-LLM[all]

## 4.1.2 Load Model

To leverage the benefits of BigDL-LLM, the first step is to load the transformers model with BigDL-LLM's low-precision optimizations. There are several use cases, which include loading models in low-precision, as well as saving and loading low-precision models.

For illustration purposes, let's take model [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

### 4.1.2.0 Download Llama 2 (7B)

To download the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from Hugging Face, you will need to obtain access granted by Meta. Please follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.

After receiving the access, download the model with your Hugging Face token:

In [None]:
from huggingface_hub import snapshot_download

# repo_id takes the form "<organization>/<model name>", without a leading slash
model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token

> **Note**
>
> The model will by default be downloaded to `HF_HOME='~/.cache/huggingface'`.

### 4.1.2.1 Load Model in Low Precision

One common use case is to load a Hugging Face *transformers* model in low precision, i.e. conduct **implicit** quantization while loading.

For Llama 2 (7B), you can simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and pass either `load_in_4bit=True` or the `load_in_low_bit` parameter to the `from_pretrained` function. Compared to the Hugging Face *transformers* API, only minor code changes are required.

**For INT4 Optimizations (with `load_in_4bit=True`):**

In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM

model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_4bit=True)

> **Note**
>
> BigDL-LLM supports `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.

**For INT8 Optimizations (with `load_in_low_bit="sym_int8"`):**

In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
model_in_8bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_low_bit="sym_int8")

> **Note**
>
> Currently, `load_in_low_bit` supports the options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` and `'sym_int8'`, in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization.
>
> It is worth mentioning that `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.
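
To make the 'sym' versus 'asym' distinction more concrete, here is a minimal pure-Python sketch of the two schemes at 4 bits. It is a textbook illustration only, not BigDL-LLM's actual kernels: the function names and the sample `weights` are hypothetical, and real implementations typically quantize in small blocks rather than over a whole list.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric: one scale, no offset; codes lie in [-2**(bits-1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4 bits
    qmin = -(2 ** (bits - 1))            # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize_symmetric(codes, scale):
    return [c * scale for c in codes]


def quantize_asymmetric(values, bits=4):
    """Asymmetric: the range [min, max] is mapped onto the codes [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1                 # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_asymmetric(codes, scale, offset):
    return [c * scale + offset for c in codes]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [0.05, 0.10, 0.42, 0.77, 1.00]   # hypothetical, all-positive weights

sym_codes, sym_scale = quantize_symmetric(weights)
asym_codes, asym_scale, asym_offset = quantize_asymmetric(weights)

sym_error = max_error(weights, dequantize_symmetric(sym_codes, sym_scale))
asym_error = max_error(weights, dequantize_asymmetric(asym_codes, asym_scale, asym_offset))

print(f"symmetric : scale={sym_scale:.4f}, max error={sym_error:.4f}")
print(f"asymmetric: scale={asym_scale:.4f}, max error={asym_error:.4f}")
```

In this sketch the asymmetric scheme uses a smaller step size for the all-positive sample, because it spends all 16 codes on the range actually occupied by the values, at the cost of storing an extra offset.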

The corresponding tokenizer of Llama 2 (7B) can be loaded with the official Hugging Face *transformers* API:

In [None]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf")

### 4.1.2.2 Save & Load Low-Precision Model

When conducting implicit quantization while loading a model, BigDL-LLM converts the linear layers in the model into a low-precision format. Taking INT4 as an example: in theory, a model with *X* billion parameters stored in 16-bit or 32-bit precision requires approximately 2*X* or 4*X* GB of memory just to load the original weights before they are converted to 4 bit. Thus, for extremely large models such as the 40B Falcon, 70B Llama 2 or 176B Bloom, loading them with BigDL-LLM's implicit low-precision quantization can be both resource-intensive and time-consuming, and may even be impossible on memory-limited machines.
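
The arithmetic behind these figures can be sketched in a few lines. This is a back-of-the-envelope estimate under stated assumptions: it counts weights only (no activations, KV cache, or per-block quantization metadata), treats 1 GB as 10^9 bytes, and `estimate_weight_memory_gb` is a hypothetical helper, not part of BigDL-LLM.

```python
def estimate_weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (X * 1e9 parameters * bytes per parameter) / 1e9 bytes per GB
    return num_params_billion * bytes_per_param


for bits in (32, 16, 4):
    print(f"7B model at {bits:>2}-bit: ~{estimate_weight_memory_gb(7, bits):.1f} GB")
```

By this estimate, a 70B model stored in 16 bit needs roughly 140 GB just for the original weights, which is why quantizing once and reusing the saved low-precision model is attractive.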

To address this issue, BigDL-LLM provides support for saving *transformers* models in BigDL-LLM low-precision format. Once the model is optimized and saved in this format, it can be loaded directly for subsequent inference, eliminating the need for repeated quantization. The saving and loading process can be completed on different machines.

**Save Low-Precision Model**

Let's take the `model_in_4bit` in section [4.1.2.1](#4121-load-model-in-low-precision) as an example. After loading Llama 2 (7B) in 4 bit, we can use the `save_low_bit` function to save the optimized model:

In [None]:
save_directory='./llama-2-7b-bigdl-llm-4-bit'

model_in_4bit.save_low_bit(save_directory)

We recommend saving the tokenizer in the same directory as the optimized model to simplify the subsequent loading process:

In [None]:
tokenizer.save_pretrained(save_directory)

**Load Low-Precision Model**

We can load the optimized low-precision model through the `load_low_bit` function, and load the tokenizer from the same saved directory:

In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
loaded_4bit_model = AutoModelForCausalLM.load_low_bit(save_directory)

loaded_tokenizer = LlamaTokenizer.from_pretrained(save_directory)

## 4.1.3 Run Model

By using a BigDL-LLM optimized *transformers* model in low precision, you can run the model with reduced latency. The basic usage of the optimized model for direct text completion or token prediction can be found in [Chapter 3](../ch_3_Quick_Start/). Additionally, this tutorial introduces some advanced usages for large language models with BigDL-LLM low-precision optimizations.

### 4.1.3.1 Chat

One common application of large language models is as chatbots, where they can engage in interactive conversations. Chatbot interaction is not based on any magic; instead, it still relies on the prediction and generation of text by large language models. These models use formatted, incomplete conversation context as input to generate appropriate responses. For example, consider the following context:

```
### HUMAN:

What is AI?

### RESPONSE:
```

In multi-turn chatting, the generated texts by models are added into the existing conversation context. This allows for a continuous conversation flow:

```
### HUMAN:

What is AI?

### RESPONSE:

AI is a term used to describe the development of computer systems that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images.

### HUMAN:

Is it dangerous?

### RESPONSE:
```

Here is a multi-turn chat example using the official `transformers` API with the BigDL-LLM optimized Llama 2 (7B) model. First, we need to define the conversation context format for the model to complete:

In [None]:
HUMAN_ID = "### HUMAN:\n\n"
BOT_ID = "### RESPONSE:\n\n"

def format_prompt(input_str, chat_history):
    prompt = ""
    for history_input, history_response in chat_history:
        prompt += f"{HUMAN_ID}{history_input}\n\n{BOT_ID}{history_response}\n\n"
    prompt += f"{HUMAN_ID}{input_str}\n\n{BOT_ID}"
    return prompt
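
As a quick check of the format, the snippet below builds the prompt for a second turn. It repeats the same `format_prompt` logic so that it runs on its own; the one-turn history and its response text are hypothetical.

```python
HUMAN_ID = "### HUMAN:\n\n"
BOT_ID = "### RESPONSE:\n\n"


def format_prompt(input_str, chat_history):
    # same logic as the cell above, repeated so this snippet is self-contained
    prompt = ""
    for history_input, history_response in chat_history:
        prompt += f"{HUMAN_ID}{history_input}\n\n{BOT_ID}{history_response}\n\n"
    prompt += f"{HUMAN_ID}{input_str}\n\n{BOT_ID}"
    return prompt


# hypothetical one-turn history: (user input, model response)
example_history = [("What is AI?", "AI is the study of intelligent machines.")]
example_prompt = format_prompt("Is it dangerous?", example_history)
print(example_prompt)
```

The prompt always ends with the `### RESPONSE:` marker, so the model's continuation is the answer to the latest question.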

A stopping criterion for text generation is also defined here to prevent Llama 2 (7B) from self-questioning, i.e. from continuing the conversation by writing the next `### HUMAN:` turn itself:

In [None]:
from transformers.tools.agents import StopSequenceCriteria
from transformers.generation.stopping_criteria import StoppingCriteriaList

stop_word = "###"
stopping_criteria = StoppingCriteriaList([StopSequenceCriteria(stop_word, tokenizer)])
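
Conceptually, a stop-sequence criterion checks the text generated so far after each step and halts once the stop word shows up. The pure-Python sketch below imitates that idea on a hypothetical list of already-decoded text pieces; it does not use the *transformers* API, and the real `StopSequenceCriteria` may be implemented differently.

```python
def generate_until_stop(pieces, stop_word):
    """Accumulate text piece by piece and halt once the stop word appears."""
    text = ""
    for piece in pieces:
        text += piece
        if stop_word in text:
            break   # analogous to a stopping criterion returning True
    return text


# hypothetical stream of decoded text pieces, including a self-asked question
fake_pieces = ["AI is ", "a branch of ", "computer science.", "\n\n",
               "###", " HUMAN:", "\n\nWhat else?"]

raw_output = generate_until_stop(fake_pieces, stop_word="###")
clean_output = raw_output.replace("###", "").rstrip()
print(repr(clean_output))
```

The stop word itself still ends up in the raw output, which is why the chat code in this tutorial removes it and strips trailing whitespace before printing the response.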

Next, we can define the `chat` function, which continuously adds model outputs to the chat history. This ensures that the conversation context is properly formatted for the next round of response generation:

In [None]:
def chat(model, tokenizer, input_str, chat_history):
    # format conversation context as prompt through chat history
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # predict next tokens with stopping_criteria
    output_ids = model.generate(input_ids,
                                max_new_tokens=128,
                                stopping_criteria=stopping_criteria)

    output_str = tokenizer.decode(output_ids[0][len(input_ids[0]):], # skip prompt in generated tokens
                                  skip_special_tokens=True)
    # strip the stop word and trailing whitespace once, then reuse the cleaned text
    response = output_str.replace(stop_word, "").rstrip()
    print(f"Response: {response}")

    # add the user input (not the built-in `input` function) and model output to the chat history
    chat_history.append((input_str, response))

> **Note**
>
> BigDL-LLM optimized low-precision models are compatible with all Hugging Face *transformers* APIs. Therefore, in addition to using the `generate` function for token prediction, you can also utilize other methods such as the [`TextGenerationPipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextGenerationPipeline).

We can then facilitate interactive, multi-turn chat between humans and the bot by allowing for continuous user input:

In [11]:
import torch

chat_history = []

while True:
    with torch.inference_mode():
        user_input = input("Input: ")
        if user_input == "stop": # let's stop the conversation when user input "stop"
            print("Chat with Llama 2 (7B) stopped.")
            break
        chat(model=model_in_4bit,
             tokenizer=tokenizer,
             input_str=user_input,
             chat_history=chat_history)

Input:  What is AI?


Response: AI is a branch of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing images, making decisions, and solving problems. AI research involves developing algorithms and models that can learn from data, make predictions, and take actions based on that data.


Input:  Is it dangerous?


Response: The potential dangers of AI are a topic of ongoing debate and research. Some concerns include:

1. Job displacement: AI could automate many jobs currently performed by humans, leading to job loss and economic disruption.
2. Bias and discrimination: AI systems can perpetuate and even amplify existing biases and discrimination if they are trained on biased data.
3. Privacy and security risks: AI systems can potentially collect and process large amounts of personal data, which can raise privacy and security concerns.
4. Autonomous weapons: The


Input:  stop


Chat with Llama 2 (7B) stopped.


### 4.1.3.2 Stream Chat

Stream chat can be considered as an advanced function for a chatbot, where the response is generated word by word. Here, we define the `stream_chat` function with the help of `transformers.TextIteratorStreamer`:

In [None]:
from transformers import TextIteratorStreamer

def stream_chat(model, tokenizer, input_str, chat_history):
    # format conversation context as prompt through chat history
    prompt = format_prompt(input_str, chat_history)
    input_ids = tokenizer([prompt], return_tensors='pt')

    streamer = TextIteratorStreamer(tokenizer,
                                    skip_prompt=True, # skip prompt in the generated tokens
                                    skip_special_tokens=True)

    generate_kwargs = dict(
        input_ids,
        streamer=streamer,
        max_new_tokens=128,
        stopping_criteria=stopping_criteria
    )
    
    # to ensure non-blocking access to the generated text, the generation process should be run in a separate thread
    from threading import Thread
    
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    output_str = []
    print("Response: ", end="")
    for stream_output in streamer:
        output_str.append(stream_output)
        print(stream_output.replace(stop_word, ""), end="")

    # add model output to the chat history
    chat_history.append((input_str, ''.join(output_str).replace(stop_word, "").rstrip()))

> **Note**
>
> To successfully observe the text streaming behavior in standard output, we need to set the environment variable `PYTHONUNBUFFERED=1` to ensure that the standard output streams are directly sent to the terminal without being buffered first.
>
> The [Hugging Face *transformers* streamer classes](https://huggingface.co/docs/transformers/main/generation_strategies#streaming) are currently under development and are subject to future changes.
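
The reason generation runs in a separate thread can be illustrated with a standard-library producer/consumer sketch: a worker thread pushes text pieces into a queue while the main thread prints them as they arrive. Everything here is a stand-in (`fake_generate`, `iterate_stream` and the word list are hypothetical), and the internals of `TextIteratorStreamer` may differ.

```python
import queue
import threading
import time

_END = object()   # sentinel marking the end of the stream


def fake_generate(text_queue, words, delay=0.01):
    """Producer: stands in for model.generate running in a worker thread."""
    for word in words:
        time.sleep(delay)          # pretend each piece takes time to compute
        text_queue.put(word)
    text_queue.put(_END)


def iterate_stream(text_queue):
    """Consumer: yield pieces as soon as they arrive, until the sentinel."""
    while True:
        item = text_queue.get()    # blocks until the producer puts something
        if item is _END:
            return
        yield item


text_queue = queue.Queue()
words = ["CPU ", "stands ", "for ", "Central ", "Processing ", "Unit."]
worker = threading.Thread(target=fake_generate, args=(text_queue, words))
worker.start()

collected = []
for piece in iterate_stream(text_queue):
    collected.append(piece)
    print(piece, end="", flush=True)   # flush so each piece shows up immediately
print()
worker.join()
```

If the producer ran in the main thread instead, nothing could be printed until it had finished, which would defeat the purpose of streaming.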

We can then achieve interactive, multi-turn stream chat between humans and the bot by allowing continuous user input as before:

In [14]:
chat_history = []

while True:
    with torch.inference_mode():
        user_input = input("Input: ")
        if user_input == "stop": # let's stop the conversation when user input "stop"
            print("Stream Chat with Llama 2 (7B) stopped.")
            break
        stream_chat(model=model_in_4bit,
                    tokenizer=tokenizer,
                    input_str=user_input,
                    chat_history=chat_history)

Input:  What is CPU?


Response: CPU stands for Central Processing Unit. It is the primary component of a computer that performs calculations and executes instructions. The CPU is responsible for fetching instructions from memory, decoding them, executing them, and storing the results. It is the "brain" of the computer and performs all the calculations and operations required to run software and applications.



Input:  What is the difference between it and GPU?


Response: The main difference between a CPU and a GPU (Graphics Processing Unit) is their purpose and design. A CPU is designed to perform general-purpose computing tasks, such as running applications, web browsers, and operating systems. It is a "jack-of-all-trades" that can perform a wide range of tasks. On the other hand, a GPU is specifically designed to perform complex mathematical operations at high speeds, such as graphics rendering, scientific simulations, and machine learning. It is a "master-of-one" that excels at a specific set of tasks.

While a CPU can perform

Input:  stop


Stream Chat with Llama 2 (7B) stopped.


## 4.1.4 What's Next?

In the next tutorial, we will guide you through a speech recognition pipeline that incorporates BigDL-LLM INT4 optimizations.