Create an LLM chatbot using intel OpenVINO

In the fast-paced realm of artificial intelligence (AI), chatbots have become essential tools for businesses, enhancing customer interactions and streamlining operations. Large Language Models (LLMs) are advanced AI systems designed to comprehend and generate human language. By employing deep learning algorithms and vast datasets, they grasp the intricacies of language, producing coherent and relevant responses.

While traditional intent-based chatbots handle basic, single-touch inquiries such as order management, FAQs, and policy questions, LLM-powered chatbots excel at addressing more complex, multi-touch queries. These chatbots offer support in a conversational manner akin to human interactions, utilizing contextual memory. By harnessing the power of Language Models, chatbots are growing increasingly sophisticated, with an impressive ability to understand and respond to human language accurately.

Install required dependencies

In [1]:
import os
os.environ["GIT_CLONE_PROTECTION_ACTIVE"] = "false"
%pip install -Uq pip
%pip uninstall -q -y optimum optimum-intel
%pip install --pre -Uq openvino openvino-tokenizers[transformers] --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu\
"git+https://github.com/huggingface/optimum-intel.git"\
"git+https://github.com/openvinotoolkit/nncf.git"\
"torch>=2.1"\
"datasets" \
"accelerate"\
"gradio>=4.19"\
"onnx" "einops" "transformers_stream_generator" "tiktoken" "transformers>=4.38.1" "bitsandbytes"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import openvino as ov
core = ov.Core()
import os

In [3]:
devices = core.available_devices

for x in devices:
    device_name = core.get_property(x,"FULL_DEVICE_NAME")
    print(f"{x}: {device_name}")

CPU: Intel(R) Core(TM) i5-10300H CPU @ 2.50GHz
GPU: Intel(R) UHD Graphics (iGPU)


In [4]:
import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices,
    value=core.available_devices[0],
    description="Device:",
    disabled=False,
)

device

Dropdown(description='Device:', options=('CPU', 'GPU'), value='CPU')

#IMPORTANT BELOW

The model is read using read_model() and compiled using compile_model()

from pathlib import Path

model_id = "Intel/gpt2"
model_path = "gpt2-ov-int4"

if not Path(model_path).exists():
    !optimum-cli export openvino --model {model_id} --weight-format int8 {model_path}

In [5]:
%pip install --upgrade-strategy eager "optimum[openvino,nncf]" langchain-huggingface --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_huggingface import HuggingFacePipeline

ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="gpt2",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 150},
)

Framework not specified. Using pt to export the model.
Using framework PyTorch: 2.3.1+cpu
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Overriding 1 configuration item(s)
	- use_cache -> True
  if batch_size == 1 and attention_mask is not None and attention_mask[0, 0, -1, -1] < -1:
  if batch_size == 1 or self.training:
  if query_length > 1:


['input_ids', 'past_key_values', 'attention_mask', 'position_ids']


Compiling the model to CPU ...


In [3]:
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | ov_llm

question = "How big is an elephant"

print(chain.invoke({"question": question}))

Question: How big is an elephant

Answer: Let's think step by step. Let's start with the larger.

First, let's look at the top row. That is, the bottom row is the largest amount of animals to consider in terms of size.

First, we look for the largest animal. In order to figure out where the elephant is, all you need to do is check this list of all animals on this page and then the page's largest size. That means that this page had about 7,000 elephants, with a weight of 814 and a length of 8.1 miles. However, only a little over 100,000 of them are real elephants.

Next, let's consider the second row. This is a total of 4 elephants. Each of them weighs in at about


The below cell is to format the model and export it in ov_model_dir with an 8 bit precision

## Convert a Model Using the Optimum-CLI Tool
[Back to Top ⬆️](#Table-of-contents:)

🤗 [Optimum Intel](https://huggingface.co/docs/optimum/intel/index) serves as the bridge between the 🤗 [Transformers](https://huggingface.co/docs/transformers/index) and [Diffusers](https://huggingface.co/docs/diffusers/index) libraries and OpenVINO, facilitating the acceleration of end-to-end pipelines on Intel architectures. It offers a user-friendly CLI interface for exporting models to the [OpenVINO Intermediate Representation (IR)](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format.

The command below demonstrates a basic model export using `optimum-cli`:

```
optimum-cli export openvino --model <model_id_or_path> --task <task> <out_dir>
```

In this command:
- The `--model` argument specifies the model ID from the HuggingFace Hub or a local directory containing the model (saved using the `.save_pretrained` method).
- The `--task` argument specifies one of the [supported tasks](https://huggingface.co/docs/optimum/exporters/task_manager) that the exported model should perform. For LLMs, this would be `text-generation-with-past`.
- If model initialization requires remote code, the `--trust-remote-code` flag should also be included.

In [8]:
!optimum-cli export openvino --model gpt2  --weight-format int8 ov_model_dir

['input_ids', 'past_key_values', 'attention_mask', 'position_ids']
INFO:nncf:Statistics of the bitwidth distribution:
+----------------+-----------------------------+----------------------------------------+
|   Num bits (N) | % all parameters (layers)   | % ratio-defining parameters (layers)   |
|              8 | 100% (50 / 50)              | 100% (50 / 50)                         |
+----------------+-----------------------------+----------------------------------------+

Applying Weight Compression ---------------------   0% 0/50 • 0:00:00 • -:--:--
Applying Weight Compression ---------------------   0% 0/50 • 0:00:02 • -:--:--
Applying Weight Compression  --------------------   4% 2/50 • 0:00:05 • 0:01:30
Applying Weight Compression - -------------------   6% 3/50 • 0:00:09 • 0:02:05
Applying Weight Compression -- ------------------  10% 5/50 • 0:00:14 • 0:01:47
Applying Weight Compression -- ------------------  12% 6/50 • 0:00:20 • 0:01:52
Applying Weight Compression -- ----------

Framework not specified. Using pt to export the model.
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using framework PyTorch: 2.3.1+cpu
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Overriding 1 configuration item(s)
	- use_cache -> True
  if batch_size == 1 and attention_mask is not None and attention_mask[0, 0, -1, -1] < -1:
  if batch_size == 1 or self.training:
  if query_length > 1:


In [4]:
ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 160},
)

chain = prompt | ov_llm

question = "what is christmas"

print(chain.invoke({"question": question}))

Compiling the model to CPU ...


Question: what is christmas

Answer: Let's think step by step.

When I was first starting, my religion was "magnificent" but it was very different from my life with the church. In fact (the year 2000 was the great secular holiday, to me very well, and that included The Feast of St. John the Divine, the Christian holiday, which marked its day, by the very same days). So it was like a world-wide campaign. If we had been to look for a place, the church would have a much larger presence. I had to try a few things to get on board. This, however, was not the plan. I kept to my original plans of going to church each day, sometimes for one hour, sometimes two, always at night or on Sunday or the following day, for about fifteen of my religious


In [5]:
ov_config = {
    "KV_CACHE_PRECISION": "u8",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
    "PERFORMANCE_HINT": "LATENCY",
    "NUM_STREAMS": "1",
    "CACHE_DIR": "",
}

In [6]:
from threading import Thread

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    ov_llm.pipeline.tokenizer,
    timeout=30.0,
    skip_prompt=True,
    skip_special_tokens=True,
)
pipeline_kwargs = {"pipeline_kwargs": {"streamer": streamer, "max_new_tokens": 100}}
chain = prompt | ov_llm.bind(**pipeline_kwargs)

t1 = Thread(target=chain.invoke, args=({"question": question},))
t1.start()

for new_text in streamer:
    print(new_text, end="", flush=True)

 I hope. It's something for the weekend as I didn't get the chance to see Christ on Friday, so I'll look back at it.


4. For those of you who wish that Christmas would not come, I'm really not sure. Why would you want to go out and expect it to come on Christmas day? Because it could potentially upset a couple of hundred people.


5. Well then, let's have a drink. You can have a drink and some music

In [7]:
import gradio as gr
import random

In [8]:
from langchain_huggingface import HuggingFacePipeline

ov_llm = HuggingFacePipeline.from_model_id(
    model_id="ov_model_dir",
    task="text-generation",
    backend="openvino",
    model_kwargs={"device": "CPU", "ov_config": ov_config},
    pipeline_kwargs={"max_new_tokens": 50},
)
from langchain_core.prompts import PromptTemplate

question = "How big is an elephant"

print(chain.invoke({"question": question}))

Compiling the model to CPU ...


Question: How big is an elephant

Answer: Let's think step by step. Once you've figured out what the elephant is, there is no end to how big.

When we consider what the elephant can handle, a typical elephant will be between 300 to 400 pounds, depending on how heavy the animal is. If the animal weighs more than 300 pounds a mile or more, we would call it an elephant, or more like 5,000 pounds. If the animal weighs a bit over a thousand pounds, another type of elephant is said to be around 100,000 pounds


Now we create function to handle the requests to and from the gradio application

In [13]:
def create(question):
    from threading import Thread
    from transformers import TextIteratorStreamer
    import random
    ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

    ov_llm = HuggingFacePipeline.from_model_id(
        model_id="ov_model_dir",
        task="text-generation",
        backend="openvino",
        model_kwargs={"device": "CPU", "ov_config": ov_config},
        pipeline_kwargs={"max_new_tokens": random.randint(200,500)},
    )
    template = """Question: {question}

    Answer: Here is what I found useful\n"""
    prompt = PromptTemplate.from_template(template)

    chain = prompt | ov_llm

    return chain.invoke({"question": question})
    # streamer = TextIteratorStreamer(
    #     ov_llm.pipeline.tokenizer,
    #     timeout=30.0,
    #     skip_prompt=True,
    #     skip_special_tokens=True,
    # )
    # pipeline_kwargs = {"pipeline_kwargs": {"streamer": streamer, "max_new_tokens": 160}}
    # chain = prompt | ov_llm.bind(**pipeline_kwargs)

    # t1 = Thread(target=chain.invoke, args=({"question": question},))
    # t1.start()

    # s = ""
    # for new_text in streamer:
    #     s+=new_text
    # return s


def bot(input, history):
    history = history or []
    s = list(sum(history, ()))
    s.append(input)
    inp = ' '.join(s)
    output = create(input)
    history.append((input, output))
    return history, history

In [14]:
custom_css = """
@import url('https://fonts.googleapis.com/css2?family=Unbounded:wght@600&display=swap');
@import url('https://fonts.googleapis.com/css2?family=Unbounded&display=swap');
h1 {
    font-family: Unbounded;
    font-size: 35px;
}
.desc {
    font-family: Unbounded;
    font-size: 10px;
}
div {
    font-family: Unbounded;
    font-size: 21px;
}
.progress-text {
    font-family: Unbounded;
    font-size: 10px;
}
pre{
    white-space: pre-wrap;
    overflow-wrap: break-word;
}
"""

In [15]:
block = gr.Blocks(css=custom_css)
prompt = "The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly."

## Run Chatbot
Now, when model created, we can setup Chatbot interface using [Gradio](https://www.gradio.app/).

In [16]:
with block:
    gr.Markdown("""<h1><center><img src="https://prosza.000webhostapp.com/assets/logo-removebg.png" style="height:100px; width:100px;filter: invert(100%);">ZEN(2) ChatBot</center></h1>
    <div class='desc'><center>(This model is created using gpt2 and is running on the edge with INT8 precision; weight compression done using optimum-Intel/NNCF.)</center></div>
    """)
    chatbot = gr.Chatbot()
    message = gr.Textbox(placeholder=prompt)
    state = gr.State()
    submit = gr.Button("SEND")
    submit.click(bot, inputs=[message, state], outputs=[chatbot, state])

block.launch(debug = True)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Compiling the model to CPU ...
Compiling the model to CPU ...
Compiling the model to CPU ...
Compiling the model to CPU ...


Keyboard interruption in main thread... closing server.


