TINY_LLAMA2_CHATBOT USING OPENVINO TOOLKIT

Installing the necessary packages

In [1]:
!pip install openvino
!pip install optimum[openvino]
!pip install transformers
!pip install nncf
!pip install gradio

Collecting openvino
  Downloading openvino-2024.2.0-15519-cp310-cp310-manylinux2014_x86_64.whl (38.7 MB)
Collecting openvino-telemetry>=2023.2.1 (from openvino)
  Downloading openvino_telemetry-2024.1.0-py3-none-any.whl (23 kB)
Installing collected packages: openvino-telemetry, openvino
Successfully installed openvino-2024.2.0 openvino-telemetry-2024.1.0
Collecting optimum[openvino]
  Downloading optimum-1.21.2-py3-none-any.whl (424 kB)
Collecting coloredlogs (from optimum[openvino])
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting datasets (from optimum[openvino])
  Downloading datasets-2.20.0-py3-n

In [2]:
# Quote the requirement: unquoted, the shell treats ">" as a redirect to a file named "=4.34"
!pip install "transformers>=4.34"


In [3]:
!pip install accelerate==0.21.0
!pip install transformers torch

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0
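
Because the packages above are installed across several cells, some pinned and some not, it can help to confirm which versions pip actually resolved before going further. The following is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Packages installed by the cells above.
PACKAGES = ["openvino", "optimum", "transformers", "nncf", "gradio", "accelerate", "torch"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14}{version or 'not installed'}")
```

If a package shows up as "not installed" or with an unexpected version, restarting the runtime after the install cells is usually the first thing to try.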


Selecting the model for inference from Hugging Face

In [4]:
from huggingface_hub import notebook_login, whoami

try:
    whoami()
    print('Authorization token already provided')
except OSError:
    notebook_login()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Loading the model directly from Hugging Face

tiny-llama-1b-chat: a chat model fine-tuned from TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3 trillion tokens, using the same architecture and tokenizer as Llama 2, so TinyLlama can be dropped into many open-source projects built on Llama. With only 1.1B parameters, it also suits applications that have a restricted compute and memory budget.
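
To make the "restricted memory footprint" point concrete, the sketch below estimates the size of the raw weights at a few storage precisions. It assumes the rounded 1.1B parameter count quoted above and ignores everything except the weights themselves (no activations, KV cache, or file metadata), so the figures are ballpark values only.

```python
# Parameter count stated in the model description above (about 1.1 billion).
NUM_PARAMS = 1.1e9

# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}

def weight_size_gb(num_params, bits):
    """Size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits / 8 / 1e9

for precision, bits in BITS_PER_WEIGHT.items():
    print(f"{precision:<10}{weight_size_gb(NUM_PARAMS, bits):5.2f} GB")
```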

In [5]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")


In [None]:
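# Shell form of the export that the next cell runs through subprocess:
# optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --task text-generation-with-past /content/model/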

Converting the model to OpenVINO IR format using the Optimum-CLI tool

In [6]:
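import subprocess

# Export the model to OpenVINO IR; check=True raises CalledProcessError if the export fails
subprocess.run([
    "optimum-cli", "export", "openvino",
    "--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "--task", "text-generation-with-past",
    "/content/model/"
], check=True)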

CompletedProcess(args=['optimum-cli', 'export', 'openvino', '--model', 'TinyLlama/TinyLlama-1.1B-Chat-v1.0', '--task', 'text-generation-with-past', '/content/model/'], returncode=0)
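
A return code of 0 says the tool exited cleanly, but it can still be worth checking that the IR files are where later cells expect them. The sketch below assumes the export writes `openvino_model.xml` and `openvino_model.bin`, which are the file names the INT4 cells further down look for; `missing_ir_files` is a hypothetical helper name.

```python
from pathlib import Path

# File names the export is expected to write; the later cells in this notebook
# look for the same two names in the INT4 directory.
IR_FILES = ("openvino_model.xml", "openvino_model.bin")

def missing_ir_files(model_dir):
    """Return the expected IR files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in IR_FILES if not (model_dir / name).is_file()]

missing = missing_ir_files("/content/model")
print("Export looks complete" if not missing else f"Missing files: {missing}")
```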

In [None]:
# Save a local copy of the original PyTorch model loaded above (this is not the OpenVINO IR)
model.save_pretrained("./models/optimum_model")

Example of the model response

In [8]:
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])


<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
It is impossible to have a human who can eat an infinite amount of helicopters without getting hungry. The number of helicopters that a human can eat in one sitting depends on their appetite and the amount of food they consume in one sitting. However, studies have shown that some people can eat up to 300 helicopters in a single sitting. It is recommended that a human consume a meal that is equivalent to 1-2 servings of helicopters before consuming anything else.
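
The printed prompt above shows the layout that `apply_chat_template` produced for this model: each turn is wrapped as `<|role|>`, the content, and a closing `</s>`, followed by an open `<|assistant|>` turn. The sketch below reproduces that layout in plain Python for illustration only. The real template is stored with the tokenizer and may differ in details, so `format_chat_prompt` is a hypothetical stand-in, not a replacement for `apply_chat_template`.

```python
def format_chat_prompt(messages, add_generation_prompt=True):
    """Render messages in the <|role|> ... </s> layout seen in the output above."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        # An open assistant turn tells the model to answer next.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(format_chat_prompt(messages))
```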


Compressing the model weights using Optimum-CLI

In [9]:
from pathlib import Path
from IPython.display import Markdown, display

# Define the model name and compression parameters
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
int4_model_dir = Path(model_name) / "INT4_compressed_weights"

def convert_to_int4():
    compression_configs = {
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0": {
            "sym": True,
            "group_size": 128,
            "ratio": 0.8,
        },
        "default": {
            "sym": False,
            "group_size": 128,
            "ratio": 0.8,
        },
    }

    model_compression_params = compression_configs.get(model_name, compression_configs["default"])
    if (int4_model_dir / "openvino_model.xml").exists():
        return
    remote_code = False  # Assuming remote_code is False for simplicity
    export_command_base = f"optimum-cli export openvino --model {model_name} --task text-generation-with-past --weight-format int4"
    int4_compression_args = f" --group-size {model_compression_params['group_size']} --ratio {model_compression_params['ratio']}"
    if model_compression_params["sym"]:
        int4_compression_args += " --sym"
    export_command_base += int4_compression_args
    if remote_code:
        export_command_base += " --trust-remote-code"
    export_command = export_command_base + " " + str(int4_model_dir)
    display(Markdown("**Export command:**"))
    display(Markdown(f"`{export_command}`"))
    ! $export_command

# Perform int4 compression
convert_to_int4()


**Export command:**

`optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.8 --sym TinyLlama/TinyLlama-1.1B-Chat-v1.0/INT4_compressed_weights`

2024-07-05 20:17:45.791632: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-05 20:17:45.791696: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-05 20:17:45.908713: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Framework not specified. Using pt to export the model.
Using framework PyTorch: 2.3.0+cu121
Overriding 1 configuration item(s)
	- use_cache -> True
  if sequence_length != 1:
Mixed-Precision assignment 100% 154/154 • 0:00:18

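The export cell above assembles its command as one string and runs it with the `!` shell escape, which only works inside IPython. As a sketch of an alternative, the same command can be built as an argument list and handed to `subprocess.run`. Only flags that already appear in the export command above are used; `build_int4_export_command` is an illustrative helper name.

```python
import shlex

def build_int4_export_command(model_id, output_dir, group_size=128, ratio=0.8, sym=True):
    """Return the optimum-cli export command as a list of arguments."""
    command = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--task", "text-generation-with-past",
        "--weight-format", "int4",
        "--group-size", str(group_size),
        "--ratio", str(ratio),
    ]
    if sym:
        command.append("--sym")
    command.append(str(output_dir))
    return command

command = build_int4_export_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0/INT4_compressed_weights",
)
print(shlex.join(command))
# To run it outside IPython: subprocess.run(command, check=True)
```

Passing a list avoids shell quoting issues if a path ever contains spaces, and `check=True` turns a failed export into an exception.
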
In [10]:
int4_weights = int4_model_dir / "openvino_model.bin"

if int4_weights.exists():
    print(f"Size of model with INT4 compressed weights is {int4_weights.stat().st_size / 1024 / 1024:.2f} MB")

Size of model with INT4 compressed weights is 693.31 MB
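
As a rough sanity check on that number, the sketch below converts the measured file size into average bits per weight and compares it with a back-of-the-envelope estimate. The estimate rests on simplifying assumptions that the notebook does not confirm: that `--ratio 0.8` means 80% of the weights are stored as INT4 and the rest as INT8, that each INT4 group of 128 weights carries one 16-bit scale, and that the model has exactly 1.1e9 parameters.

```python
NUM_PARAMS = 1.1e9            # rounded parameter count from the model description
MEASURED_MB = 693.31          # size printed by the cell above (1 MB = 1024**2 bytes)
RATIO, GROUP_SIZE = 0.8, 128  # values passed to --ratio and --group-size

def effective_bits_per_weight(size_mb, num_params):
    """Average number of stored bits per parameter for a weights file."""
    return size_mb * 1024**2 * 8 / num_params

def estimated_bits_per_weight(ratio, group_size, scale_bits=16):
    """Rough estimate: `ratio` of the weights in INT4 with one scale per group, the rest in INT8."""
    int4_bits = 4 + scale_bits / group_size
    return ratio * int4_bits + (1 - ratio) * 8

print(f"measured:  {effective_bits_per_weight(MEASURED_MB, NUM_PARAMS):.2f} bits/weight")
print(f"estimated: {estimated_bits_per_weight(RATIO, GROUP_SIZE):.2f} bits/weight")
```

The two figures land in the same range, around five bits per weight. The gap between them is expected from a model this simple, since the rounded parameter count and any extra metadata in the file are not accounted for.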


In [11]:
import openvino as ov
import ipywidgets as widgets

# Create an instance of the OpenVINO Core class
core = ov.Core()

# Get the list of available devices
support_devices = core.available_devices

# If "NPU" is in the list of available devices, remove it
if "NPU" in support_devices:
    support_devices.remove("NPU")

# Default device set to CPU
default_device = "CPU"

# Create the device dropdown
device = widgets.Dropdown(
    options=support_devices + ["AUTO"],
    value=default_device,
    description="Device:",
    disabled=False,
)

device


Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')

In [12]:
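available_models = []
if int4_model_dir.exists():
    available_models.append("INT4")

# Fail early with a clear message instead of an IndexError on available_models[0] below
if not available_models:
    raise FileNotFoundError(f"No exported model found in {int4_model_dir}; run the INT4 export first")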

model_to_run = widgets.Dropdown(
    options=available_models,
    value=available_models[0],
    description="Model to run:",
    disabled=False,
)

model_to_run

Dropdown(description='Model to run:', options=('INT4',), value='INT4')
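
The Gradio cell that follows calls `model.generate`, but no cell in this notebook loads the INT4 export, so `model` would still be the PyTorch model from the earlier loading cell. One way to load the compressed model is sketched below, using `OVModelForCausalLM` from Optimum Intel. It assumes the export directory also contains the tokenizer files; if it does not, the tokenizer can be loaded from the original model ID instead. `load_int4_model` is a hypothetical helper name.

```python
from pathlib import Path

def load_int4_model(model_dir, device="CPU"):
    """Load the exported OpenVINO model and its tokenizer from model_dir."""
    model_dir = Path(model_dir)
    if not (model_dir / "openvino_model.xml").exists():
        raise FileNotFoundError(f"No openvino_model.xml in {model_dir}; run the export first")

    # Imported lazily so the path check above works without these packages.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = OVModelForCausalLM.from_pretrained(model_dir, device=device)
    return model, tokenizer

# In this notebook the call would look like:
# model, tokenizer = load_int4_model(int4_model_dir, device.value)
```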

Creating an interface using gradio

In [13]:
import torch
from threading import Event, Thread
from uuid import uuid4
from typing import List, Tuple
import gradio as gr
from transformers import StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

# `model` and `tokenizer` are the objects created in the earlier loading cell, so as
# written the app runs the original PyTorch weights. To serve the compressed model,
# load the INT4 export into `model` before running this cell.

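model_configuration = {
    "model_id": "TinyLlama-1.1B-Chat-v1.0",
    "start_message": "Hello! How can I assist you today?",
    "history_template": None,
    "current_message_template": None,
    # "</s>" is the end-of-sequence token visible in the model output above;
    # "<EOS>" is not in the vocabulary and would resolve to the unknown-token id.
    "stop_tokens": ["</s>"],
    "tokenizer_kwargs": {},
    "model_language": "English",
}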

model_language = model_configuration["model_language"]
model_name = model_configuration["model_id"]
start_message = model_configuration["start_message"]
history_template = model_configuration.get("history_template")
current_message_template = model_configuration.get("current_message_template")
stop_tokens = model_configuration.get("stop_tokens")
tokenizer_kwargs = model_configuration.get("tokenizer_kwargs", {})

examples = {
    # "Chinese": [
    #     ["你好!"],
    #     ["你是谁?"],
    #     ["请介绍一下上海"],
    #     ["请介绍一下英特尔公司"],
    #     ["晚上睡不着怎么办？"],
    #     ["给我讲一个年轻人奋斗创业最终取得成功的故事。"],
    #     ["给这个故事起一个标题。"],
    # ],
    "English": [
        ["Hello there! How are you doing?"],
        ["What is OpenVINO?"],
        ["Who are you?"],
        ["Can you explain to me briefly what is Python programming language?"],
        ["What are some common mistakes to avoid when writing code?"],
        ["Write a 100-word blog post on “Benefits of Artificial Intelligence and OpenVINO”"],
    ],
    # "Japanese": [
    #     ["こんにちは！調子はどうですか?"],
    #     ["OpenVINOとは何ですか?"],
    #     ["あなたは誰ですか?"],
    #     ["Pythonプログラミング言語とは何か簡単に説明してもらえますか?"],
    #     ["シンデレラのあらすじを一文で説明してください。"],
    #     ["コードを書くときに避けるべきよくある間違いは何ですか?"],
    #     ["人工知能と「OpenVINOの利点」について100語程度のブログ記事を書いてください。"],
    # ],
}[model_language]

max_new_tokens = 256

class StopOnTokens(StoppingCriteria):
    def __init__(self, stop_token_ids: List[int]):
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_id in self.stop_token_ids:
            if stop_id in input_ids[0][-len(self.stop_token_ids):].tolist():
                return True
        return False
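
The check in `StopOnTokens.__call__` only looks at a short window at the end of the sequence: the last `len(stop_token_ids)` generated ids. The plain-Python sketch below mirrors that logic on ordinary lists so the window behavior is easy to see. The token ids are made up for illustration, with 2 standing in for an end-of-sequence id.

```python
def hits_stop_token(generated_ids, stop_token_ids):
    """True if any stop id appears among the last len(stop_token_ids) generated ids."""
    window = generated_ids[-len(stop_token_ids):]
    return any(stop_id in window for stop_id in stop_token_ids)

# Hypothetical ids for illustration; 2 stands in for an end-of-sequence token.
print(hits_stop_token([15, 27, 2], stop_token_ids=[2]))  # stop id is the newest token
print(hits_stop_token([15, 2, 27], stop_token_ids=[2]))  # stop id is outside the window
```

Because the criterion is evaluated after every generated token, a stop token is normally caught on the step where it is produced, which is why a window of this size is enough.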

def user_input(message, history):
    return "", history + [[message, None]]

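def bot_response(history, temperature, top_p, top_k, repetition_penalty, conversation_id):
    stop_token_ids = tokenizer.convert_tokens_to_ids(stop_tokens)
    stopping_criteria = StoppingCriteriaList([StopOnTokens(stop_token_ids)])
    streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)

    # Rebuild the whole conversation so the model sees earlier turns, and let the
    # tokenizer's chat template add the role markers the model was fine-tuned on.
    messages = []
    for user_text, bot_text in history:
        messages.append({"role": "user", "content": user_text})
        if bot_text:
            messages.append({"role": "assistant", "content": bot_text})
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Generate in a background thread; the streamer hands text back as it is produced.
    thread = Thread(target=model.generate, kwargs={
        'inputs': input_ids,
        'max_new_tokens': max_new_tokens,
        'do_sample': temperature > 0.0,  # the sampling sliders have no effect without do_sample
        'temperature': temperature,
        'top_p': top_p,
        'top_k': top_k,
        'repetition_penalty': repetition_penalty,
        'stopping_criteria': stopping_criteria,
        'streamer': streamer,
    })
    thread.start()

    # Yield the growing reply so the chat window updates token by token.
    response = ""
    for new_text in streamer:
        response += new_text
        history[-1][1] = response
        yield history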

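# Create the event up front; without this, request_cancel raises NameError on first use.
stop_thread = Event()

def request_cancel():
    global stop_thread
    stop_thread.set()
    stop_thread = Event()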
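
In this cell nothing reads `stop_thread`, so pressing Stop cancels the Gradio event but does not by itself interrupt a running `model.generate` call. The sketch below shows the general pattern with only the standard library: a worker thread produces tokens, the consumer yields the growing text, and a shared `Event` lets the worker stop early. `fake_generate` is a stand-in for the model; in the real app the event would be checked inside a stopping criterion instead.

```python
import queue
from threading import Event, Thread

def fake_generate(tokens, out_queue, stop_event):
    """Stand-in for model.generate: emits tokens until finished or cancelled."""
    for token in tokens:
        if stop_event.is_set():
            break
        out_queue.put(token)
    out_queue.put(None)  # sentinel: generation is over

def stream_response(tokens, stop_event):
    """Yield the growing response text, as bot_response does with its streamer."""
    out_queue = queue.Queue()
    Thread(target=fake_generate, args=(tokens, out_queue, stop_event)).start()
    response = ""
    while (token := out_queue.get()) is not None:
        response += token
        yield response

print(list(stream_response(["Ahoy", ", ", "matey"], Event())))

# An event that is already set stops the worker before it emits anything.
cancelled = Event()
cancelled.set()
print(list(stream_response(["Ahoy", ", ", "matey"], cancelled)))
```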

with gr.Blocks() as demo:
    conversation_id = gr.State(str(uuid4()))
    gr.Markdown(f"<h1 align='center'>{model_name}</h1>")
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    with gr.Row():
        submit = gr.Button("Submit")
        stop = gr.Button("Stop")
        clear = gr.Button("Clear")

    with gr.Accordion("Advanced Options:", open=False):
        with gr.Column():
            with gr.Row():
                temperature = gr.Slider(label="Temperature", value=0.7, minimum=0.0, maximum=1.0, step=0.1)
                top_p = gr.Slider(label="Top-p (nucleus sampling)", value=0.95, minimum=0.0, maximum=1.0, step=0.01)
                top_k = gr.Slider(label="Top-k", value=50, minimum=1, maximum=100, step=1)
            with gr.Row():
                repetition_penalty = gr.Slider(label="Repetition Penalty", value=1.1, minimum=1.0, maximum=2.0, step=0.1, info="Penalize repetition — 1.0 to disable.")

    gr.Examples(examples, inputs=msg, label="Click on any example and press the 'Submit' button")

    submit_event = msg.submit(fn=user_input, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False).then(
        fn=bot_response, inputs=[chatbot, temperature, top_p, top_k, repetition_penalty, conversation_id], outputs=chatbot, queue=True)
    submit_click_event = submit.click(fn=user_input, inputs=[msg, chatbot], outputs=[msg, chatbot], queue=False).then(
        fn=bot_response, inputs=[chatbot, temperature, top_p, top_k, repetition_penalty, conversation_id], outputs=chatbot, queue=True)
    stop.click(fn=request_cancel, inputs=None, outputs=None, cancels=[submit_event, submit_click_event], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

# To launch the interface
# demo.launch(server_name='your_server_name', server_port=your_server_port)
# For sharing
# demo.launch(share=True)

demo.launch()



Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://8c4cc7852857496f36.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


