# Local LLMs with Panel

In the [previous notebook](04-panel-intro.ipynb), `pn.chat.ChatInterface` was introduced with a callback that simply echoed the sent message.

In this section, we will make it much more interesting by connecting a local LLM, specifically Llama-3 from earlier.

## ExLlama2

### Initialize

Let's first initialize the model.

In [None]:
from local_llm import Llama38BInstruct
from ragna import Rag, source_storages
import panel as pn
pn.extension()

documents = [
    "files/psf-report-2021.pdf",
    "files/psf-report-2022.pdf",
]

chat = Rag().chat(
    documents=documents,
    source_storage=source_storages.Chroma,
    assistant=Llama38BInstruct,
)

await chat.prepare();

### Migrate

We can first do a test run to see if it works with the example from before.

In [None]:
message = await chat.answer("Who is the Python Developer in Residence?", stream=True)

async for chunk in message:
    print(chunk, end="")

Now, let's migrate this functionality into `pn.chat.ChatInterface` with a callback.

To do this, we copy and paste the prior cell's code into a function, and then:

1. prefix the `def` with `async` to make it async
2. replace the hard-coded string with `contents`
3. concatenate the chunks into a `response` string
4. yield the `response`

In [None]:
async def reply(contents, user, instance):
    message = await chat.answer(contents, stream=True)

    response = ""
    async for chunk in message:
        response += chunk
        yield response

chat_interface = pn.chat.ChatInterface(callback=reply)
chat_interface

Now try entering "Who is the Python Developer in Residence?" into the chat. It should give you a response similar to the one from before!
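To see why the callback yields the growing `response` rather than each individual chunk, here is a minimal sketch that swaps the model for a hypothetical `fake_stream` generator (a stand-in invented for illustration, not part of Ragna or Panel). As we understand it, `ChatInterface` replaces the displayed message with each yielded value, so yielding the accumulated string is what produces the typing effect.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streamed model answer: yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop, like real I/O would
        yield word + " "


async def reply(contents, user=None, instance=None):
    """Same shape as the callback above, but fed by fake_stream instead of a model."""
    response = ""
    async for chunk in fake_stream(contents):
        response += chunk
        yield response  # each yield is the full text so far, not just the new chunk


async def collect(contents):
    """Gather every value the callback yields so we can inspect them."""
    return [snapshot async for snapshot in reply(contents)]


snapshots = asyncio.run(collect("Hello from Panel"))
for snapshot in snapshots:
    print(repr(snapshot))
```

Each printed line is longer than the previous one. Inside a notebook, where an event loop is already running, use `await collect("Hello from Panel")` instead of `asyncio.run(...)`.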

## LlamaCpp

For completeness, we can use `llama-cpp-python` for quantized models too!

`llama-cpp` can run on both CPU and GPU, and has an API that mimics OpenAI's API.

Personally, I use it because I don't have any spare GPUs lying around and it runs extremely well on my local Mac M2 Pro! It also handles chat template formats internally, so it's just a matter of specifying the proper `chat_format` key.

Here, we:
1. download the quantized model in GGUF format (if it doesn't exist already)
2. instantiate the model, checking the cache first
3. serialize all messages into the `transformers` format (new)
4. call the OpenAI-like chat completion API on the messages
5. stream the chunks

In [None]:
from pathlib import Path

import llama_cpp
import panel as pn
from huggingface_hub import hf_hub_download
pn.extension()

model_path = hf_hub_download(
    "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    local_dir=str(Path.home() / "shared/pycon/models")
)  # 1.

# 2.
if model_path in pn.state.cache:
    llama = pn.state.cache[model_path]
else:
    llama = llama_cpp.Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="mistral-instruct",
        n_ctx=2048,
        logits_all=True,
        verbose=False,
    )
    pn.state.cache[model_path] = llama

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = instance.serialize()  # 3.
    message = llama.create_chat_completion_openai_v1(messages=messages, stream=True)  # 4.

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response  # 5.

chat_interface = pn.chat.ChatInterface(callback=reply)
chat_interface
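The `or ""` in the callback guards against chunks that carry no text. In OpenAI-style streams, some chunks (typically the first, which announces the role, and the last, which signals the end) may have `delta.content` set to `None`. The sketch below illustrates this with hypothetical stand-in chunks built from `SimpleNamespace`; they only mimic the attribute path and are not real `llama_cpp` objects.

```python
from types import SimpleNamespace


def make_chunk(content):
    """Build a hypothetical stand-in for one streamed chunk (same attribute path only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Assumed stream: the first and last chunks carry no text, so their content is None.
fake_chunks = [
    make_chunk(None),
    make_chunk("Hello"),
    make_chunk(", world"),
    make_chunk(None),
]

response = ""
for chunk in fake_chunks:
    part = chunk.choices[0].delta.content or ""  # None becomes an empty string
    response += part

print(response)
```

Without the fallback, `response += None` would raise a `TypeError` on the first empty chunk.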

We can even give the model a personality by setting a system message!

Update the callback with a system message.

Note that Mistral Instruct does NOT support the `system` role, so we use `user` instead.

In [None]:
system_message = "You are an excessively passionate Pythonista."

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = [
        {"role": "user", "content": system_message}  # updated here
    ] + instance.serialize()
    message = llama.create_chat_completion_openai_v1(messages=messages, stream=True)

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response

chat_interface.callback = reply
chat_interface
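To make the shape of `messages` concrete, here is a small sketch without Panel or a model. The `history` list is a hypothetical example, assumed to match the role/content dictionaries that `instance.serialize()` returns, and `with_persona` is a helper name invented for illustration.

```python
system_message = "You are an excessively passionate Pythonista."

# Hypothetical chat history, assumed to mirror the output of instance.serialize().
history = [
    {"role": "user", "content": "What is a list comprehension?"},
]


def with_persona(persona, history):
    """Return a new list with the persona prepended as a `user` message."""
    return [{"role": "user", "content": persona}] + history


messages = with_persona(system_message, history)
for message in messages:
    print(message["role"], "->", message["content"])
```

Because the helper builds a new list, the original history is left untouched, so the persona is applied on every call rather than being stored in the chat itself.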

### Challenge

Your turn! Try combining everything you've learned to customize the personality of the chatbot on the fly!

Again, replace the ellipses with the appropriate code snippets!

In [None]:
from pathlib import Path  # needed for Path.home() below

import llama_cpp
import panel as pn
from pydantic import BaseModel
from huggingface_hub import hf_hub_download

pn.extension()

model_path = hf_hub_download(
    "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    "mistral-7b-instruct-v0.2.Q5_K_M.gguf",
    local_dir=str(Path.home() / "shared/analyst/models")
)

if model_path in pn.state.cache:
    llama = pn.state.cache[model_path]
else:
    llama = llama_cpp.Llama(
        model_path=model_path,
        n_gpu_layers=-1,
        chat_format="mistral-instruct",
        n_ctx=2048,
        logits_all=True,
        verbose=False,
    )
    pn.state.cache[model_path] = llama

def reply(contents: str, user: str, instance: pn.chat.ChatInterface):
    messages = [
        {"role": "user", "content": ...}  # Fill this out
    ] + instance.serialize()
    message = llama.create_chat_completion_openai_v1(
        messages=messages, stream=True
    )

    response = ""
    for chunk in message:
        part = chunk.choices[0].delta.content or ""
        response += part
        yield response


system_input = ...  # Fill this out
chat_interface = pn.chat.ChatInterface(callback=reply, min_height=350)
layout = pn.Column(
    system_input,
    chat_interface,
)
layout

That's all for now. Click [here](https://holoviz-topics.github.io/panel-chat-examples/) to see more on how you can integrate `pn.chat.ChatInterface` with other services!

Again, if you want to ask questions, there is also a HoloViz Discourse [here](https://discourse.holoviz.org/).

<hr>

_❗️ **Warning:** Make sure to stop the Jupyter Kernel (in the JupyterLab Menu Bar, click on "Kernel" -> "Shut down Kernel") before proceeding to prevent the "insufficient VRAM" error._

<br>

**✨ Next: [UI and Experiments](06-UI-and-experiments.ipynb) →**


💬 _Wish to continue discussions after the tutorial? Contact the presenters: [@pavithraes](https://github.com/pavithraes), [@dharhas](https://github.com/dharhas), [@ahuang11](https://github.com/ahuang11)_

<hr>