<img src="images/ragna-logo.png" width="200px" align="right"/>

# Use a Local LLM with Ragna

<hr>

## Create a new Ragna assistant

To use a local LLM in Ragna, we subclass the `ragna.core.Assistant` abstract base class. The only abstract method that we have to override is [`.answer()`](https://ragna.chat/en/stable/references/python-api/#ragna.core.Assistant.answer). It receives the user's `prompt` as well as the `sources` retrieved from the source storage. Inside it, we combine these two pieces of information into one large prompt for the LLM, start the generation, and `yield` the individual chunks back.

<details>
<summary> <b>Expand to read <code>local_llm.py</code> → </b></summary>

```python
from pathlib import Path
from typing import Iterator

from ragna.core import Assistant, PackageRequirement, Source

class Llama38BInstruct(Assistant):
    @classmethod
    def display_name(cls):
        return "turboderp/Llama-3-8B-Instruct-exl2"

    @classmethod
    def requirements(cls):
        return [
            PackageRequirement("torch"),
            PackageRequirement("exllamav2"),
        ]

    @classmethod
    def is_available(cls):
        requirements_available = super().is_available()
        if not requirements_available:
            return False

        import torch

        return torch.cuda.is_available()

    def __init__(self):
        super().__init__()
        from exllamav2 import (
            ExLlamaV2,
            ExLlamaV2Cache,
            ExLlamaV2Config,
            ExLlamaV2Tokenizer,
        )
        from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2StreamingGenerator

        config = ExLlamaV2Config()
        config.model_dir = str(Path.home() / "shared/analyst/models" / self.display_name())
        config.prepare()

        self.tokenizer = ExLlamaV2Tokenizer(config)

        model = ExLlamaV2(config)
        cache = ExLlamaV2Cache(model, lazy=True)
        model.load_autosplit(cache)
        self.generator = ExLlamaV2StreamingGenerator(model, cache, self.tokenizer)
        self.generator.set_stop_conditions({self.tokenizer.eos_token_id})

        self.settings = ExLlamaV2Sampler.Settings()
        self.settings.temperature = 0.0

    def _make_prompt(self, prompt: str, sources: list[Source]) -> str:
        return "\n".join(
            [
                "<|begin_of_text|><|start_header_id|>system<|end_header_id|>",
                "",
                "Answer the question based only on the following context:",
                *[source.content for source in sources],
                "<|eot_id|><|start_header_id|>user<|end_header_id|>",
                "",
                f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
            ]
        )

    def answer(
        self, prompt: str, sources: list[Source], *, max_new_tokens: int = 256
    ) -> Iterator[str]:
        input_ids = self.tokenizer.encode(
            self._make_prompt(prompt, sources), add_bos=False
        )

        self.generator.begin_stream_ex(input_ids, self.settings)

        for _ in range(max_new_tokens):
            result = self.generator.stream_ex()
            if result["eos"]:
                break
            yield result["chunk"]
```

</details>
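
The streaming loop in `.answer()` can be tried out without a GPU or model weights. The sketch below is only an illustration: `FakeGenerator` and `stream_answer` are hypothetical stand-ins that are not part of Ragna or `exllamav2`, and the fake generator only mimics the `"chunk"` and `"eos"` keys that the listing reads from `stream_ex()`.

```python
from typing import Iterator


class FakeGenerator:
    """Hypothetical stand-in for a streaming generator (not an exllamav2 class)."""

    def __init__(self, chunks: list[str]) -> None:
        self._chunks = iter(chunks)

    def stream_ex(self) -> dict:
        # Mimic the result dictionary used in the listing:
        # "chunk" holds the new text, "eos" signals the end of generation.
        chunk = next(self._chunks, None)
        if chunk is None:
            return {"chunk": "", "eos": True}
        return {"chunk": chunk, "eos": False}


def stream_answer(generator: FakeGenerator, max_new_tokens: int = 256) -> Iterator[str]:
    # Stop at end-of-sequence or when the budget is used up, whichever comes first.
    for _ in range(max_new_tokens):
        result = generator.stream_ex()
        if result["eos"]:
            break
        yield result["chunk"]


chunks = ["The ", "answer ", "is ", "42."]
full = list(stream_answer(FakeGenerator(chunks)))
capped = list(stream_answer(FakeGenerator(chunks), max_new_tokens=2))
```

Because `stream_answer` is a generator, the caller receives each chunk as soon as it is produced, and a small `max_new_tokens` cuts the answer short even if the model has not emitted its end-of-sequence token.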


Note that apart from the prompt generation in `._make_prompt()`, the code is model-agnostic and works with any LLM for which `exl2` weights are available.
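
To illustrate how little would have to change for another model, here is a minimal sketch that isolates the prompt template as a plain function. The names `llama3_prompt`, `chatml_prompt`, and `build_prompt` are hypothetical and not part of Ragna. The ChatML-style layout is shown only as an example of a different template; check the model card of the model you use for its actual format.

```python
from typing import Callable

SYSTEM_INSTRUCTION = "Answer the question based only on the following context:"

# A template takes the user prompt and the source contents and returns one string.
PromptTemplate = Callable[[str, list[str]], str]


def llama3_prompt(prompt: str, contents: list[str]) -> str:
    # Same layout as ._make_prompt() in local_llm.py.
    return "\n".join(
        [
            "<|begin_of_text|><|start_header_id|>system<|end_header_id|>",
            "",
            SYSTEM_INSTRUCTION,
            *contents,
            "<|eot_id|><|start_header_id|>user<|end_header_id|>",
            "",
            f"{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
        ]
    )


def chatml_prompt(prompt: str, contents: list[str]) -> str:
    # ChatML-style layout, used here only to show a second template.
    return "\n".join(
        [
            "<|im_start|>system",
            SYSTEM_INSTRUCTION,
            *contents,
            "<|im_end|>",
            "<|im_start|>user",
            f"{prompt}<|im_end|>",
            "<|im_start|>assistant",
            "",
        ]
    )


def build_prompt(template: PromptTemplate, prompt: str, contents: list[str]) -> str:
    # The rest of the assistant only needs the final string, not the template details.
    return template(prompt, contents)


question = "Who is the Python Developer in Residence?"
llama3_text = build_prompt(llama3_prompt, question, ["first source", "second source"])
chatml_text = build_prompt(chatml_prompt, question, ["first source", "second source"])
```

In an `Assistant` subclass, the same idea amounts to overriding `._make_prompt()` and leaving `.answer()` untouched.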

## Use the assistant

In [None]:
from local_llm import Llama38BInstruct

In [None]:
Llama38BInstruct.display_name()

In [None]:
Llama38BInstruct.is_available()

In [None]:
from ragna import Rag, source_storages

documents = [
    "files/psf-report-2021.pdf",
    "files/psf-report-2022.pdf",
]

chat = Rag().chat(
    documents=documents,
    source_storage=source_storages.Chroma,
    assistant=Llama38BInstruct,
)

await chat.prepare();

In [None]:
message = await chat.answer("Who is the Python Developer in Residence?", stream=True)

async for chunk in message:
    print(chunk, end="")
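
The cells above rely on Jupyter's support for top-level `await`. In a plain Python script, the same consumption pattern has to run inside a coroutine started with `asyncio.run()`. The following sketch shows that pattern with a hypothetical `fake_message()` async generator standing in for the streamed message, so it runs without Ragna or a model.

```python
import asyncio
from typing import AsyncIterator


async def fake_message() -> AsyncIterator[str]:
    # Hypothetical stand-in for a streamed message: yields chunks one at a time.
    for chunk in ["Chunks ", "arrive ", "incrementally."]:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(message: AsyncIterator[str]) -> str:
    # Consume the stream chunk by chunk and assemble the full text.
    parts = []
    async for chunk in message:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect(fake_message()))
```

Collecting the chunks into a string is useful when the full answer is needed afterwards, whereas printing each chunk immediately, as in the cell above, gives faster feedback to the user.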

<hr>

_❗️ **Warning:** Make sure to stop the Jupyter Kernel (in the JupyterLab Menu Bar, click on "Kernel" -> "Interrupt Kernel") before proceeding._

<br>

**✨ Next: [Basics of RAG-powered chat app](02-rag-basics.ipynb) →**

<hr>