<img src="images/ragna-logo.png" width="200px" align="right"/>

# Set up an offline Large Language Model

<hr>

## What is a Large Language Model (LLM)?

A "language model" is a machine learning model designed to understand and generate (predict) natural language. For example, auto-completion of text in input fields often uses language models.
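To make "predicting language" concrete, here is a toy sketch of auto-completion that only counts which word tends to follow which. It is an illustration under simplifying assumptions (a tiny made-up corpus, whitespace tokenization, hypothetical function names) and is far simpler than the models discussed in the rest of this tutorial.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def suggest_next_word(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = followers.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up training text, for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_model(corpus)

# "the" is followed by "cat" twice and by "mat" and "fish" once each.
print(suggest_next_word(model, "the"))
```

An LLM does the same job of proposing a likely continuation, but it conditions on the whole preceding text rather than on a single word.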

A "large language model" is a language model based on the [Transformer architecture](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)), trained on large amounts of (general) data and consisting of several billion parameters. With this scale and complexity, LLMs are capable of various text processing and generation tasks, such as document summarization, answering common questions, and text-based content creation.

Popular LLMs include OpenAI's GPT, Google's Gemini, Anthropic's Claude, and more.

## What is a "local" or "offline" LLM?

Large Language Models (LLMs) like OpenAI's GPT are proprietary and can only be accessed through the OpenAI API or services like ChatGPT. While easy to use, these hosted models raise concerns about data privacy, vendor lock-in, and cost.

Offline, local, or open-weight LLMs are models whose weights you can download and self-host on your own computers.

Today, we're running the model on a cloud platform, but each of you has access to what is essentially an individual machine. This allows us to have a standard tutorial environment.

## LLM: Llama3

In this tutorial, we'll use the [Llama3-8B model](https://ai.meta.com/blog/meta-llama-3/), which Meta released with openly downloadable weights under the Meta Llama 3 Community License.

### Quantization

> Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
> 
> Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.
> 
> ~ [Hugging Face Documentation](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
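As a rough illustration of the idea in the quote, the sketch below quantizes a handful of float weights to 8-bit integers with a single shared scale factor (often called "absmax" quantization) and then converts them back. The function names and sample weights are made up for this example, and this is not how the `exl2` format computes its weights; it only shows the round trip and the small error it introduces.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.12, -0.50, 0.33, 1.27, -1.00]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

for original, q, approx in zip(weights, quantized, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the price paid for storing one byte per weight instead of four.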


In its original float32 representation, an LLM needs roughly 4 GB of VRAM for each billion parameters, because every parameter occupies 4 bytes. For example, Llama3-8B with its roughly 8 billion parameters needs about 32 GB just to load the weights.

By quantizing the float32 representation into a lower number `n` of bits per weight (bpw), this can be drastically reduced to `n / 8` GB of VRAM for each billion parameters. For example, with 6 bpw, which is what we are going to use, Llama3-8B needs only about `6 / 8 * 8 = 6` GB to load the weights.
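The arithmetic above can be sketched in a few lines of Python. The helper name is hypothetical, and the estimate covers the weights only, assuming 1 GB = 10^9 bytes; the cache and activations used during inference need additional memory on top of this.

```python
def weight_memory_gb(parameters_in_billions, bits_per_weight):
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return parameters_in_billions * bytes_per_weight


# Roughly 8 billion parameters, as for Llama3-8B.
for bpw in (32, 16, 8, 6, 4):
    print(f"{bpw:>2} bpw: ~{weight_memory_gb(8, bpw):.0f} GB")
```

Going from 32 bpw to 6 bpw shrinks the estimate from about 32 GB to about 6 GB, which is what makes the model fit on a single consumer-class GPU.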

There are a number of quantization schemes / file formats (`exl2`, `gptq`, `gguf`, `awq`) and libraries (Exllamav2, llama.cpp, AutoGPTQ) to create and use quantized weights. For this tutorial we are going to use Exllamav2 with the corresponding `exl2` weights.

## Exllamav2

`exllamav2` is a quantization and inference library.

We have already downloaded the quantized Llama3-8B weights into the `shared/pycon/models` directory, available from the root of your Nebari file system.

### Side note: Instructions for local users 💻

To download and use the model on your local computer (i.e., outside this tutorial at PyCon using Nebari):

1. Install `Exllamav2` with the [instructions in the project repository](https://github.com/turboderp/exllamav2#installation).
2. In your local terminal, run `git lfs install`, then `git clone https://huggingface.co/turboderp/Llama-3-8B-Instruct-exl2`, and change into the cloned directory with `cd Llama-3-8B-Instruct-exl2`
3. View all branches: `git branch --all`
4. Check out the branch with the relevant weights: `git checkout remotes/origin/6.0bpw`
5. Note the model directory path, and use it in the inference and Ragna scripts
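Before pointing the inference script at a local checkout, a quick sanity check of the directory can save a confusing load error. The sketch below is a hypothetical helper, not part of Exllamav2, and it assumes the checked-out branch contains a `config.json` and at least one `*.safetensors` weight file; adjust the expected file names if your checkout differs.

```python
from pathlib import Path


def find_model_directory_problems(model_directory):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    model_directory = Path(model_directory)
    if not model_directory.is_dir():
        return [f"{model_directory} is not a directory"]

    problems = []
    # Assumption: the model configuration lives in config.json.
    if not (model_directory / "config.json").is_file():
        problems.append("config.json not found")
    # Assumption: the quantized weights are stored as *.safetensors files.
    if not any(model_directory.glob("*.safetensors")):
        problems.append("no *.safetensors weight file found")
    return problems


# Hypothetical location; replace it with the path you noted in step 5.
local_model_directory = Path.home() / "models" / "Llama-3-8B-Instruct-exl2"
print(find_model_directory_problems(local_model_directory) or "Model directory looks complete")
```

If the weight file is reported missing right after cloning, the branch with the weights may not have been checked out yet.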

## Use Llama3 for inference

Let's run the [example inference script](https://github.com/turboderp/exllamav2/blob/master/examples/inference.py).

### Imports

In [None]:
import time
from pathlib import Path

import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

### Initialize model and cache

In [None]:
models_directory = Path.home() / "shared/pycon/models"
model_directory = models_directory / "turboderp/Llama-3-8B-Instruct-exl2"

print(f"Loading model: {model_directory}")

In [None]:
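# Point the configuration at the model directory and read its settings
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()

# Create the model and a lazily allocated cache, then load the weights,
# splitting them automatically across the available GPUs
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)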

### Initialize generator

In [None]:
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

### Generate some text

In [None]:
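# Sampling settings control how the next token is picked from the model's predictions
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
# Forbid the end-of-sequence token so that generation runs for the full token budget
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])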
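To give an intuition for what `temperature`, `top_k`, and `top_p` in the settings above do, here is a plain-Python sketch of one common way to combine them. It is a conceptual illustration with a hypothetical function name and made-up scores, not Exllamav2's implementation (which may apply the steps in a different order), and it leaves out the repetition penalty.

```python
import math
import random


def sample_next_token(logits, temperature=0.85, top_k=50, top_p=0.8, rng=random):
    """Pick a token id from raw scores using temperature, top-k, and top-p."""
    # Temperature: divide the scores before the softmax; values below 1
    # sharpen the distribution, values above 1 flatten it.
    scaled = [score / temperature for score in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    probabilities = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(logits)), key=lambda i: probabilities[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probabilities[token_id]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices normalizes the weights itself.
    weights = [probabilities[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Made-up scores for a four-token vocabulary, for illustration only.
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(toy_logits, rng=random.Random(1234)))
```

With these toy scores, the two most likely tokens already cover more than 80% of the probability mass, so the top-p filter removes the other two and only token 0 or token 1 can be returned.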

In [None]:
prompt = "Our story begins in PyCon DE, where once"

max_new_tokens = 150

generator.warmup()
time_begin = time.time()

output = generator.generate_simple(prompt, settings, max_new_tokens, seed=1234)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(
    f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second"
)

Instead of waiting for the full generation to complete, we can also stream the answer back in individual generated chunks:

In [None]:
from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2StreamingGenerator

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

generator.begin_stream_ex(tokenizer.encode(prompt), settings)

print(prompt, end="")
for _ in range(max_new_tokens):
    result = generator.stream_ex()
    if result["eos"]:
        break
    print(result["chunk"], end="")
print()

<hr>

_❗️ **Warning:** Make sure to stop the Jupyter Kernel (in the JupyterLab Menu Bar, click on "Kernel" -> "Shut down Kernel") before proceeding. The loaded model keeps occupying VRAM until the kernel is shut down, which can cause an "insufficient VRAM" error in the next notebook._

<br>

**✨ Next: [Basics of RAG-powered chat app](02-rag-basics.ipynb) →**

<br>

💬 _Wish to continue discussions after the tutorial? Contact the presenters: [@pavithraes](https://github.com/pavithraes), [@dharhas](https://github.com/dharhas), [@ahuang11](https://github.com/ahuang11)_

<hr>