<img src="images/ragna-logo.png" width=15% align="right"/>

# Set up an offline Large Language Model

<hr>

## What is a Large Language Model (LLM)?

A "language model" is a machine learning model designed to understand and generate (predict) natural language. For example, auto-completion of text in input fields often uses a language model.
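To make "predict" concrete, here is a minimal sketch of a toy auto-completion model. It only counts which word follows which in a tiny, made-up corpus (a "bigram" model), which is far simpler than any real LLM. The function names and the corpus are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training text; a real model is trained on vastly more data.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # most common word after "the" in the corpus
print(predict_next(model, "dog"))  # unseen word, so there is no prediction
```

A model like this can only suggest words it has seen directly after the input word, which is one reason larger models trained on more data are needed for general text generation.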

A "large language model" is a language model based on the [Transformer architecture](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)), trained on large amounts of (general) data, and made up of millions to billions of parameters. With this scale and complexity, LLMs are capable of various text processing and generation tasks, such as document summarization, answering common questions, and text-based content creation.

Popular LLMs include OpenAI's GPT, Google's Gemini, Anthropic's Claude, and more.

## What is a "local" or "offline" LLM?

Large Language Models (LLMs) like OpenAI's GPT are proprietary and can only be accessed through the OpenAI API or services like ChatGPT. While easy to use, they raise concerns about data privacy, vendor lock-in, and cost.

Offline, local, or open-weight LLMs are models that you can host yourself on your own computers.

Today, we're running the model on a cloud platform, but each of you has access to what is essentially an individual machine. This gives us a standard tutorial environment.

## LLM: Mistral 7B

In this tutorial, we'll use the Mistral 7B model, which is released under the Apache 2.0 license.

This is a well-performing and popular model for offline use. Learn more at [mistral.ai](https://mistral.ai/news/announcing-mistral-7b/).

### Quantization

> Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
> 
> Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.
> 
> ~ [Hugging Face Documentation](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
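As a rough illustration of the idea in the quote, the sketch below maps a handful of floating-point weights to 8-bit integers and back. It assumes simple symmetric quantization with one scale factor for the whole list. This is only one basic scheme; it is not meant to describe how any particular library quantizes a model. The helper names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale factor for the whole list; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.51, 0.33, 1.0, -0.07]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"scale: {scale:.6f}")
print(f"largest round-trip error: {max_error:.6f}")
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale factor per weight.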


## ExLlamaV2

[ExLlamaV2](https://github.com/turboderp/exllamav2) is a quantization and inference library.

To download the quantized Mistral 7B weights locally:

1. Install ExLlamaV2.
2. In the terminal, run `git lfs install` and then `git clone https://huggingface.co/turboderp/Mistral-7B-v0.2-exl2`.
3. From inside the cloned directory, view all branches: `git branch --all`
4. Check out the weights of your choice, for example: `git checkout remotes/origin/2.5bpw`
5. Note the model directory path, and use it in the scripts.
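When choosing a branch, it can help to estimate how much memory the weights will need. The sketch below assumes that the `bpw` suffix in a branch name such as `2.5bpw` stands for average bits per weight, and that the model has roughly 7.3 billion parameters. Both are assumptions made for this estimate, and every `bpw` value other than 2.5 is a hypothetical example. The result covers the weights only, not the cache or other runtime overhead.

```python
def estimated_weight_size_gb(num_parameters, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    total_bits = num_parameters * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: roughly 7.3 billion parameters for a "7B" model.
ASSUMED_PARAMETERS = 7.3e9

# 2.5 matches the branch used in the steps above; the other values are
# hypothetical, with 16.0 standing in for unquantized 16-bit weights.
for bpw in (2.5, 4.0, 8.0, 16.0):
    size = estimated_weight_size_gb(ASSUMED_PARAMETERS, bpw)
    print(f"{bpw:>4} bits per weight -> about {size:.1f} GB")
```

Lower bits per weight means a smaller download and less GPU memory, usually at some cost in output quality.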

Let's run the [example inference script](https://github.com/turboderp/exllamav2/blob/master/examples/inference.py):

In [14]:
import torch
torch.cuda.is_available()

True

In [11]:
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler
)

import time

In [12]:
# Initialize model and cache

from pathlib import Path

dir_relative_path = "shared/developer/pavithraes/Mistral-7B-v0.2-exl2/"

model_directory = str(Path.home() / dir_relative_path)
print("Loading model: " + model_directory)

Loading model: /home/peswaramoorthy@quansight.com/shared/developer/pavithraes/Mistral-7B-v0.2-exl2


In [15]:
config = ExLlamaV2Config()

config.model_dir = model_directory

config.prepare()

In [16]:
model = ExLlamaV2(config)

In [17]:
cache = ExLlamaV2Cache(model, lazy = True)

In [18]:
model.load_autosplit(cache)

In [19]:
tokenizer = ExLlamaV2Tokenizer(config)

In [20]:
# Initialize generator

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

In [21]:
# Generate some text

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_k = 50
settings.top_p = 0.8
settings.token_repetition_penalty = 1.01
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

prompt = "Our story begins in the Scottish town of Auchtermuchty, where once"

max_new_tokens = 150

generator.warmup()
time_begin = time.time()

output = generator.generate_simple(prompt, settings, max_new_tokens, seed = 1234)

time_end = time.time()
time_total = time_end - time_begin

print(output)
print()
print(f"Response generated in {time_total:.2f} seconds, {max_new_tokens} tokens, {max_new_tokens / time_total:.2f} tokens/second")

Our story begins in the Scottish town of Auchtermuchty, where once, long ago, in the early 1900s, there was a small shop called C.S. Robertson & Co. The shop sold a lot of things, but the most important thing it sold was coal. And in a town like Auchtermuchty, with its harsh winters and its poor housing, coal was essential. It warmed the homes of the townspeople. It heated the factories. It kept the wheels of industry turning.

But coal was not cheap. And coal was not easy to come by. In those days, Auchtermuchty was a poor town. Many of its people lived in cramped, damp, unheated houses. Many of its people had trouble making ends

Response generated in 8.69 seconds, 150 tokens, 17.26 tokens/second
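The sampler settings used above (`temperature`, `top_k`, `top_p`) control how the next token is picked from the model's raw scores. The following is a generic, pure-Python sketch of how these three settings are commonly combined. It is an illustration under that assumption, not ExLlamaV2's actual sampler, and it leaves out the repetition penalty and disallowed tokens. The function name and the toy scores are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=0.85, top_k=50, top_p=0.8, rng=random):
    """Pick a token index from raw scores using temperature, top-k and top-p."""
    # Temperature: divide the scores; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw one of the surviving tokens, weighted by its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical scores for a 4-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(1234)

counts = [0] * len(toy_logits)
for _ in range(1000):
    counts[sample_next_token(toy_logits, rng=rng)] += 1

print(counts)
```

With these toy scores, the two most likely tokens already cover more than 80% of the probability, so the top-p cut-off removes the other two and they are never sampled.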


<hr>

**✨ Next: [Basics of RAG-powered chat app](02-rag-basics.ipynb) →**

<hr>