
# Exercise: Quantize and Serve a Model

In [None]:
!pip install accelerate bitsandbytes peft

## Clone llama.cpp (Version 2760) / Install GGUF-PY

In [None]:
!git clone -b b2760 --single-branch https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp/gguf-py && pip install .

Collecting bitsandbytes
  Downloading bitsandbytes-0.43.3-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting peft
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-ma

## Supported Models

In [None]:
import gguf
list(gguf.MODEL_ARCH)

[<MODEL_ARCH.LLAMA: 1>,
 <MODEL_ARCH.FALCON: 2>,
 <MODEL_ARCH.BAICHUAN: 3>,
 <MODEL_ARCH.GROK: 4>,
 <MODEL_ARCH.GPT2: 5>,
 <MODEL_ARCH.GPTJ: 6>,
 <MODEL_ARCH.GPTNEOX: 7>,
 <MODEL_ARCH.MPT: 8>,
 <MODEL_ARCH.STARCODER: 9>,
 <MODEL_ARCH.PERSIMMON: 10>,
 <MODEL_ARCH.REFACT: 11>,
 <MODEL_ARCH.BERT: 12>,
 <MODEL_ARCH.NOMIC_BERT: 13>,
 <MODEL_ARCH.BLOOM: 14>,
 <MODEL_ARCH.STABLELM: 15>,
 <MODEL_ARCH.QWEN: 16>,
 <MODEL_ARCH.QWEN2: 17>,
 <MODEL_ARCH.QWEN2MOE: 18>,
 <MODEL_ARCH.PHI2: 19>,
 <MODEL_ARCH.PHI3: 20>,
 <MODEL_ARCH.PLAMO: 21>,
 <MODEL_ARCH.CODESHELL: 22>,
 <MODEL_ARCH.ORION: 23>,
 <MODEL_ARCH.INTERNLM2: 24>,
 <MODEL_ARCH.MINICPM: 25>,
 <MODEL_ARCH.GEMMA: 26>,
 <MODEL_ARCH.STARCODER2: 27>,
 <MODEL_ARCH.MAMBA: 28>,
 <MODEL_ARCH.XVERSE: 29>,
 <MODEL_ARCH.COMMAND_R: 30>,
 <MODEL_ARCH.DBRX: 31>,
 <MODEL_ARCH.OLMO: 32>]

## Load Your Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "Locutusque/gpt2-large-medical"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name,
                                             torch_dtype=torch.float16,
                                             trust_remote_code=True,
                                             device_map={"": 0})

tokenizer_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/68.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/197 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/915 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.55G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

In [None]:
model.__class__

## Save Model to Disk for Conversion

In [None]:
!rm -rf ./model && mkdir model
model.save_pretrained('./model')
tokenizer.save_pretrained('./model')

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.json',
 './model/merges.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

## Convert Model to GGUF

In [None]:
!python ./llama.cpp/convert-hf-to-gguf.py ./model

Loading model: model
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
gguf: Setting special token type pad to 50257
Exporting model to 'model/ggml-model-f16.gguf'
gguf: loading model part 'model.safetensors'
blk.0.attn_qkv.bias, n_dims = 1, torch.float16 --> float32
blk.0.attn_qkv.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_output.bias, n_dims = 1, torch.float16 --> float32
blk.0.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_norm.bias, n_dims = 1, torch.float16 --> float32
blk.0.ffn_norm.weight, n_dims = 1, torch.
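
To sanity-check the exported file, the fixed-size GGUF header can be read directly. The sketch below assumes the GGUF v2/v3 layout (4-byte magic `GGUF`, then a little-endian `uint32` version, `uint64` tensor count, and `uint64` metadata key-value count). The function name `read_gguf_header` is hypothetical, and the demo parses a synthetic header rather than the real file.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(stream):
    """Return (version, tensor_count, kv_count) from a GGUF v2/v3 byte stream."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    # Little-endian, no padding: uint32 version, uint64 tensors, uint64 KV pairs.
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return version, tensor_count, kv_count

# Synthetic header for illustration only; the numbers are placeholders.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 437, 17))
print(read_gguf_header(fake_file))  # → (3, 437, 17)

# Against the real export (assumes the conversion above succeeded):
# with open("./model/ggml-model-f16.gguf", "rb") as f:
#     print(read_gguf_header(f))
```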

## Build llama.cpp for Quantization

In [None]:
!cd llama.cpp && make clean && make
!pip install -r llama.cpp/requirements.txt

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.a *.dll be

## Quantize GGUF Model

In [None]:
!./llama.cpp/quantize ./model/ggml-model-f16.gguf ./model/ggml-model-q8_0.gguf q8_0

main: build = 2760 (3f16747)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing './model/ggml-model-f16.gguf' to './model/ggml-model-q8_0.gguf' as Q8_0
llama_model_loader: loaded meta data with 17 key-value pairs and 437 tensors from ./model/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt2
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                           gpt2.block_count u32              = 36
llama_model_loader: - kv   3:                        gpt2.context_length u32              = 1024
llama_model_loader: - kv   4:                      gpt2.embedding_length u32              = 1280
llama_model_loader: - kv   5:                   gpt2.feed_forward_length u32        
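
For intuition about what `q8_0` does, here is a minimal pure-Python sketch of block-wise 8-bit quantization. It assumes the Q8_0 layout of 32 weights per block, each block stored as 32 `int8` values plus one fp16 scale. The helper names are hypothetical; the real implementation is C code inside llama.cpp, stores the scale in fp16, and may round ties differently than Python's `round()`.

```python
import math

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32

def quantize_q8_0_block(values):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0  # kept as a Python float here; fp16 in the real format
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    quants = [int(round(v / scale)) for v in values]
    return scale, quants

def dequantize_q8_0_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]

def bits_per_weight():
    """Storage cost: 32 int8 values plus one 16-bit scale per block."""
    return (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

block = [math.sin(i) for i in range(BLOCK_SIZE)]  # stand-in weights
scale, quants = quantize_q8_0_block(block)
restored = dequantize_q8_0_block(scale, quants)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
print("bits per weight:", bits_per_weight())  # → 8.5
```

At 8.5 bits per weight, a 1.55 GB fp16 checkpoint would shrink to roughly 0.82 GB if every tensor were quantized. Treat this as a rough estimate only, since the converter log shows that 1-D tensors are stored as float32.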

## Install Ollama

In [None]:
!curl https://ollama.ai/install.sh | sh
!pip install ollama

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0>>> Downloading ollama...
100 11868    0 11868    0     0  28989      0 --:--:-- --:--:-- --:--:-- 29017
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Collecting ollama
  Downloading ollama-0.3.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx<0.28.0,>=0.27.0 (from ollama)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting httpcore==1.* (from httpx<0.28.0,>=0.27.0->ollama)

## Helper Functions for Serving Ollama

In [None]:
# https://stackoverflow.com/questions/77697302/how-to-run-ollama-in-google-colab

import os
import asyncio

# NB: Depending on which backend you are running, you may need to set these
# environment variables to get CUDA working.
# Add the CUDA binaries to PATH
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to the /usr/lib64-nvidia and CUDA lib directories
# (this overwrites any existing value)
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # Forward each line of a stream to the notebook output as it arrives
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # Drain stdout and stderr concurrently until the process closes them
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

In [None]:
import asyncio
import threading

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()

# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()

>>> starting ollama serve
2024/08/07 16:24:46 routes.go:1108: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-07T16:24:46.408Z level=INFO source=images.go:781 msg="total blobs: 2"
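
Because the server starts in a background thread, later cells can run before it is ready. A small polling helper avoids that race. This is a sketch, not part of the original notebook: `wait_for_server` is a hypothetical name, and it assumes the server answers `GET /` with HTTP 200 on the default address `127.0.0.1:11434` reported by the installer.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; retry below
        time.sleep(interval)
    return False

# Usage in the notebook, after starting the server thread:
# if not wait_for_server():
#     raise RuntimeError("Ollama server did not come up in time")
```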


## Custom Model File

In [None]:
print(tokenizer.chat_template)

None


In [None]:
tokenizer.all_special_tokens, tokenizer.additional_special_tokens

(['<|endoftext|>', '[PAD]', '<|USER|>', '<|ASSISTANT|>'],
 ['<|USER|>', '<|ASSISTANT|>'])

In [None]:
# https://huggingface.co/Locutusque/gpt2-large-conversational

modelfile = ('FROM ./model/ggml-model-q8_0.gguf'
'\n\nTEMPLATE """'
'\n<|USER|> {{ .Prompt }} <|ASSISTANT|> {{ .Response }} <|endoftext|>'
'\n"""'
'\n\nPARAMETER stop <|endoftext|>'
'\nPARAMETER stop [PAD]'
'\nPARAMETER stop <|ASSISTANT|>')

In [None]:
print(modelfile)

with open('modelfile_custom', 'w') as f:
    f.write(modelfile)

FROM ./model/ggml-model-q8_0.gguf

TEMPLATE """
<|USER|> {{ .Prompt }} <|ASSISTANT|> {{ .Response }} <|endoftext|>
"""

PARAMETER stop <|endoftext|>
PARAMETER stop [PAD]
PARAMETER stop <|ASSISTANT|>
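
The same Modelfile can be assembled from its parts, which makes it easier to swap the template or the stop tokens for a different model. `build_modelfile` is a hypothetical helper, shown only as a sketch; it produces the same text as the string literal above.

```python
def build_modelfile(gguf_path, template, stop_tokens):
    """Assemble an Ollama Modelfile from a GGUF path, a template, and stop tokens."""
    lines = [
        f"FROM {gguf_path}",
        "",
        'TEMPLATE """',
        template,
        '"""',
        "",
    ]
    # One PARAMETER line per stop token
    lines += [f"PARAMETER stop {token}" for token in stop_tokens]
    return "\n".join(lines)

modelfile_text = build_modelfile(
    "./model/ggml-model-q8_0.gguf",
    "<|USER|> {{ .Prompt }} <|ASSISTANT|> {{ .Response }} <|endoftext|>",
    ["<|endoftext|>", "[PAD]", "<|ASSISTANT|>"],
)
print(modelfile_text)
```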


## Loading Model into Ollama

In [None]:
!ollama create -f modelfile_custom medical

[GIN] 2024/08/07 - 16:55:01 | 200 |      32.516µs |       127.0.0.1 | HEAD     "/"
transferring model data

In [None]:
!ollama list

[GIN] 2024/08/07 - 16:56:03 | 200 |       32.73µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/07 - 16:56:03 | 200 |     540.998µs |       127.0.0.1 | GET      "/api/tags"
NAME          	ID          	SIZE  	MODIFIED       
medical:latest	f3161abed622	895 MB	44 seconds ago	


## Querying the Model

In [None]:
import ollama
response = ollama.chat(model='medical', messages=[
  {
    'role': 'user',
    'content': 'I got a bad headache on the left side of my forehead. How can I get better?',
  },
], stream=False)

[GIN] 2024/08/07 - 16:59:25 | 200 |  776.614301ms |       127.0.0.1 | POST     "/api/chat"


In [None]:
print(response['message']['content'])

 an anti-inflammatory medication can be prescribed to relieve pain caused by inflammation of the brain and the blood vessels in the brain. This medication is called aspirin and it is used to treat headaches. It should not be used if you have already had a stroke or if there is any bleeding in your brain.
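
The `ollama` Python client talks to the server's HTTP API, the `POST /api/chat` endpoint visible in the server log. The sketch below builds the same kind of request with the standard library only. The helper names `build_chat_request` and `send_chat` are hypothetical, and the response shape is assumed to match what the client returned above (`message` → `content`).

```python
import json
import urllib.request

def build_chat_request(model, prompt, host="http://127.0.0.1:11434"):
    """Build a non-streaming POST request for the /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage (requires the Ollama server to be running with the 'medical' model):
# req = build_chat_request("medical", "How can I relieve a headache?")
# print(send_chat(req))
```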
