<a href="https://colab.research.google.com/github/NID123-CH/LLM-Codes/blob/main/04_GGUF_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GGUF

In [1]:
!pip install accelerate bitsandbytes peft

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Installing collected packages: bitsandbytes, peft
Successfully installed bitsandbytes-0.44.1 peft-0.13.2


## Install gguf-py

In [2]:
!git clone -b b2760 --single-branch https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp/gguf-py && pip install .

Cloning into 'llama.cpp'...
remote: Enumerating objects: 15623, done.
remote: Counting objects: 100% (4576/4576), done.
remote: Compressing objects: 100% (171/171), done.
remote: Total 15623 (delta 4485), reused 4406 (delta 4405), pack-reused 11047 (from 1)
Receiving objects: 100% (15623/15623), 21.40 MiB | 9.66 MiB/s, done.
Resolving deltas: 100% (11109/11109), done.
Note: switching to '3f167476b11efa7ab08f6cacdeb8cab0935c1249'.

Processing /content/llama.cpp/gguf-py

## Load Base Model and Tokenizer

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = 'microsoft/phi-2'

model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True, device_map={"": 0})

tokenizer = AutoTokenizer.from_pretrained(
    name,
    use_fast=False,
)
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})
tokenizer.add_special_tokens({'additional_special_tokens': ['##[YODA]##>']})

1

### Load Trained Adapter and Merge

In [4]:
!gdown 1kWPIHny94ZDSQffyRiS15yaUScQVe1Ji
!unzip -d yoda_adapter yoda_adapter.zip

Downloading...
From (original): https://drive.google.com/uc?id=1kWPIHny94ZDSQffyRiS15yaUScQVe1Ji
From (redirected): https://drive.google.com/uc?id=1kWPIHny94ZDSQffyRiS15yaUScQVe1Ji&confirm=t&uuid=6c98c653-cf3b-4c31-8b60-2d93fa80fb65
To: /content/yoda_adapter.zip
100% 535M/535M [00:12<00:00, 43.4MB/s]
Archive:  yoda_adapter.zip
  inflating: yoda_adapter/adapter_config.json  
  inflating: yoda_adapter/adapter_model.safetensors  
  inflating: yoda_adapter/README.md  
  inflating: yoda_adapter/training_args.bin  


In [5]:
from peft import PeftModel

fine_tuned_model = PeftModel.from_pretrained(model, './yoda_adapter')
merged_model = fine_tuned_model.merge_and_unload()
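
To make the merge step less opaque, here is a minimal NumPy sketch of what folding an adapter into a base weight amounts to. It assumes the Yoda adapter is a plain LoRA adapter (the notebook does not show `adapter_config.json`), and it ignores dropout and other options. The function `merge_lora`, the toy shapes, and the values of `r` and `lora_alpha` are all illustrative, not taken from the adapter.

```python
import numpy as np


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the dense weight after merging."""
    scaling = lora_alpha / r
    return weight + scaling * (lora_b @ lora_a)


rng = np.random.default_rng(0)
d_out, d_in, r, lora_alpha = 6, 4, 2, 4  # toy sizes, not the phi-2 dimensions

weight = rng.normal(size=(d_out, d_in))  # frozen base weight W
lora_a = rng.normal(size=(r, d_in))      # low-rank factor A
lora_b = rng.normal(size=(d_out, r))     # low-rank factor B
x = rng.normal(size=(3, d_in))           # a small batch of inputs

# Adapter kept separate: base path plus the scaled low-rank path.
y_adapter = x @ weight.T + (lora_alpha / r) * (x @ lora_a.T @ lora_b.T)

# Adapter folded in: one dense matmul with the merged weight.
merged = merge_lora(weight, lora_a, lora_b, lora_alpha, r)
y_merged = x @ merged.T

print(np.allclose(y_adapter, y_merged))
```

Under these assumptions the two outputs agree, which is why the merged model can be saved as an ordinary checkpoint and handed to the GGUF converter without any PEFT-specific handling.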

### Save Merged Model

In [6]:
!rm -rf ./model && mkdir model
merged_model.save_pretrained('./model')
tokenizer.save_pretrained('./model')

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.json',
 './model/merges.txt',
 './model/added_tokens.json')

In [None]:
import gc
del fine_tuned_model
del merged_model
del model
torch.cuda.empty_cache()
gc.collect()

35

## Conversion to GGUF Format

In [7]:
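# Optional: download only the conversion scripts instead of using the llama.cpp clone from above.
# The script names on master may differ from the ones used here, so the pinned commit below is the safer choice.
#!curl -L -o convert.py https://github.com/ggerganov/llama.cpp/raw/master/convert.py
#!curl -L -o convert-hf-to-gguf.py https://github.com/ggerganov/llama.cpp/raw/master/convert-hf-to-gguf.py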

# commit 3f167476b11efa7ab08f6cacdeb8cab0935c1249
#!curl -L -o convert.py https://raw.githubusercontent.com/ggerganov/llama.cpp/3f167476b11efa7ab08f6cacdeb8cab0935c1249/convert.py
#!curl -L -o convert-hf-to-gguf.py https://raw.githubusercontent.com/ggerganov/llama.cpp/3f167476b11efa7ab08f6cacdeb8cab0935c1249/convert-hf-to-gguf.py

In [8]:
!python ./llama.cpp/convert-hf-to-gguf.py ./model

Loading model: model
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting add_bos_token to False
Exporting model to 'model/ggml-model-f16.gguf'
gguf: loading model part 'model-00001-of-00003.safetensors'
token_embd.weight, n_dims = 2, torch.float32 --> float16
blk.0.attn_norm.bias, n_dims = 1, torch.float32 --> float32
blk.0.attn_norm.weight, n_dims = 1, torch.float32 --> float32
blk.0.ffn_up.bias, n_dims = 1, torch.float32 --> float32
blk.0.ffn_up.weight, n_dims = 2, torch.float32 --> float16
blk.0.ffn_down.bias, n_dims = 1, torch.float32 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.float32 --> float16
blk.0.attn_output.bias, n_dims = 1, torch.float32 --> float32
blk.0.attn_output.weight, n_dims = 2, torch.float32 --> float16
blk.0.attn_k.bias, n_dims = 1, torch.float32 --> float32
blk.0.attn_k.weight, n_dims 
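
As a quick sanity check on the exported file, the fixed-size GGUF header can be read with the standard library alone. The sketch below assumes a little-endian GGUF version 2 or 3 file (magic bytes, a 32-bit version, then 64-bit tensor and metadata counts). `read_gguf_header` is a hypothetical helper, not part of `gguf-py`, and the demo parses a synthetic header built from the counts that the quantize log prints later in this notebook rather than the real file.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Parse the fixed-size GGUF header (little-endian, versions 2 and 3)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # uint32 version, followed by uint64 tensor count and uint64 metadata KV count.
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration: version 3, 453 tensors, 19 key-value pairs.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 453, 19))
header = read_gguf_header(fake_file)
print(header)

# On Colab the same helper could be pointed at the exported file:
# with open("./model/ggml-model-f16.gguf", "rb") as f:
#     print(read_gguf_header(f))
```

If the magic bytes do not match, the helper raises `ValueError`, which is a cheap way to catch an interrupted or misnamed export before running the quantizer.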

## Quantization

### Install llama.cpp

In [9]:
!cd llama.cpp && make clean && make
!pip install -r llama.cpp/requirements.txt

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.a *.dll be

### Quantize

In [None]:
!./llama.cpp/quantize ./model/ggml-model-f16.gguf ./model/ggml-model-q4_0.gguf q4_0

main: build = 2760 (3f16747)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing './model/ggml-model-f16.gguf' to './model/ggml-model-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 19 key-value pairs and 453 tensors from ./model/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32      
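
For intuition about what the `q4_0` argument does to each weight tensor, the following pure-Python sketch mimics the Q4_0 scheme on a single block: 32 weights share one half-precision scale, and each weight is stored as a 4-bit code. It is a simplified illustration modelled on the reference quantizer in ggml, not the code llama.cpp actually runs, and the helper names and the sine-wave test data are made up for the example.

```python
import math
import struct

QK4_0 = 32  # number of weights per Q4_0 block


def to_fp16(value):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_block_q4_0(block):
    """Quantize 32 floats to (fp16 scale, 32 integer codes in 0..15)."""
    assert len(block) == QK4_0
    # The weight with the largest magnitude (sign kept) is mapped to code 0.
    extreme = max(block, key=abs)
    d = extreme / -8.0
    inv_d = 1.0 / d if d != 0.0 else 0.0
    codes = [min(15, int(x * inv_d + 8.5)) for x in block]
    return to_fp16(d), codes


def dequantize_block_q4_0(scale, codes):
    """Reconstruct approximate weights from the scale and the 4-bit codes."""
    return [(q - 8) * scale for q in codes]


def pack_block_q4_0(scale, codes):
    """Pack a block as 2 bytes of scale plus 16 bytes holding two codes each."""
    half = QK4_0 // 2
    nibbles = bytes(codes[j] | (codes[j + half] << 4) for j in range(half))
    return struct.pack("<e", scale) + nibbles


block = [math.sin(i) for i in range(QK4_0)]  # toy weights for illustration
scale, codes = quantize_block_q4_0(block)
restored = dequantize_block_q4_0(scale, codes)
packed = pack_block_q4_0(scale, codes)

max_error = max(abs(a - b) for a, b in zip(block, restored))
print(len(packed))                  # 18 bytes per block
print(len(packed) * 8 / QK4_0)      # 4.5 bits per weight
print(max_error <= abs(scale) * 1.01)
```

The 18-byte block for 32 weights works out to 4.5 bits per weight, compared with 16 bits per weight in the f16 file, which is where most of the size reduction comes from.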