<a href="https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/LLM_Inference__GGUF__WizardCoder_Python_34B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WizardCoder-Python-34B on Colab's free tier with GGUF



`GGUF` is the successor to the GGML file format used by llama.cpp and addresses the limitations of the older ".bin" files. Unlike the previous format, GGUF can store additional model information (metadata) in an extensible way and supports a wider range of model types.
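To see what "supplementary model information" looks like at the byte level, here is a minimal sketch that parses the fixed-size header at the start of a GGUF file. It assumes the little-endian layout of the GGUF specification (4-byte magic `GGUF`, a 32-bit version, then a tensor count and a metadata key/value count, which are 32-bit in version 1 and 64-bit from version 2 on). The function name `read_gguf_header` and the sample values are ours, not part of any library.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file (little-endian assumed)."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file (bad magic)")
    (version,) = struct.unpack_from("<I", data, 4)
    # Assumption: GGUF v1 stored both counts as 32-bit integers; v2 and later use 64-bit.
    count_fmt = "<II" if version == 1 else "<QQ"
    tensor_count, kv_count = struct.unpack_from(count_fmt, data, 8)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic 24-byte header used only for demonstration (not a real model file).
sample = b"GGUF" + struct.pack("<IQQ", 2, 363, 19)
header = read_gguf_header(sample)
print(header)
```

For a real file, the same function can be fed the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))`, where `path` points to a downloaded `.gguf` file.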

llama-cpp-python allows us to perform inference with quantized language models. For more details on how to use it, you can visit the following notebook at [![Open](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/LLM_Inference_with_llama_cpp_python__Llama_2_13b_chat.ipynb).

| Code Credits | Link |
| ----------- | ---- |
| 🎉 llama-cpp-python | [![GitHub Repository](https://img.shields.io/github/stars/abetlen/llama-cpp-python?style=social)](https://github.com/abetlen/llama-cpp-python) |
| 🔥 Discover More Colab Notebooks | [![GitHub Repository](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square&logo=github)](https://github.com/R3gm/InsightSolver-Colab/) |


In [None]:
# Build llama-cpp-python with GPU (cuBLAS) support; GGUF is supported starting from llama-cpp-python==0.1.79
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
# For downloading the models
!pip install huggingface_hub

# Select the model

When we use GGUF, we can offload model layers to the GPU, which speeds up inference. Small models can be offloaded entirely, but what lets us run large models on a T4 is keeping the remaining layers in CPU RAM. We therefore use both the GPU and the CPU for inference. Since Colab only provides 2 CPU cores, this can be quite slow, but it still lets us run models such as Llama 2 70B that have been quantized beforehand.

We will use the quantized model [WizardCoder-Python-34B-V1.0-GGUF](https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GGUF), derived from [WizardCoder Python 34B](https://huggingface.co/WizardLM/WizardCoder-Python-34B-V1.0) and quantized with the k-quants method Q4_K_M.
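As a rough way to reason about the GPU/CPU split, the sketch below estimates how many layers fit in VRAM by dividing the model file size evenly across its layers. This is only a back-of-the-envelope heuristic: the helper `estimate_gpu_layers` is hypothetical, and the inputs are assumptions (48 transformer layers for the 34B architecture, about 15 GB of usable VRAM on a T4, and 2 GB held back for the KV cache and scratch buffers). Only the 20.5 GB file size comes from this notebook.

```python
import math

def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough estimate of how many layers fit on the GPU.

    Assumes every layer takes an equal share of the file size and that
    `reserve_gb` of VRAM must stay free for the KV cache and scratch space.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# Assumed figures: 48 layers, ~15 GB usable VRAM; 20.5 GB is the Q4_K_M file size.
layers = estimate_gpu_layers(model_size_gb=20.5, n_layers=48, vram_gb=15.0)
print(layers)
```

With these assumed numbers the estimate is 30, in line with the `n_gpu_layers=30` setting used later in this notebook. Treat that as a sanity check rather than a derivation, and lower the value if loading fails with an out-of-memory error.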


In [2]:
model_name_or_path = "TheBloke/WizardCoder-Python-34B-V1.0-GGUF"
model_basename = "wizardcoder-python-34b-v1.0.Q4_K_M.gguf"

About k-quants

`k-quants` are a family of quantization methods ranging from 2 to 6 bits, designed to reduce model size while preserving generation quality. They let users pick the quantized model that best fits their hardware constraints. The quantization error they introduce is small; for instance, 6-bit quantization keeps perplexity within 0.1% of the original fp16 model.
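To make the idea of low-bit quantization concrete, here is a deliberately simplified sketch of block-wise quantization with one scale and one minimum per block. It is not the actual k-quants scheme, which uses a more elaborate layout (super-blocks with quantized scales); the function names and sample weights are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using a per-block scale and minimum."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return scale, lo, codes

def dequantize_block(scale, lo, codes):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + scale * c for c in codes]

# Made-up weights standing in for one block of a weight tensor.
block = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.29]
scale, lo, codes = quantize_block(block, bits=4)
restored = dequantize_block(scale, lo, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"max abs error: {max_error:.4f} (half a step is {scale / 2:.4f})")
```

Each weight is stored as a 4-bit code instead of a 16-bit float, and the reconstruction error stays within half a quantization step. More bits mean smaller steps, which is why the higher-bit variants track the fp16 model more closely.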

- Perplexity on the wikitext dataset as a function of model size:


![Perplexity on wikitext as a function of model size](https://user-images.githubusercontent.com/48489457/243093269-07aa49f0-4951-407f-9789-0b5a01ce95b8.png)


Note: Perplexity is a metric used to evaluate language models by measuring how well they predict sequences of words, with lower values indicating better predictive performance.
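As a small illustration of the metric, perplexity can be computed as the exponential of the average negative log-probability the model assigns to each token. The sketch below uses made-up token probabilities; the function name `perplexity` is ours.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
uniform = [math.log(0.25)] * 4
ppl_uniform = perplexity(uniform)

# A more confident model (made-up probabilities) gets a lower perplexity.
confident = [math.log(p) for p in (0.9, 0.8, 0.95, 0.85)]
ppl_confident = perplexity(confident)

print(ppl_uniform, ppl_confident)
```

The first value comes out at about 4 and the second close to 1, which matches the intuition that lower perplexity means better prediction.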

First, we download the model.

In [3]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename,
    cache_dir='/content/models', # Directory for the model
)

Downloading (…)34b-v1.0.Q4_K_M.gguf:   0%|          | 0.00/20.5G [00:00<?, ?B/s]

## Stream inference with llama-cpp-python

Loading the model

In [None]:
from llama_cpp import Llama

llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU
    n_gpu_layers=30, # About the maximum for this model on a T4; for Llama 2 70B, put fewer layers on the GPU
    n_ctx=4096, # Context window
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


Prompt.

In [None]:
prompt = "Example of linear regression in python"
prompt_template=f'''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{prompt}

### Response:'''
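If several instructions need to be sent to the model, the template above can be wrapped in a small helper. This is just a convenience sketch: `ALPACA_TEMPLATE` and `build_prompt` are hypothetical names, and the template text is copied from the cell above.

```python
# Instruction-style template, copied from the prompt used in this notebook.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)

def build_prompt(instruction: str) -> str:
    """Fill the instruction into the template, trimming stray whitespace."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())

example_prompt = build_prompt("  Example of linear regression in python ")
print(example_prompt)
```

Keeping the template in one place makes it less likely that the stop strings and the prompt markers drift apart when the prompt is edited.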

Stream response

We are using both the GPU and the CPU for inference, which lets us run the model, although generation is very slow.

In [None]:
stream = llm(
    prompt_template,
    max_tokens=16, # Maximum number of new tokens to generate; increase it for a longer response
    temperature=0.8,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    echo=False,
    stream=True,
    stop=["Instruction:", "Response:"] # Stop generation when any of these strings appears
)

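response = ''
for output in stream:
    # Each streamed chunk carries a small piece of the generated text
    text_output = output['choices'][0]['text'].replace('\r', '')
    # Print in blue (ANSI escape codes) without a trailing newline
    print("\033[34m" + text_output + "\033[0m", end='')
    response += text_output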

Llama.generate: prefix-match hit


    Sure! Here's an example of how to perform linear regression using Python:


    ```python
    import pandas as pd

In [None]:
print(response)
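Since generation with a CPU/GPU split is slow, it can help to measure the throughput while streaming. The sketch below wraps the same loop in a hypothetical helper, `consume_stream`, and runs it on a fake generator that mimics the chunk structure seen above (`chunk['choices'][0]['text']`), so it works without loading a model. Streamed chunks usually correspond to single tokens, but that is an assumption, so the rate is reported per chunk.

```python
import time

def consume_stream(stream):
    """Collect streamed text and report the number of chunks and chunks per second."""
    pieces = []
    start = time.perf_counter()
    for chunk in stream:
        pieces.append(chunk["choices"][0]["text"].replace("\r", ""))
    elapsed = time.perf_counter() - start
    rate = len(pieces) / elapsed if elapsed > 0 else float("inf")
    return "".join(pieces), len(pieces), rate

def fake_stream():
    # Stand-in for llm(prompt_template, stream=True); yields chunks of the same shape.
    for piece in ["Sure", "!", " Here", "'s", " an", " example"]:
        yield {"choices": [{"text": piece}]}

text, n_chunks, chunks_per_second = consume_stream(fake_stream())
print(text)
print(f"{n_chunks} chunks, {chunks_per_second:.1f} chunks/s")
```

With a real model, pass the object returned by `llm(..., stream=True)` instead of `fake_stream()`.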

# Convert a GGML model to GGUF

`Restart the runtime` before running this section.

We download a GGML model. The GGML format is no longer supported, but such models can be converted to GGUF.
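Before converting, it can be useful to check which format a file is actually in. The sketch below inspects the first bytes of a file. It assumes that GGUF files start with the magic `GGUF` and that GGML v3 (GGJT) files start with the bytes `tjgg` followed by a 32-bit little-endian version number; the helper name `sniff_model_format` is ours.

```python
import struct

def sniff_model_format(first_bytes: bytes) -> str:
    """Guess the model file format from its first 8 bytes."""
    magic = first_bytes[:4]
    if magic == b"GGUF":
        return "gguf"
    # Assumption: the 'ggjt' magic is stored little-endian, so it reads as b'tjgg' on disk.
    if magic == b"tjgg" and len(first_bytes) >= 8:
        (version,) = struct.unpack_from("<I", first_bytes, 4)
        return f"ggjt-v{version}"
    return "unknown"

# Synthetic headers for demonstration only.
gguf_kind = sniff_model_format(b"GGUF" + struct.pack("<I", 2))
ggml_kind = sniff_model_format(b"tjgg" + struct.pack("<I", 3))
print(gguf_kind, ggml_kind)
```

For a downloaded file, read the first 8 bytes with `open(path, "rb").read(8)` and pass them to the function; a result of `ggjt-v3` suggests the file needs the conversion described in this section.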

In [7]:
# Requirements
!pip install gguf
!wget https://raw.githubusercontent.com/ggerganov/llama.cpp/master/convert-llama-ggmlv3-to-gguf.py

Collecting gguf
  Downloading gguf-0.3.0-py3-none-any.whl (9.7 kB)
Installing collected packages: gguf
Successfully installed gguf-0.3.0
--2023-08-30 17:57:31--  https://raw.githubusercontent.com/ggerganov/llama.cpp/master/convert-llama-ggmlv3-to-gguf.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15633 (15K) [text/plain]
Saving to: ‘convert-llama-ggmlv3-to-gguf.py’


2023-08-30 17:57:31 (21.7 MB/s) - ‘convert-llama-ggmlv3-to-gguf.py’ saved [15633/15633]



Select the GGML model

In [10]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

In [11]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename,
    cache_dir='/content/models'
)

Downloading (…)chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

Convert GGML to GGUF using the script provided by llama.cpp.

In [8]:
# Experimental
!python convert-llama-ggmlv3-to-gguf.py \
  -i $model_path \
  -o llama-2-13b-GGUF.gguf \
  --context-length 4096 \
  --eps 1e-5

* Using config: Namespace(input=PosixPath('models--TheBloke--Llama-2-13B-chat-GGML/snapshots/47d28ef5de4f3de523c421f325a2e4e039035bab/llama-2-13b-chat.ggmlv3.q5_1.bin'), output=PosixPath('llama-2-13b-GGUF.gguf'), name=None, desc=None, gqa=1, eps='1e-5', context_length=4096, model_metadata_dir=None, vocab_dir=None, vocabtype='spm')


* Scanning GGML input file
* GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=5120, n_mult=256, n_head=40, n_layer=40, n_rot=128, n_ff=13824, ftype=9>


* Preparing to save GGUF file
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 363 tensor(s)
    gguf: write header
    gguf: write metadata
    gguf: write tensors
* Successful completion. Output saved to: llama-2-13b-GGUF.gguf


In [12]:
# Path to our converted model.
model_path = 'llama-2-13b-GGUF.gguf'

In [13]:
from llama_cpp import Llama

llm = Llama(
    model_path=model_path,
    n_threads=2,
    n_batch=512,
    n_gpu_layers=43, # On a T4, we can offload all layers of a 13B model.
    n_ctx=4096,
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


In [14]:
prompt = "Write a poem about llamas."
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''

Stream inference

In [None]:
stream = llm(
    prompt_template,
    max_tokens=1024,
    temperature=0.7,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    echo=False,
    stream=True,
    stop=["USER:", "ASSISTANT:", "SYSTEM:"]
)

response = ''
for output in stream:
    text_output = output['choices'][0]['text'].replace('\r', '')
    response += text_output
    # Print in red (ANSI escape codes) without a trailing newline
    print("\033[31m" + text_output + "\033[0m", end='')

Oh llama, oh so fine
Your woolly coat, your gentle eyes
You roam the Andean highlands divine
A symbol of grace and pride

With necks that bend and legs that stretch
You strut across the mountainside
Your spit-curls dripping, ears erect
A creature of beauty, inside

In the f

In [None]:
print(response)

# References
- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
- [renenyffenegger.ch - LLaMA C++ Library](https://renenyffenegger.ch/notes/development/Artificial-intelligence/language-model/LLM/LLaMA/libs/llama_cpp/)
- [LLaMA C++ Library Documentation](https://llama-cpp-python.readthedocs.io/en/latest/)
- [MacOS Install with Metal GPU - LLaMA C++ Library Documentation](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/)
