<a href="https://colab.research.google.com/github/Bhabuk10/FineTuning_LLMs/blob/main/Quantizing_LLMs_with_Llama_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantizing and Running a Llama 3.1 Model on CPU and GPU with llama.cpp


In this notebook, we walk through downloading the Llama 3.1 model from HuggingFace, quantizing it with the llama.cpp library, and running inference with the quantized model.


# Download the Pre-trained Model from HuggingFace

We are going to download the 8B version of Llama 3.1 from HuggingFace.


In [None]:
from huggingface_hub import snapshot_download

llama_model_id = "meta-llama/Meta-Llama-3.1-8B"
local_model_dir = "./downloaded_model/"  # Directory to store the downloaded model

In [None]:
# Skip *.pth files: the conversion step reads the safetensors weights, so the original PyTorch checkpoint is not needed.
snapshot_download(repo_id=llama_model_id, local_dir=local_model_dir, ignore_patterns=["*.pth"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/7.63k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/40.9k [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

USE_POLICY.md:   0%|          | 0.00/4.69k [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

original/params.json:   0%|          | 0.00/199 [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

'/content/downloaded_model'
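
As a quick sanity check on the download, the sketch below totals the bytes on disk per file extension. The helper name `summarize_by_suffix` is hypothetical (it is not part of `huggingface_hub`), and the path is assumed to be the `local_model_dir` used above.

```python
from collections import defaultdict
from pathlib import Path


def summarize_by_suffix(root):
    """Return {suffix: total_bytes} for every file found under `root`."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension (e.g. LICENSE) are grouped under "(none)".
            totals[path.suffix or "(none)"] += path.stat().st_size
    return dict(totals)


# Assumption: "./downloaded_model/" is the directory passed to snapshot_download.
for suffix, size in sorted(summarize_by_suffix("./downloaded_model/").items()):
    print(f"{suffix:15s} {size / 1e9:8.2f} GB")
```

If the four safetensors shards downloaded completely, the `.safetensors` row should come out close to the sum of the sizes shown in the progress bars.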

# Clone the llama.cpp Repository



In [None]:
!git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 34776, done.
remote: Counting objects: 100% (6758/6758), done.
remote: Compressing objects: 100% (361/361), done.
remote: Total 34776 (delta 6569), reused 6465 (delta 6396), pack-reused 28018 (from 1)
Receiving objects: 100% (34776/34776), 58.27 MiB | 16.18 MiB/s, done.
Resolving deltas: 100% (25224/25224), done.


In [None]:
# Create a folder to store the converted and quantized models (-p avoids an error if it already exists).
!mkdir -p models


# Convert the Model to GGUF Format
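llama.cpp loads models from GGUF files, so the HuggingFace checkpoint must first be converted to GGUF. The conversion below keeps the weights in 16-bit floating point (FP16).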

In [None]:
!python llama.cpp/convert_hf_to_gguf.py ./downloaded_model/ --outfile models/llama3.1_FP16.gguf

INFO:hf-to-gguf:Loading model: downloaded_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> F1
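
To confirm that the converted file really is GGUF, one option is to read its fixed-size header directly. The sketch below assumes a little-endian file of GGUF version 2 or later (the logs in this notebook report "Little Endian" and "GGUF V3"), where the header is a 4-byte magic, a 32-bit version, and two 64-bit counts. The function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata KV count)


def read_gguf_header(path):
    """Read the magic, version, tensor count, and metadata KV count of a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values, no padding.
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Usage (path assumed to be the FP16 file listed in the `models` directory):
# print(read_gguf_header("/content/models/llama3.1_FP16.gguf"))
```

For the model in this notebook, the counts should agree with what `llama_model_loader` prints when the file is loaded (26 key-value pairs and 292 tensors).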

# Build llama.cpp and Quantize the Model
We will now build the llama.cpp library and quantize the FP16 GGUF model with the `Q4_K_M` quantization method, which stores most weights in roughly 4 bits.
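
A rough back-of-the-envelope calculation shows what this quantization buys. The sketch below estimates the parameter count from the four bfloat16 shard sizes shown in the download progress bars, then derives the effective bits per weight from the 4.92 GB size of the quantized file reported when it is uploaded at the end of this notebook. All figures are rounded and ignore file headers and metadata, so treat the results as approximations; the function names are illustrative only.

```python
# Shard sizes in GB, as reported by the download progress bars.
SHARD_SIZES_GB = [4.98, 5.00, 4.92, 1.17]
QUANTIZED_SIZE_GB = 4.92  # size of the Q4_K_M file shown during upload


def approx_param_count(shard_sizes_gb, bytes_per_weight=2):
    """Estimate the number of weights, assuming 2 bytes (bfloat16) per weight."""
    return sum(shard_sizes_gb) * 1e9 / bytes_per_weight


def bits_per_weight(file_size_gb, n_params):
    """Average number of bits used to store each weight in a file of the given size."""
    return file_size_gb * 1e9 * 8 / n_params


n_params = approx_param_count(SHARD_SIZES_GB)
print(f"approx. parameters : {n_params / 1e9:.2f} B")
print(f"Q4_K_M bits/weight : {bits_per_weight(QUANTIZED_SIZE_GB, n_params):.2f}")
print(f"size reduction     : {sum(SHARD_SIZES_GB) / QUANTIZED_SIZE_GB:.1f}x")
```

The effective figure lands somewhat above 4 bits because `Q4_K_M` is a mixed scheme that keeps some tensors at higher precision and stores per-block scales alongside the quantized values.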

In [None]:
!mkdir llama.cpp/build && cd llama.cpp/build && cmake .. && cmake --build . --config Release


-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP found
-- Using llamafile
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (1.8s)
-- Generating done (0.2s)
-- Build files have been written to: /co

In [None]:
# Check the contents of the 'models' directory to verify the existence of the converted file.
!ls /content/models/


llama3.1_FP16.gguf


In [None]:
# Quantize the model from FP16 (16-bit floating point) to the Q4_K_M format.

!cd /content/llama.cpp/build/bin && ./llama-quantize /content/models/llama3.1_FP16.gguf /content/models/llama_3.1_Q4_K_M.gguf q4_K_M

main: build = 3827 (7691654c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/models/llama3.1_FP16.gguf' to '/content/models/llama_3.1_Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 26 key-value pairs and 292 tensors from /content/models/llama3.1_FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Downloaded_Model
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3.1
llama_model_loader: - kv   5:          


# Inference using the Quantized Model
Now we load the quantized model and run inference with the `llama-cpp-python` library.

In [None]:
!pip install llama-cpp-python==0.2.85

Collecting llama-cpp-python==0.2.85
  Downloading llama_cpp_python-0.2.85.tar.gz (49.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.3/49.3 MB 9.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.85)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.9 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.85-cp310-cp310-linux_x86_64.whl size=2873166 sha256=50e

In [None]:
from llama_cpp import Llama

# Specify the path to the quantized model in Q4_K_M format.
quantized_model_path = "/content/models/llama_3.1_Q4_K_M.gguf"

# Initialize the Llama model using the quantized model.
llm = Llama(model_path=quantized_model_path)


llama_model_loader: loaded meta data with 26 key-value pairs and 292 tensors from /content/models/llama_3.1_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Downloaded_Model
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3.1
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "p
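
The cell above loads the model with default settings. The `Llama` constructor also accepts `n_ctx` (context window size) and `n_gpu_layers` (how many layers to offload to the GPU), which is one way to cover the GPU case from the title. The sketch below only builds the keyword arguments; the helper `llama_kwargs` is hypothetical, and GPU offload is assumed to require a `llama-cpp-python` build with GPU support, which the plain `pip install` above may not provide.

```python
def llama_kwargs(model_path, use_gpu=False, n_ctx=2048):
    """Build keyword arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,  # context window, in tokens
        # -1 asks llama.cpp to offload every layer; 0 keeps all layers on the CPU.
        "n_gpu_layers": -1 if use_gpu else 0,
    }


kwargs = llama_kwargs("/content/models/llama_3.1_Q4_K_M.gguf", use_gpu=True)
print(kwargs)

# Usage, assuming llama-cpp-python is installed with GPU support:
# from llama_cpp import Llama
# llm = Llama(**kwargs)
```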

In [None]:
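# Define the parameters for text generation (inference).
generation_params = {
    "max_tokens": 256,  # Limit the number of tokens generated
    "echo": False,      # Do not repeat the prompt in the returned text
    "top_k": 1          # Always pick the most likely token (greedy decoding)
}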

# Provide a prompt for the model to generate a response.
user_prompt = "Which nation won the FIFA World Cup 2022 ?"

# Run inference using the Llama model.
output = llm(user_prompt, **generation_params)

# Display the model's response.
print(output)


llama_print_timings:        load time =    4313.24 ms
llama_print_timings:      sample time =      25.16 ms /   256 runs   (    0.10 ms per token, 10173.26 tokens per second)
llama_print_timings: prompt eval time =    4313.13 ms /    12 tokens (  359.43 ms per token,     2.78 tokens per second)
llama_print_timings:        eval time =  175457.88 ms /   255 runs   (  688.07 ms per token,     1.45 tokens per second)
llama_print_timings:       total time =  180327.31 ms /   267 tokens


{'id': 'cmpl-708d3126-f456-4982-9650-b1adcf5ffd6a', 'object': 'text_completion', 'created': 1727339640, 'model': '/content/models/llama_3.1_Q4_K_M.gguf', 'choices': [{'text': ' The answer is: Argentina. The 2022 FIFA World Cup was the 22nd edition of the FIFA World Cup, an international men’s football tournament contested by the senior national teams of the member associations of FIFA. It took place in Qatar from 20 November to 18 December 2022. Argentina won the tournament after defeating France 4–2 in the final. This was Argentina’s third World Cup title, and their first since 1986. Lionel Messi, who played for Argentina, was named the tournament’s best player.\nThe 2022 FIFA World Cup was the 22nd edition of the FIFA World Cup, an international men’s football tournament contested by the senior national teams of the member associations of FIFA. It took place in Qatar from 20 November to 18 December 2022. Argentina won the tournament after defeating France 4–2 in the final. This was A
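
Printing `output` shows the whole completion dictionary. To get just the generated text, index into the `choices` list, as the structure above suggests. The sketch below uses a shortened stand-in for `output` so that it runs on its own; the helper name `completion_text` is illustrative.

```python
def completion_text(completion):
    """Return the stripped text of the first choice in a completion dict."""
    return completion["choices"][0]["text"].strip()


# Shortened stand-in for the `output` dictionary printed above.
sample_output = {
    "object": "text_completion",
    "choices": [{"text": " The answer is: Argentina."}],
}

print(completion_text(sample_output))  # prints: The answer is: Argentina.
```

In the notebook itself, `completion_text(output)` would return the model's full answer. The repeated paragraphs in that answer are likely a side effect of running a base model with greedy decoding until `max_tokens` is reached; lowering `max_tokens` or passing a `stop` sequence to the call is one way to cut the output short.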

# Save the Quantized Model to Google Drive
Files in the Colab runtime are lost when the session ends, so it is a good idea to copy the quantized model to Google Drive for future use.

In [None]:
from google.colab import drive
import shutil

# Mount Google Drive so that /content/drive is available.
drive.mount('/content/drive')

# Create a directory in your Google Drive to store the model.
!mkdir -p "/content/drive/My Drive/quantized_models"

# Copy the quantized model to Google Drive.
src_model_path = '/content/models/llama_3.1_Q4_K_M.gguf'
dest_model_path = "/content/drive/My Drive/quantized_models/llama_3.1_Q4_K_M.gguf"
shutil.copy(src_model_path, dest_model_path)


'/content/drive/My Drive/quantized_models/llama_3.1_Q4_K_M.gguf'
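
For a file of several gigabytes, it can be worth confirming that the copy is byte-for-byte identical to the source. A minimal sketch, assuming both paths are readable from the notebook, compares SHA-256 digests computed in chunks so the file never has to fit in memory. The helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(src, dest):
    """Return True when both files have the same SHA-256 digest."""
    return sha256_of(src) == sha256_of(dest)


# Usage with the paths from the cell above:
# assert copies_match(src_model_path, dest_model_path)
```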

# Upload the Model to HuggingFace Hub
You can also upload the quantized model to HuggingFace Hub for easy access and sharing.


In [None]:

from huggingface_hub import login, HfApi

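# Authenticate to HuggingFace with a token that has write permission.
# Replace the placeholder with your own token and avoid sharing notebooks that contain a real one.
login('hf api key')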

# Initialize HuggingFace API for uploading the model.
api = HfApi()

# Define a new model repository ID for the quantized model.
new_model_id = "vhab10/llama_3.1_8b_Q4_K_M-gguf"

# Create the model repository (if it doesn't already exist).
api.create_repo(new_model_id, exist_ok=True, repo_type="model")

# Upload the quantized model file.
api.upload_file(
    path_or_fileobj=src_model_path,
    path_in_repo="llama_3.1_8b_Q4_K_M.gguf",
    repo_id=new_model_id
)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


llama_3.1_Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vhab10/llama_3.1_8b_Q4_K_M-gguf/commit/318b0c16735b3ab4f98e28da3a900d662acbe3a0', commit_message='Upload llama_3.1_8b_Q4_K_M.gguf with huggingface_hub', commit_description='', oid='318b0c16735b3ab4f98e28da3a900d662acbe3a0', pr_url=None, pr_revision=None, pr_num=None)
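
Rather than pasting the token into a cell, one option is to read it from an environment variable and pass it to `login(token=...)`. The sketch below is a minimal version of that idea; the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption borrowed from the Colab message shown during the download step.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the HuggingFace token stored in `env_var`, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a HuggingFace token with write access"
        )
    return token


# Usage, assuming the variable has been set in the runtime:
# from huggingface_hub import login
# login(token=get_hf_token())
```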