We use llama.cpp to evaluate an LLM, following this article:

https://towardsdatascience.com/gguf-quantization-with-imatrix-and-k-quantization-to-run-llms-on-your-cpu-02356b531926

We only use the evaluation part of the article.

# Install llama.cpp

In [None]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
# !export GGML_CUDA=1 
!cp -r /usr/local/cuda-12.3/targets /usr/local/nvidia/ 
!make GGML_CUDA=1 CUDA_PATH=/usr/local/nvidia  > make.log 2>&1

In [None]:
!tail make.log


In [None]:
!pip install --force-reinstall -r requirements.txt > pip_install.log 2>&1
!tail pip_install.log

# Download the model and convert it to GGUF

## Prepare huggingface token

In [None]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient


user_secrets = UserSecretsClient()

os.environ["HF_TOKEN"]=user_secrets.get_secret("HUGGINGFACE_TOKEN")

login(os.environ["HF_TOKEN"])

## Download the Hugging Face model

In [None]:
from huggingface_hub import snapshot_download
model_name = "google/gemma-2-2b-it" # the model we want to quantize
methods = ['Q4_K_S','Q4_K_M'] #the methods to be used for quantization
base_model = "./original_model_gemma2-2b/" # where the downloaded Hugging Face model will be stored
quantized_path = "./quantized_model_gemma2-2b/" # where the GGUF models (FP16 and quantized) will be stored
original_model = quantized_path + 'FP16.gguf' # path of the FP16 GGUF model produced by the conversion step

snapshot_download(repo_id=model_name, local_dir=base_model , local_dir_use_symlinks=False)


## Convert the model to GGUF

In [None]:
!mkdir -p /kaggle/working/llama.cpp/quantized_model_gemma2-2b/
!python convert_hf_to_gguf.py "/kaggle/working/llama.cpp/original_model_gemma2-2b/" --outfile "/kaggle/working/llama.cpp/quantized_model_gemma2-2b/FP16.gguf"

## Get wiki text as dataset

In [None]:
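# OPUS Wikipedia dump; keep only the first 10,000 lines as a smaller sample
!wget https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/mono/en.txt.gz
!gunzip en.txt.gz
!head -n 10000 en.txt > en-h10000.txt
# WikiText-2; its wiki.test.raw file is used for the perplexity benchmark
!sh scripts/get-wikitext-2.sh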

# Benchmarking the Perplexity 

Perplexity can be used to compare a model before and after quantization or another modification. The following article explains why perplexity cannot be used to compare different models:

https://thesalt.substack.com/p/why-cant-we-compare-the-perplexity
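
As a rough illustration of what the benchmark measures, perplexity can be written as the exponential of the mean negative log-likelihood per token. The sketch below is not how llama.cpp computes it internally; the `perplexity` helper and the log-probability values are hypothetical and only show the arithmetic. Because the average is taken per token, the numbers are only comparable when both runs use the same tokenizer and the same text, which is the case for a model before and after quantization.

```python
import math

def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities assigned to the same four tokens
# by an FP16 model and by a quantized copy of it (made-up values).
fp16_logprobs = [-1.2, -0.3, -2.1, -0.8]
q4_logprobs = [-1.3, -0.4, -2.3, -0.9]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_q4 = perplexity(q4_logprobs)

# A ratio close to 1 suggests the quantized model lost little accuracy on this text.
print(f"FP16: {ppl_fp16:.3f}  quantized: {ppl_q4:.3f}  ratio: {ppl_q4 / ppl_fp16:.3f}")
```

A lower perplexity means the model assigns higher probability to the evaluation text, so a quantized model is usually expected to score slightly higher than its FP16 original.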


In [None]:
!./llama-perplexity -m /kaggle/working/llama.cpp/quantized_model_gemma2-2b/FP16.gguf -f wikitext-2-raw/wiki.test.raw --chunks 16

# Benchmarking the Inference Throughput and Memory Consumption 

In [None]:
!./llama-bench -m /kaggle/working/llama.cpp/quantized_model_gemma2-2b/FP16.gguf -n 16 -mg 1