# Llama 3.1 GGUF Example


## Step 0: Config Setting

In [1]:
GGUF_HUGGINGFACE_REPO = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"
GGUF_HUGGINGFACE_BIN_FILE = "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"


In [None]:
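# Optional alternative: a smaller Llama 3.2 3B model.
# This cell was not executed for the outputs shown below; running it overrides the settings above.
GGUF_HUGGINGFACE_REPO = "bartowski/Llama-3.2-3B-Instruct-GGUF"
GGUF_HUGGINGFACE_BIN_FILE = "Llama-3.2-3B-Instruct-f16.gguf"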


## Step 1: Install Python packages


### Takes about 30 minutes to run

In [2]:
import os
import multiprocessing

# Use all available cores
num_cores = multiprocessing.cpu_count()
os.environ['CMAKE_BUILD_PARALLEL_LEVEL'] = str(num_cores)

# Optimize for the current CPU
os.environ['CFLAGS'] = '-march=native -O3'
os.environ['CXXFLAGS'] = '-march=native -O3'

# Use the T4 GPU
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Speed up pip
!pip install --upgrade pip
!mkdir -p /tmp/pip_cache
!pip config set global.cache-dir /tmp/pip_cache

# Install only necessary build dependencies
!apt-get update && apt-get install -y ninja-build

# Use Ninja as the build system
os.environ['CMAKE_GENERATOR'] = 'Ninja'

# Optimize CMake
os.environ['CMAKE_BUILD_TYPE'] = 'Release'

# Install llama-cpp-python with optimized build settings for T4 GPU
!CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

# Verify installation
!python -c "from llama_cpp import Llama; print(Llama.supports_cuda())"

Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 24.3 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Writing to /root/.config/pip/pip.conf
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://developer.download.nvidia.com/co

In [3]:
!apt-get install -y libcairo2-dev

!pip install pycairo jedi

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libblkid-dev libblkid1 libcairo-script-interpreter2 libffi-dev libglib2.0-dev libglib2.0-dev-bin
  libice-dev liblzo2-2 libmount-dev libmount1 libpixman-1-dev libselinux1-dev libsepol-dev
  libsm-dev libxcb-render0-dev libxcb-shm0-dev
Suggested packages:
  libcairo2-doc libgirepository1.0-dev libglib2.0-doc libgdk-pixbuf2.0-bin | libgdk-pixbuf2.0-dev
  libxml2-utils libice-doc cryptsetup-bin libsm-doc
The following NEW packages will be installed:
  libblkid-dev libcairo-script-interpreter2 libcairo2-dev libffi-dev libglib2.0-dev
  libglib2.0-dev-bin libice-dev liblzo2-2 libmount-dev libpixman-1-dev libselinux1-dev libsepol-dev
  libsm-dev libxcb-render0-dev libxcb-shm0-dev
The following packages will be upgraded:
  libblkid1 libmount1
2 upgraded, 15 newly installed, 0 to remove and 54 not upgraded.
Need to get 4,068 kB of archives.
Afte

In [4]:
# The previous cell takes about 5 minutes to run

!pip check

No broken requirements found.


In [5]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [6]:
!nvidia-smi


Mon Nov 18 13:14:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
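
Step 1 passes `-DCMAKE_CUDA_ARCHITECTURES=75`, which matches the Tesla T4 reported above (the model load log later reports compute capability 7.5). On a different GPU that value would need to change. The following is a minimal sketch of deriving it automatically. It assumes the installed driver's `nvidia-smi` supports the `compute_cap` query field, which older drivers may not. The helper names `cuda_arch_from_compute_cap` and `detect_cuda_arch` are illustrative and not part of any library.

```python
import shutil
import subprocess
from typing import Optional


def cuda_arch_from_compute_cap(compute_cap: str) -> str:
    """Turn a compute capability such as '7.5' into the CMake form '75'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_cuda_arch() -> Optional[str]:
    """Ask nvidia-smi for the first GPU's compute capability.

    Returns None when nvidia-smi is missing, the query fails,
    or the output cannot be parsed.
    """
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    if not lines:
        return None
    try:
        return cuda_arch_from_compute_cap(lines[0])
    except ValueError:
        return None


# Fall back to the T4 value used in this notebook when detection fails.
arch = detect_cuda_arch() or "75"
cmake_args = f"-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES={arch}"
print(cmake_args)
```

The resulting string could be assigned to `os.environ['CMAKE_ARGS']` before the `pip install llama-cpp-python` line instead of hard-coding the architecture.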
                                                                    

## Step 2: Download the model


In [7]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id=GGUF_HUGGINGFACE_REPO, filename=GGUF_HUGGINGFACE_BIN_FILE)
model_path

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Meta-Llama-3.1-8B-Instruct-Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

'/root/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf'
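
A multi-gigabyte download can be interrupted or truncated, so a quick check of the file header before loading may save time. The sketch below reads the fixed-size start of a GGUF file. It assumes the little-endian GGUF v2/v3 layout (4-byte magic `GGUF`, a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key-value count). The function name `read_gguf_header` is hypothetical, and the demo writes a tiny synthetic file rather than touching the real model.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Raises ValueError if the file is too short or lacks the GGUF magic.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return version, tensor_count, kv_count


# Demo on a synthetic header only (no tensors or metadata follow it).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gguf")
    with open(demo_path, "wb") as f:
        f.write(GGUF_MAGIC + struct.pack("<IQQ", 3, 292, 33))
    print(read_gguf_header(demo_path))
```

In the notebook, `read_gguf_header(model_path)` would be the call to make; for the file downloaded above, the loader log reports GGUF V3 with 292 tensors and 33 key-value pairs, which is what the synthetic header imitates.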

In [9]:
import os

# Get the number of CPU cores
cpu_count = os.cpu_count()
print(f"Number of CPU cores: {cpu_count}")


Number of CPU cores: 2


## Step 3: Give the prompt and predict the chat completion

### Model Initialisation

In [1]:
from llama_cpp import Llama

llm = Llama(
    model_path='/root/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf',
    n_ctx=60000,
    n_batch=2048,
    n_threads=2,
    n_gpu_layers=28,
    use_mmap=True,
    use_mlock=True,
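    offload_kqv=False,  # constructor name is `offload_kqv` (was `offload_kv`)
    flash_attn=True,
    type_k=8,  # 8 = GGML_TYPE_Q8_0; constructor takes `type_k` (was cache_type_k='q8')
    type_v=8,  # 8 = GGML_TYPE_Q8_0; constructor takes `type_v` (was cache_type_v='q8')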
    verbose=True
)
# Input prompt for AI assistant
prompt = """explain thermodynamics."""

# Format the input with system and user context
prompt_template = f'''
SYSTEM:
"You are a helpful AI assistant."

USER: {prompt}

ASSISTANT:
'''


ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_load_model_from_file: using device CUDA0 (Tesla T4) - 14999 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from /root/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/9a8dec50f04fa8fad1dc1e7bc20a84a512e2bb01/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           
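
The 60000-token context requested above makes the KV cache a large part of the memory budget, which is why the cache type matters. The following back-of-the-envelope sketch estimates its size. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama 3.1 8B configuration, not from the truncated log above. It also assumes 2 bytes per element for f16 and 34 bytes per block of 32 elements for q8_0, and it ignores compute buffers and any context padding llama.cpp applies.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Approximate combined size of the key and value caches in bytes."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return n_ctx * elements_per_token * bytes_per_element


# Assumed Llama 3.1 8B attention layout (see the lead-in).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

F16_BYTES = 2.0        # one 16-bit float per element
Q8_0_BYTES = 34 / 32   # 32 int8 values plus one f16 scale per block

for label, bytes_per_element in (("f16", F16_BYTES), ("q8_0", Q8_0_BYTES)):
    size = kv_cache_bytes(60000, bytes_per_element=bytes_per_element, **LLAMA_3_1_8B)
    print(f"{label}: {size / 2**30:.2f} GiB")
# f16: 7.32 GiB
# q8_0: 3.89 GiB
```

These figures are estimates only. With `verbose=True`, the load log prints the actual KV buffer sizes, which are the numbers to trust when tuning `n_ctx` and `n_gpu_layers` for the T4's 15360 MiB.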

In [None]:
print("2. STREAMED RESPONSE:")
print("=" * 50)

# Generating response using the Llama model with streaming
stream = llm(
    prompt=prompt_template,
    max_tokens=50000,
    temperature=0.8,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    echo=False,
    stream=True,
    stop=None
)


response = ''

# Stream the response real-time
print("\nModel response:")
for output in stream:
    if 'choices' in output and len(output['choices']) > 0:
        text_output = output['choices'][0].get('text', '').replace('\r', '')
        response += text_output
        print(text_output, end='', flush=True)

print("\n" + "=" * 50)


2. STREAMED RESPONSE:

Model response:
Thermodynamics is the branch of physics that deals with heat, work, temperature, and their relation to energy, radiation, and physical properties of matter. The three laws of thermodynamics are fundamental principles that describe how these quantities interact.

The Zeroth Law of Thermodynamics states that if two systems are in thermal equilibrium with a third system, then they are also in thermal equilibrium with each other.

The First Law of Thermodynamics is the law of energy conservation and can be expressed as ΔE = Q - W. This means that the change in internal energy (ΔE) is equal to the heat added to a system (Q) minus the work done on or by the system (W).

The Second Law of Thermodynamics states that it's impossible for any process to convert all the heat put into a system completely and efficiently into useful work.

The Third Law of Thermodynamics describes how entropy approaches absolute zero as temperature is lowered. In other words, i
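
The prompt in Step 3 uses a plain SYSTEM/USER/ASSISTANT layout. Llama 3.1 Instruct models were trained with a header-token chat format, so a prompt in that format may give more reliable turn boundaries. The sketch below builds such a prompt. The function name `build_llama3_prompt` is hypothetical, and the special tokens follow Meta's published Llama 3 prompt format. The leading `<|begin_of_text|>` token is deliberately left out on the assumption that the tokenizer adds a BOS token itself; that is worth verifying for the installed llama-cpp-python version.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Format one system message and one user message in the Llama 3 chat layout."""

    def block(role: str, content: str) -> str:
        # Each turn is a role header, a blank line, the text, then an end-of-turn token.
        return (
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{content.strip()}<|eot_id|>"
        )

    # The trailing assistant header cues the model to write the assistant's reply.
    return (
        block("system", system)
        + block("user", user)
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt_template = build_llama3_prompt(
    "You are a helpful AI assistant.",
    "explain thermodynamics.",
)
print(prompt_template)
```

The result can be passed as `prompt=prompt_template` to the same streaming call, ideally with `stop=["<|eot_id|>"]` so generation ends at the end of the assistant turn. Another option is `llm.create_chat_completion(messages=[...])`, which applies a chat template for you instead of requiring a hand-built string.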