# Running Inference with a Quantized Version of Llama 3 8B

## Requirements

1. You'll need to create a HuggingFace account and access token:
   1. Create an account on [HuggingFace](https://huggingface.co).
   2. Once logged into your account, click your profile picture in the upper right corner and navigate to Settings > Access Tokens.
   3. Click New Token and generate a new token. I made mine a "Write" token, but it shouldn't matter whether it's a "Read" or "Write" token for this script.
2. Make sure you're running Python 3.9 or later; you can check with the `python --version` command.
3. Packages you'll need to have installed:
   1. huggingface_hub
   2. jupyter
   3. llama-cpp-python
      - This package also needs a C/C++ compiler to build, since it provides Python bindings for C/C++ code.
        - For Windows, use [Microsoft's Visual Studio](https://visualstudio.microsoft.com/vs/features/cplusplus/).
        - For Linux, use [gcc](https://gcc.gnu.org/) or [clang](https://clang.llvm.org/).
        - For Mac, have [Xcode](https://apps.apple.com/us/app/xcode/id497799835?mt=12) installed.
4. Installing packages
   1. Create and activate a Python virtual environment:
      1. `python -m venv .env`
      2. Activate the environment:
         1. Windows: `.env\Scripts\activate`
         2. Linux/Mac: `source .env/bin/activate`
      3. If you were able to get a C compiler:
         - `pip install --upgrade --upgrade-strategy eager --no-cache-dir huggingface_hub jupyter llama-cpp-python`
      4. If you were unable to get a C compiler:
         - `pip install --upgrade --upgrade-strategy eager --no-cache-dir huggingface_hub jupyter llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`
      - I've included the `--upgrade` and `--upgrade-strategy eager` flags in case you're doing this in an existing virtual environment. They make pip upgrade the packages and their dependencies if they're already installed, so you're working with the latest stable versions of everything.
5. Set up git & git-lfs:
   1. Download [git](https://git-scm.com/downloads) if you don't already have it installed.
      1. Set up your git account and verify that you're able to clone a private repo (it doesn't matter whether the repo has anything in it; you just need to confirm that git works properly).
   2. Follow the git-lfs install guide: [git-lfs](https://github.com/git-lfs/git-lfs?utm_source=gitlfs_site&utm_medium=installation_link&utm_campaign=gitlfs#installing).
   3. If you didn't run it as part of the guide, run the command `git lfs install` once git is set up and git-lfs is installed.
6. Set up huggingface-cli:
   1. Copy the access token that you made in step 1 to your clipboard.
   2. Run the command `huggingface-cli login` and paste your access token when prompted.
   3. I said yes when asked to add the token to my git credentials, but I don't think this is necessary.
7. Continue!
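
Before moving on, it can help to confirm that the environment matches the requirements above. The following is a minimal sketch, not part of the original notebook: the `check_environment` helper and the `REQUIRED_MODULES` list are hypothetical names introduced here for illustration. It only checks the Python version and whether the modules can be found, and it leaves out `jupyter` on the assumption that you are already running inside it.

```python
import importlib.util
import sys

# Import names can differ from pip names: llama-cpp-python is imported as llama_cpp.
# (Hypothetical list for this sketch; adjust it to your own setup.)
REQUIRED_MODULES = ["huggingface_hub", "llama_cpp"]


def check_environment(min_version=(3, 9), modules=REQUIRED_MODULES):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_version:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ is required, found {found}"
        )
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append(f"Module '{name}' is not installed in this environment")
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, both packages were found and the interpreter is new enough.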

## Import Required Libraries

In [1]:
# Import llama-cpp-python and the built-in timeit module

# For working with the model
from llama_cpp import Llama

# For timing how long the steps take
from timeit import default_timer as timer

## Acquire the model that we want to use

In this case, I'll be using a quantized version of the Llama 3 8B model. Quantization was done by QuantFactory.

[HuggingFace page for the Quantized Model](https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2)

[HuggingFace page for the Regular Llama 3 Model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

### A Note on Acquiring the Model

The first time you ever run the below code on a machine it will take a fairly significant amount of time, mainly depending on the speed of your internet connection and how fast your hard drive & RAM are. When running it on the ISPM lab computer it took around 12 minutes to download the model.

After the first time running it on a machine, it usually takes anywhere from 1 to 5 seconds, depending on your hardware, to load the model into memory. On the ISPM lab computer it consistently took 3 seconds with no browser tabs open and if I had a lot of tabs open it would take around 5 seconds. On my personal laptop it was taking around 1-2 seconds with or without browser tabs open.

Alternatively, you can download the model manually and pass the path to the file as a parameter so that the code loads the model from there.

#### Loading a Model from a Manual Download

1. Download a model and store it on your machine somewhere.
   1. For the sake of this example, I'll use a hard path on a Windows 11 machine of: `C:\Users\DYLANGRESHAM\Downloads\LLMs\example_model.gguf`.
   2. Replace the `llm = Llama.from_pretrained(...)` line with the below code to load the model:
```python
# Define the path to the model on your machine
path_to_model = 'C:\\Users\\DYLANGRESHAM\\Downloads\\LLMs\\example_model.gguf'

# On Linux this would be:
# path_to_model = '/home/DYLANGRESHAM/Downloads/LLMs/example_model.gguf'

# Load the model from the downloaded model file
llm = Llama(model_path=path_to_model)
```
   3. Proceed with the script as normal.
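
When loading from a manual download, a wrong path or an incomplete file is an easy mistake to make. The sketch below is one way to sanity-check the file before handing it to `Llama`. It relies on GGUF files starting with the four bytes `GGUF`; the `looks_like_gguf` helper is a hypothetical name introduced for this example, and passing the check does not prove the rest of the file is intact.

```python
from pathlib import Path


def looks_like_gguf(path):
    """Return True if the file exists and starts with the 4-byte GGUF magic."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == b"GGUF"


# A path that does not exist simply fails the check
print(looks_like_gguf("no/such/model.gguf"))
```

In the notebook you could call `looks_like_gguf(path_to_model)` right before `Llama(model_path=path_to_model)` and stop early if it returns `False`.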

In [2]:
print('Acquiring the LLM...')
# Get start time for getting the model
llm_start = timer()

# Pull down the model from HuggingFace
llm = Llama.from_pretrained(
    # Specify which HuggingFace repository the model is in
    repo_id='QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2',
    # Specify the name of the model file to download
    filename='Meta-Llama-3-8B-Instruct-v2.Q6_K.gguf',
    verbose=False
)
# You can also download the model in advance and tell llama-cpp-python to just pull it from a local file
# llm = Llama(
#     model_path="relative/file/path/to/model"
# )

# Get end time for getting the model
llm_end = timer()
print('LLM acquired!')

Acquiring the LLM...
LLM acquired!


## Perform Inference

In [3]:
print('Performing inference...')
# Get start time for inference
inference_start = timer()

# Start the inference using the high-level API provided by llama-cpp-python
output = llm.create_chat_completion(
    # Define the message template for the conversation
    messages=[
        # Define how the LLM is to act (the system message)
        {
            "role": "system",
            "content": "You are an assistant who perfectly describes large language models imitating the speech style of pirates."
        },
        # Define what the user's prompt is for the LLM.
        {
            "role": "user",
            "content": "Tell me what a LLM is."
        }
    ]
)

# Get end time for inference
inference_end = timer()
print('Inference completed!')

Performing inference...
Inference completed!


## Compute the time it took to acquire the model

In [4]:
# Compute time taken to acquire the model
llm_elapsed_time = llm_end - llm_start
llm_mins, llm_secs = divmod(llm_elapsed_time, 60)
llm_hours, llm_mins = divmod(llm_mins, 60)

## Compute the time it took for inference

In [5]:
# Compute time taken for inference
inference_elapsed_time = inference_end - inference_start
inference_mins, inference_secs = divmod(inference_elapsed_time, 60)
inference_hours, inference_mins = divmod(inference_mins, 60)
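
The two timing cells above repeat the same `divmod` arithmetic. As a sketch of how that could be folded into one place, the hypothetical `format_elapsed` helper below (not part of the original notebook) rounds to whole seconds first, which also avoids printing something like "60 seconds" when a value such as 59.6 is formatted with `:.0f`.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes, and seconds."""
    # Round once up front so the parts always add up to the same total
    total_seconds = int(round(seconds))
    mins, secs = divmod(total_seconds, 60)
    hours, mins = divmod(mins, 60)
    return f"{hours} hours, {mins} minutes, and {secs} seconds"


print(format_elapsed(36.2))    # 0 hours, 0 minutes, and 36 seconds
print(format_elapsed(3725.4))  # 1 hours, 2 minutes, and 5 seconds
```

With this helper, the results could be printed as `format_elapsed(llm_end - llm_start)` and `format_elapsed(inference_end - inference_start)`.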

## Print Results

In [6]:
# Printing results
print(f"Acquiring the model took: {llm_hours:.0f} hours, {llm_mins:.0f} minutes, and {llm_secs:.0f} seconds")
print(f"Performing inference took: {inference_hours:.0f} hours, {inference_mins:.0f} minutes, and {inference_secs:.0f} seconds")
print()
print(output["choices"][0]["message"]["content"])

Acquiring the model took: 0 hours, 0 minutes, and 1 seconds
Performing inference took: 0 hours, 0 minutes, and 36 seconds

Arrrr, ye landlubber! Ye be askin' about Large Language Models (LLMs), eh? Alright then, listen up!

A Large Language Model, me hearty, be a type o' artificial intelligence (AI) that's designed to process and generate human-like language. It's a swashbucklin' behemoth o' code that's trained on vast amounts o' text data, allowing it to learn the patterns and structures o' language.

These LLMs be built using complex algorithms and neural networks, which enable 'em to analyze and understand the nuances o' language, including syntax, semantics, and pragmatics. They can generate text that's coherent, natural-soundin', and even creative, like a trusty parrot on yer shoulder!

LLMs be used in a variety o' applications, such as:

1. Natural Language Processing (NLP): LLMs can be used to analyze and understand human language, enabling tasks like language translation, senti
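
The reply above stops mid-word. One possible cause, which is an assumption on my part rather than something this run confirms, is that generation hit a token limit (for example the context size `n_ctx` set when constructing `Llama`, or the `max_tokens` argument of `create_chat_completion`). The returned dictionary follows the OpenAI-style layout, so the `finish_reason` field can be checked. The sketch below uses a hand-written `sample_output` with the same shape so that it runs without a model; `summarize_completion` is a hypothetical helper name.

```python
def summarize_completion(output):
    """Pull the reply text, finish reason, and token count out of a chat completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Made-up response used only to illustrate the shape; the values are not real results
sample_output = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Arrrr, ye landlubber!"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 40, "completion_tokens": 8, "total_tokens": 48},
}

summary = summarize_completion(sample_output)
if summary["finish_reason"] == "length":
    print("The reply was cut off by a token limit")
```

In the notebook, `summarize_completion(output)` would report whether the real reply ended with `"stop"` (the model finished on its own) or `"length"` (it ran out of tokens).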

# So What is a Quantized Model?

Quantized models are essentially translations of a particular model. In the case of this notebook, I've made it use a quantized version of Llama 3 8B Instruct stored in the GGUF format. GGUF is the file format used by llama.cpp (and therefore llama-cpp-python); it's just a format for saving models, including quantized ones, that's efficient for inference purposes.

## What Exactly is Getting Translated?

Short answer: the parameters of the base model.

Longer answer:

All of the parameters of an LLM are stored in a single data type, usually `fp32`, `fp16`, or `bf16`. `fp##` stands for ##-bit floating point and `bf16` stands for 16-bit Brain Floating Point (bfloat16). The `bf16` data type was developed by Google to be a more efficient data type for LLMs and machine learning in general. It's a modification of the IEEE-754 formats that uses the available bits differently: it keeps the 8 exponent bits of `fp32`, so it covers the same range of values, and cuts the mantissa down to 7 bits, trading precision for range compared to `fp16`.
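
To make the relationship between `fp32` and `bf16` concrete, here is a small sketch using only the standard library. It treats `bf16` as the upper 16 bits of an `fp32` value (1 sign bit, 8 exponent bits, 7 mantissa bits). The helper names are hypothetical, and the sketch simply truncates the lower bits, whereas real conversions usually round to nearest.

```python
import struct


def float_to_bf16_bits(value):
    """Return the 16-bit pattern left after dropping the low 16 bits of a float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16


def bf16_bits_to_float(bits16):
    """Expand a 16-bit bf16 pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


original = 3.14159
as_bf16 = bf16_bits_to_float(float_to_bf16_bits(original))
print(f"{original} stored as bf16 becomes {as_bf16}")
# prints: 3.14159 stored as bf16 becomes 3.140625
```

The value keeps its magnitude but loses digits, which is the range-for-precision trade that `bf16` makes.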

Quantization converts the parameters of an LLM from their base data type to a new, smaller data type. In the case of the model I've used here, Llama 3 8B Instruct initially used `bf16` as the data type for its parameters, and the quantized model that has been loaded here is a version of Llama 3 8B Instruct that's had its weights converted from `bf16` to 6-bit integers (the `Q6_K` in the file name). Strictly speaking this isn't a plain `int6` type: the 6-bit values are stored in blocks together with scale factors that are used to recover approximate floating point values, which works out to roughly 6.5 bits per weight. This allows the model to take substantially less memory, and since less data has to be moved from memory for every generated token, it also provides a fairly substantial decrease in inference time.
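
The idea of converting floating point parameters to small integers can be sketched in a few lines. This is a deliberately simplified illustration under my own assumptions (one shared scale per block, symmetric signed range) and is not the actual layout llama.cpp uses for its quantized formats. `quantize_block` and `dequantize_block` are hypothetical names.

```python
def quantize_block(weights, bits=6):
    """Map a block of floats to signed integers plus one shared scale factor."""
    q_max = 2 ** (bits - 1) - 1  # 31 for 6 bits, so values land in [-31, 31]
    largest = max(abs(w) for w in weights)
    scale = largest / q_max if largest > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


block = [0.12, -0.5, 0.031, 0.25]
quantized, scale = quantize_block(block)
restored = dequantize_block(quantized, scale)

for original, approx in zip(block, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each restored value differs from the original by at most half a scale step, which is where the small loss in quality comes from.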

Of course, quantization does come with its drawbacks. Any quantization will make the model somewhat "dumber": quantized models typically just round each value to the nearest one the smaller data type can represent and don't do any sort of re-training or fine-tuning, which runs the risk of certain parameters shifting in value and messing with the results of inference. However, this only really comes into effect when doing quantization at low levels such as Q_2 or Q_4. Those levels are where the effects of quantization become apparent; Q_6 (which is what I've used here) and Q_8 are typically hard to distinguish from the base model in terms of output.

## Why's Quantization Used?

Primarily for running inference quicker and lessening hardware requirements. When quantizing a model, the parameters literally decrease in the number of bits they use, so the model file gets smaller. It takes up less space on your hard drive AND, when you load the model into memory for inference, it takes up less RAM. This is a big deal in LLMs. Currently (May 2024), the biggest limiting factor for doing anything LLM-related is memory. Of course, CPU/GPU speeds are a big deal, but the speed of your CPU/GPU doesn't matter if you can't get the data to the CPU/GPU in time.

A big downside to larger LLMs is that their parameters don't fit into memory on most consumer devices. At certain stages of inference, the computer has to pause computations, unload the parameters currently in memory, load the parameters that haven't been used so far, and then continue. This causes a significant slow-down, and it's why, when you look into what kind of hardware is needed, you'll see people saying to aim for a GPU with 12-24 GB of VRAM; otherwise it's really not worth it.
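
The memory argument in this section comes down to simple arithmetic. The sketch below is a rough estimate under stated assumptions: a round 8 billion parameters, a 12 GB memory budget picked from the lower end of the range mentioned above, and no allowance for per-block scale factors, the context cache, or other runtime overhead. The `model_size_gb` helper is a hypothetical name.

```python
def model_size_gb(n_params, bits_per_weight):
    """Estimate the size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9          # assumption: a round 8 billion parameters
MEMORY_BUDGET_GB = 12   # assumption: lower end of the suggested VRAM range

for label, bits in [("bf16", 16), ("8-bit", 8), ("6-bit", 6), ("4-bit", 4)]:
    size = model_size_gb(N_PARAMS, bits)
    verdict = "fits" if size <= MEMORY_BUDGET_GB else "does not fit"
    print(f"{label}: about {size:.1f} GB, {verdict} in {MEMORY_BUDGET_GB} GB")
```

Under these assumptions the `bf16` weights alone need about 16 GB, while the 6-bit version needs about 6 GB, which is the difference between spilling out of a 12 GB budget and fitting comfortably.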