# Having a Conversation with a Quantized Version of Llama 3 8B

## Requirements

1. You'll need to create a HuggingFace account and access token:
   1. Create an account on [HuggingFace](https://huggingface.co).
   2. Once logged into your account, click your profile picture in the upper right corner and navigate to Settings > Access Tokens.
   3. Click New Token and generate a new token. I made mine a "Write" token, but either a "Read" or a "Write" token should work for this script.
2. Verify that your Python version is 3.9 or later by running `python --version`.
3. Packages you'll need to have installed:
   1. huggingface_hub
   2. jupyter
   3. llama-cpp-python
      - This package also requires a C compiler because it provides Python bindings for C/C++ code.
        - For Windows, use [Microsoft's Visual Studio](https://visualstudio.microsoft.com/vs/features/cplusplus/).
        - For Linux, use [gcc](https://gcc.gnu.org/) or [clang](https://clang.llvm.org/).
        - For Mac, have [Xcode](https://apps.apple.com/us/app/xcode/id497799835?mt=12) installed.
4. Installing packages
   1. Create and activate a Python virtual environment:
      1. `python -m venv .env`
      2. Activate the environment:
         1. Windows: `.env\Scripts\activate`
         2. Linux/Mac: `source .env/bin/activate`
      3. If you were able to get a C compiler:
         - `pip install --upgrade --upgrade-strategy eager --no-cache-dir huggingface_hub jupyter llama-cpp-python`
      4. If you were unable to get a C compiler:
         - `pip install --upgrade --upgrade-strategy eager --no-cache-dir huggingface_hub jupyter llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu`
      - I've included the `--upgrade` and `--upgrade-strategy eager` flags in case you're working in an existing virtual environment or have previously tried to install the packages without success. These flags cause pip to upgrade the packages and their dependencies if they're already installed, so you end up with the latest stable versions of everything.
5. Set up git & git-lfs:
   1. Download [git](https://git-scm.com/downloads) if you don't already have it installed.
      1. Set up your git account and verify that you can clone a private repo. The repo doesn't need to contain anything; the point is to confirm that git and your credentials work.
   2. Follow the [git-lfs install guide](https://github.com/git-lfs/git-lfs?utm_source=gitlfs_site&utm_medium=installation_link&utm_campaign=gitlfs#installing).
   3. If you didn't already run it as part of the guide, run `git lfs install` once git is set up and git-lfs is installed.
6. Set up huggingface-cli:
   1. Copy your access token that you made in step 1 to your clipboard.
   2. Run the command `huggingface-cli login` and paste your access token when prompted.
   3. I answered yes when asked whether to add the token to my git credentials, but I don't think this is necessary.
7. Continue!
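
Before continuing, it can help to confirm from inside the virtual environment that the requirements above are met. The snippet below is a minimal sketch, not part of the original notebook: the helper names `python_is_new_enough` and `missing_modules` are made up for illustration, and it only checks the two packages whose import names (`huggingface_hub` and `llama_cpp`) are used later on.

```python
import importlib.util
import sys

# Minimum Python version taken from the requirements list above
MIN_PYTHON = (3, 9)

# Import names of the packages this notebook relies on
# (llama-cpp-python is imported as llama_cpp)
REQUIRED_MODULES = ["huggingface_hub", "llama_cpp"]


def python_is_new_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_is_new_enough())
print("Missing packages:", missing_modules(REQUIRED_MODULES))
```

If the second line prints a non-empty list, re-run the `pip install` command from step 4 with the virtual environment activated.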

## Import Required Libraries

In [1]:
# Import llama-cpp-python and the built-in timeit module

# For working with the model
from llama_cpp import Llama

# For timing how long the steps take
from timeit import default_timer as timer

# For easier examination of the outputs
import json

# For checking the number of cores the machine has
import os

## Acquire the model that we want to use

In this case, I'll be using a quantized version of the Llama 3 8B model. Quantization was done by QuantFactory.

[HuggingFace page for the Quantized Model](https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2)

[HuggingFace page for the Regular Llama 3 Model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

### A Note on Acquiring the Model

The first time you run the code below on a machine, it will take a significant amount of time, depending mainly on the speed of your internet connection and on how fast your hard drive and RAM are. On the ISPM lab computer, it took around 12 minutes to download the model.

After that first run, loading the model into memory usually takes 1 to 5 seconds, depending on your hardware. On the ISPM lab computer it consistently took 3 seconds with no browser tabs open and around 5 seconds with a lot of tabs open. On my personal laptop it took around 1-2 seconds with or without browser tabs open.

Alternatively, you can download the model manually and pass the path to the model file as a parameter so the code loads it from disk.

#### Loading a Model from a Manual Download

1. Download a model and store it on your machine somewhere.
   1. For the sake of this example, I'll use a hard-coded path on a Windows 11 machine: `C:\Users\DYLANGRESHAM\Downloads\LLMs\example_model.gguf`.
   2. Replace the `llm = Llama.from_pretrained(...)` line with the below code to load the model:
```python
# Define the path to the model on your machine
path_to_model = 'C:\\Users\\DYLANGRESHAM\\Downloads\\LLMs\\example_model.gguf'

# Linux this would be:
# path_to_model = '/home/DYLANGRESHAM/Downloads/LLMs/example_model.gguf'

# Load the model from the downloaded model file
llm = Llama(model_path=path_to_model)
```
   3. Proceed with the script as normal.
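
When loading from a manual download, a mistyped path is an easy mistake to make. The sketch below checks that the file exists before it is handed to `Llama(model_path=...)`. The helper name `resolve_model_path` is hypothetical and not part of llama-cpp-python; it only uses the standard library's `pathlib`.

```python
from pathlib import Path


def resolve_model_path(path_str):
    """Return the model path as a string, or raise if no file is there."""
    # expanduser() lets you write paths such as '~/Downloads/LLMs/model.gguf'
    model_path = Path(path_str).expanduser()
    if not model_path.is_file():
        raise FileNotFoundError(f"No model file found at: {model_path}")
    return str(model_path)


# Example usage with the path from the steps above (left commented out,
# since the file only exists if you downloaded it there):
# path_to_model = resolve_model_path(
#     'C:\\Users\\DYLANGRESHAM\\Downloads\\LLMs\\example_model.gguf'
# )
# llm = Llama(model_path=path_to_model)
```

Failing early with a clear message this way is just a convenience; passing the path string directly, as in the steps above, works the same when the file is present.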

In [2]:
print('Acquiring the LLM...')
# Get start time for getting the model
llm_start = timer()

# Pull down the model from HuggingFace
llm = Llama.from_pretrained(
    # Specify which HuggingFace repository the model is in
    repo_id='QuantFactory/Meta-Llama-3-8B-Instruct-GGUF-v2',
    # Specify the name of the model file to download
    filename='Meta-Llama-3-8B-Instruct-v2.Q6_K.gguf',
    n_threads=os.cpu_count(),  # Set the number of threads to the number of logical CPU cores on the machine
    n_gpu_layers=-1,  # Comment this out if GPU acceleration isn't desired, or isn't available.
    n_ctx=8192,
    verbose=False
)

# Get end time for acquiring the model
llm_end = timer()
print('LLM acquired!')

Acquiring the LLM...
LLM acquired!


## Perform Inference

In [3]:
print('Performing inference...')
# Get start time for inference
inference_start = timer()

chat = [
    # Define how the system (the LLM) is to act
    {
        "role": "system",
        "content": "You are an assistant who perfectly describes large language models imitating the speech style of pirates."
    },
    # Define what the user's prompt is for the LLM.
    {
        "role": "user",
        "content": "Tell me what a LLM is."
    }
]

# Start the inference using the high-level API provided by llama-cpp-python
output = llm.create_chat_completion(
    # Pass the list of messages that make up the conversation
    messages=chat
)

# Get end time for inference
inference_end = timer()
print('Inference completed!')

Performing inference...
Inference completed!


## Compute the time it took to acquire the model

In [4]:
# Compute time taken to acquire the model
llm_elapsed_time = llm_end - llm_start
llm_mins, llm_secs = divmod(llm_elapsed_time, 60)
llm_hours, llm_mins = divmod(llm_mins, 60)

## Compute the time it took for inference

In [5]:
# Compute time taken for inference
inference_elapsed_time = inference_end - inference_start
inference_mins, inference_secs = divmod(inference_elapsed_time, 60)
inference_hours, inference_mins = divmod(inference_mins, 60)
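
The two timing cells above repeat the same `divmod` arithmetic. As a sketch, the conversion can be wrapped in one small helper; the name `format_elapsed` is my own and does not appear in the original notebook.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes, and seconds."""
    # Split total seconds into whole minutes and leftover seconds
    mins, secs = divmod(seconds, 60)
    # Split whole minutes into whole hours and leftover minutes
    hours, mins = divmod(mins, 60)
    return f"{hours:.0f} hours, {mins:.0f} minutes, and {secs:.0f} seconds"


# Example with a fixed duration of 1 hour, 2 minutes, and 3.4 seconds
print(format_elapsed(3723.4))
```

With this helper, the print cell could call `format_elapsed(llm_end - llm_start)` and `format_elapsed(inference_end - inference_start)` directly. Note that the seconds are rounded for display, the same as in the original cells.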

## Print Results

In [6]:
# Printing results
print(f"Acquiring the model took: {llm_hours:.0f} hours, {llm_mins:.0f} minutes, and {llm_secs:.0f} seconds")
print(f"Performing inference took: {inference_hours:.0f} hours, {inference_mins:.0f} minutes, and {inference_secs:.0f} seconds")
print()
print(output["choices"][0]["message"]["content"])

Acquiring the model took: 0 hours, 0 minutes, and 2 seconds
Performing inference took: 0 hours, 0 minutes, and 3 seconds

Arrrr, ye landlubber! Ye be askin' about Large Language Models, eh? Alright then, matey! A Large Language Model, or LLM for short, be a type o' artificial intelligence that's as clever as a parrot on yer shoulder.

An LLM be a computer program that's trained on a vast treasure trove o' text data, like books, articles, and even the internet itself! It uses this booty to learn the patterns and structures o' language, so it can generate text that's as smooth as a fine bottle o' rum.

These scurvy dogs can do all sorts o' things, like:

1. Understandin' natural language: They can read and comprehend human language, just like a trusty first mate.
2. Generatin' text: They can create their own text, like a swashbucklin' pirate writin' a treasure map.
3. Answerin' questions: They can respond to questions, like a wise old sea dog navigatin' through treacherous waters.
4. Tra

In [7]:
# Print full output in JSON format for inspection
output_as_json = json.dumps(output, indent=4)
print(output_as_json)

{
    "id": "chatcmpl-c932036d-885d-4b3a-8b7b-371ee969fe64",
    "object": "chat.completion",
    "created": 1726717158,
    "model": "/home/midge/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-8B-Instruct-GGUF-v2/snapshots/94f17b2f2d72645fce9555f0395954a34db24e1e/./Meta-Llama-3-8B-Instruct-v2.Q6_K.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Arrrr, ye landlubber! Ye be askin' about Large Language Models, eh? Alright then, matey! A Large Language Model, or LLM for short, be a type o' artificial intelligence that's as clever as a parrot on yer shoulder.\n\nAn LLM be a computer program that's trained on a vast treasure trove o' text data, like books, articles, and even the internet itself! It uses this booty to learn the patterns and structures o' language, so it can generate text that's as smooth as a fine bottle o' rum.\n\nThese scurvy dogs can do all sorts o' things, like:

In [8]:
# Conversation prompt two
next_chat = {
    'role': 'user',
    'content': 'Now that I know about Large Language Models, what can you tell me about the RAG concept?'
}

chat.append(next_chat)

new_output = llm.create_chat_completion(messages=chat)

new_output_as_json = json.dumps(new_output, indent=4)
print(new_output_as_json)

{
    "id": "chatcmpl-aaf18141-6b17-444b-ba44-1372c77d3a66",
    "object": "chat.completion",
    "created": 1726717161,
    "model": "/home/midge/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-8B-Instruct-GGUF-v2/snapshots/94f17b2f2d72645fce9555f0395954a34db24e1e/./Meta-Llama-3-8B-Instruct-v2.Q6_K.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 68,
        "completion_tokens": 231,
        "total_tokens": 299
    }
}
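
The JSON output above has a regular shape, so the fields that are usually of interest can be pulled out without printing the whole structure. The following is a minimal sketch; `summarize_completion` is a hypothetical helper, and `sample_response` is a hand-written dictionary that only mirrors the keys and token counts shown above, with placeholder text for the message content.

```python
def summarize_completion(response):
    """Extract the reply text, finish reason, and token usage from a response."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "content": choice["message"].get("content", ""),
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written stand-in with the same structure as the output above
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "placeholder reply"},
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 68, "completion_tokens": 231, "total_tokens": 299},
}

summary = summarize_completion(sample_response)
print(summary["finish_reason"], summary["prompt_tokens"], summary["completion_tokens"])
```

A `finish_reason` of `"stop"` indicates the model ended its reply on its own. As far as I know, a value of `"length"` would instead mean the reply was cut off by a token limit, which is worth checking whenever a response appears to end mid-sentence.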


In [9]:
print(new_output['choices'][0]['message']['content'])

Arrr, ye be wantin' to know about RAG, eh? Alright then, matey! RAG stands for "Rationalized Attention Guided" – a concept that helps Large Language Models (LLMs) like meself navigate the vast seas of language.


RAG is particularly useful for tasks like text classification, sentiment analysis, and question answering, where the model needs to extract specific information from a text. By using RAG, LLMs like meself can improve our accuracy and efficiency, makin' us even more formidable language warriors!

So hoist the colors, me hearty, and remember that RAG be the key to unlockin' the secrets of the language seas!


In [10]:
# Check that the conversation has been getting tracked by the LLM so far
verification_chat = {
    'role': 'user',
    'content': 'What have we talked about so far?'
}

chat.append(verification_chat)

verification_output = llm.create_chat_completion(messages=chat)
print(json.dumps(verification_output, indent=4))

{
    "id": "chatcmpl-9ee80ce1-76de-4f76-b7b3-c9d66ee1025b",
    "object": "chat.completion",
    "created": 1726717164,
    "model": "/home/midge/.cache/huggingface/hub/models--QuantFactory--Meta-Llama-3-8B-Instruct-GGUF-v2/snapshots/94f17b2f2d72645fce9555f0395954a34db24e1e/./Meta-Llama-3-8B-Instruct-v2.Q6_K.gguf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Matey! We've had a swashbucklin' conversation so far! We've discussed Large Language Models (LLMs), which be giant computer programs that can understand and generate human-like language. And now, we've set sail for the RAG concept, which be a fascinating topic in the realm of LLMs! But, I be forgettin'... we haven't actually discussed RAG yet, have we? Arrr, let's get to it, then!"
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 82,
        "compl

In [11]:
print(verification_output['choices'][0]['message']['content'])

Matey! We've had a swashbucklin' conversation so far! We've discussed Large Language Models (LLMs), which be giant computer programs that can understand and generate human-like language. And now, we've set sail for the RAG concept, which be a fascinating topic in the realm of LLMs! But, I be forgettin'... we haven't actually discussed RAG yet, have we? Arrr, let's get to it, then!
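
The verification reply above claims that RAG hasn't actually been discussed yet, and the prompt token counts stay small (68, then 82). Both observations suggest that only the user messages were appended to `chat`, so the model never saw its own earlier answers. A minimal sketch of one way to address this is to append each assistant reply to the history before asking the next question. The helper name `ask` is hypothetical; it relies only on the `create_chat_completion(messages=...)` call and the response structure already used in this notebook.

```python
def ask(llm, chat, prompt):
    """Send a user prompt, record the assistant's reply in chat, and return it."""
    # Add the new user message to the running conversation
    chat.append({"role": "user", "content": prompt})

    # Ask the model for a reply based on the full history so far
    response = llm.create_chat_completion(messages=chat)
    reply = response["choices"][0]["message"]

    # Store the assistant's reply so later turns can refer back to it
    chat.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]


# Example usage (assumes `llm` was loaded as in the cells above):
# chat = [{"role": "system", "content": "You are a helpful assistant."}]
# print(ask(llm, chat, "Tell me what a LLM is."))
# print(ask(llm, chat, "What have we talked about so far?"))
```

Because every turn resends the whole history, the prompt grows with each exchange, so long conversations will eventually approach the `n_ctx=8192` context size set when the model was loaded.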
