# *LLM Performance Comparison*
---
**Running in Google Colab**

- Runtime -> Change runtime type.
- Choose a GPU runtime (at least a T4; an A100 is advisable if you're comparing two or more models).
- You'll need 60 GB of disk space to download the full weights of Llama 7B, 13B and Mistral 7B.
- Run all cells.
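
Before downloading weights, it can help to confirm the runtime has enough free disk. The snippet below is a minimal sketch using only the standard library; `REQUIRED_GB`, `free_disk_gb` and `has_enough_disk` are illustrative names, not part of the original notebook, and the 60 GB figure is simply the one quoted in the list above.

```python
import shutil

# Figure quoted above for Llama 7B, 13B and Mistral 7B (an assumption carried over from the text).
REQUIRED_GB = 60


def free_disk_gb(path="."):
    """Return the free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path=".", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` gigabytes are free at `path`."""
    return free_disk_gb(path) >= required_gb


print(f"Free disk: {free_disk_gb():.1f} GB; enough for {REQUIRED_GB} GB: {has_enough_disk()}")
```

On Colab the default working directory is on the runtime's local disk, so the default `path="."` is usually the relevant one; pass the Google Drive mount path instead if you cache there.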

---
Prepared by Trelis Research.

Find Trelis on [Github](https://github.com/TrelisResearch), [HuggingFace](https://huggingface.co/Trelis) and [YouTube](https://www.youtube.com/@TrelisResearch).



#### HuggingFace Login (optional)
- You don't need this if you are using Trelis Function Calling Llama 2 7B, which is public.
- You do need this to access private/gated repositories.

In [1]:
# !pip install huggingface_hub
# from huggingface_hub import notebook_login

# notebook_login()

#### Google Drive Mounting (optional, but recommended)
This saves you time the next time you load the model.

If you don't use it, remove cache_dir from the model and tokeniser below.

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')


In [3]:
# import os

# # This is the path to the Google Drive folder.
# drive_path = "/content/drive"

# # This is the path where you want to store your cache.
# cache_dir_path = os.path.join(drive_path, "My Drive/huggingface_cache")

# # Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# # If not, set cache_dir to None to use the default Hugging Face cache location.
# if os.path.exists(drive_path):
#     cache_dir = cache_dir_path
#     os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
# else:
#     cache_dir = None

In [4]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

### Jupyter Mounting (not Google Colab)

In [5]:
cache_dir=''
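
Since `cache_dir` may be an empty string (as above), `None`, or a Google Drive path, one way to avoid editing the loading calls each time is to build the keyword argument conditionally. This is a sketch only: `cache_kwargs` is a hypothetical helper, and while `AutoTokenizer.from_pretrained` accepts `cache_dir`, whether the AWQ model loader used later accepts it too should be checked against the installed version.

```python
def cache_kwargs(cache_dir):
    """Return {'cache_dir': ...} only when a cache directory is actually set."""
    return {"cache_dir": cache_dir} if cache_dir else {}


# Hypothetical usage (commented out because it downloads files):
# tokenizer = AutoTokenizer.from_pretrained(model_name, **cache_kwargs(cache_dir))

print(cache_kwargs(""))
print(cache_kwargs("/content/drive/My Drive/huggingface_cache"))
```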

# Setup and Install
- It's best to run Llama models on a GPU, which you can do using a free Colab notebook.
- Check the Google Colab runtime in the top right corner.
- Or, go to the menu -> Runtime -> Change Runtime Type.
- Select GPU (T4).

In [6]:
# Set the device to CPU ("cpu") or GPU ("cuda:0"). fLlama 7B (or 13B) needs more RAM than a free or Pro Colab notebook offers on CPU alone, so use "cuda:0".
runtimeFlag = "cuda:0"
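
If you are unsure whether a GPU is attached, the flag can be derived rather than hard-coded. The sketch below assumes PyTorch is installed (it is a dependency of the libraries used later); `pick_device` is an illustrative helper, not part of the original notebook. Falling back to `"cpu"` is only meant to surface a missing GPU early, since the models here are not expected to run on CPU alone.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if PyTorch reports a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used from Python here.
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


runtimeFlag = pick_device()
print(runtimeFlag)
```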

In [7]:
### DEFINE THE HUGGING FACE MODEL

# Dictionary to hold model names
models_dict = {
    # "A": "TheBloke/Llama-2-7B-chat-AWQ",
    "A": "TheBloke/Llama-2-70B-chat-AWQ",
    # "B": "TheBloke/Llama-2-13B-chat-AWQ",
    # "C": "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    # Add more models here
}

# Dictionary to hold the initialized models and tokenizers
initialized_models = {}
initialized_tokenizers = {}

# # 1.1B model
# model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"

### Install

In [8]:
!pip install autoawq

Collecting autoawq
  Downloading autoawq-0.1.1-cp310-cp310-manylinux2014_x86_64.whl (17.4 MB)
Collecting transformers>=4.32.0 (from autoawq)
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting tokenizers>=0.12.1 (from autoawq)
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting accelerate (from autoawq)
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Collecting sen

In [9]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-yi669jvt
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-yi669jvt
  Resolved https://github.com/huggingface/transformers.git to commit 0b192de1f353b0e04dad4813e02e2c672de077be
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers<0.15,>=0.14 (from transformers==4.34.0.dev0)
  Using cached tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers==4.34.0.dev0)
  Using cached huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wh

### Import

In [10]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

## Load Model
This can take about 5 minutes, which is why connecting Google Drive for caching is recommended if using Colab. The next run will be much faster because the model only needs to load checkpoint shards from the cache rather than download the full model from HuggingFace.

In [11]:
# Initialize models and tokenizers
for key in sorted(models_dict.keys()):  # Sorting keys to go from A upwards through the alphabet
    model_name = models_dict[key]
    
    model = AutoAWQForCausalLM.from_quantized(
        model_name, 
        fuse_layers=True,
        trust_remote_code=False, 
        safetensors=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, 
        trust_remote_code=False
    )

    # Store in the initialized dictionaries
    initialized_models[f"model_{key}"] = model
    initialized_tokenizers[f"tokenizer_{key}"] = tokenizer

Replacing layers...: 100%|██████████| 80/80 [00:13<00:00,  6.04it/s]
Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.


# Inference (Simple stream)



In [12]:
from IPython.display import display, HTML

# Define a stream *without* function calling capabilities
def generate(model, tokenizer, user_prompt):
    system_prompt = ''

    # # Guanaco style for TinyLlama
    # B_INST, E_INST = "### Human:", "### Assistant:"
    # B_SYS, E_SYS = "\n", "\n\n"

    # # Llama style (with no system message)
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "", ""
    
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    # Move the input ids to the device chosen in runtimeFlag (e.g. "cuda:0")
    inputs = tokenizer([prompt], return_tensors="pt").input_ids.to(runtimeFlag)
    result = model.generate(inputs, max_new_tokens=150, do_sample=False)
    
    # Decode the generated text back to readable string
    result_str = tokenizer.decode(result[0], skip_special_tokens=True)
    
    return result_str
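
The prompt template used in `generate` can be isolated and inspected without loading a model. The sketch below reproduces the Llama-style format from the function above; `build_prompt` and `strip_prompt_tokens` are illustrative helpers, not part of the original notebook. It also rests on an assumption worth checking in your setup: for decoder-only models, the generated output normally begins with the prompt tokens, so decoding `result[0]` returns the prompt followed by the answer.

```python
def build_prompt(user_prompt, system_prompt=""):
    """Build a Llama-style prompt with no system message wrapper."""
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "", ""
    return f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"


def strip_prompt_tokens(output_ids, prompt_len):
    """Drop the first `prompt_len` token ids, leaving only the newly generated ones."""
    return output_ids[prompt_len:]


print(repr(build_prompt("  List the planets. ")))
# prints '[INST] List the planets. [/INST]\n\n'

# Toy token ids: three prompt tokens followed by two generated tokens.
print(strip_prompt_tokens([1, 2, 3, 40, 41], prompt_len=3))
# prints [40, 41]
```

With real tensors, the equivalent slice would be `result[0][inputs.shape[1]:]` before decoding, which keeps the echoed prompt out of any automated checks on the answer.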

In [13]:
prompt = 'List the planets in our solar system. Respond only with the list of planets.'

# Loop through the model keys in alphabetical order (A, B, C, ...)
for key in sorted(models_dict.keys()):
    model_name = models_dict[key]
    model = initialized_models[f"model_{key}"]
    tokenizer = initialized_tokenizers[f"tokenizer_{key}"]

    display(HTML(f"<b>{model_name}:</b><br>"))
    
    result = generate(model, tokenizer, prompt)
    
    print(result)

TypeError: QuantAttentionFused.forward() got an unexpected keyword argument 'padding_mask'

# Evaluation
Evaluation is done with three tasks:
1. Return a sequence of characters in reverse
2. Code generation
3. Passkey retrieval

## 1. Return a sequence in reverse

In [None]:
import random
import string

max_sequence_length = 20
initial_sequence = 'ab'

# Iterate over every model loaded earlier, in alphabetical key order (A, B, C, ...)
for key in sorted(models_dict.keys()):
    model_name = models_dict[key]
    model = initialized_models[f"model_{key}"]
    tokenizer = initialized_tokenizers[f"tokenizer_{key}"]
    display(HTML(f"<b>{model_name}:</b><br>"))
    sequence = str(initial_sequence)  # Explicitly cast to string
    for i in range(max_sequence_length - len(str(initial_sequence)) + 1):  # Explicitly cast to string
        prompt = f'Respond with the following sequence in reverse: {sequence}'
        result = generate(model, tokenizer, prompt)

        joined_result = ''.join(result.split())
        
        # Check if the result contains the reversed sequence
        if str(sequence)[::-1] in joined_result:  # Explicitly cast to string
            print(f"{result}\nSuccess for sequence of length {i+2}: {sequence}.\n\n")
        else:
            print(f"{result}\nFailure for sequence of length {i+2}: {sequence}.\n\n")
            break
        
        # Extend the sequence by adding a random alphanumeric character
        random_char = random.choice(string.ascii_letters + string.digits)
        sequence = str(sequence) + random_char  # Explicitly cast to string
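
The loop above can be tried without a GPU by swapping the model for a stand-in. The following is a minimal sketch of the same test as a reusable function; `reverse_test` and `fake_generate` are hypothetical names, and `fake_generate` is a toy replacement for a real model that reverses correctly only up to a fixed length.

```python
import random
import string


def reverse_test(generate_fn, initial_sequence="ab", max_sequence_length=20, seed=0):
    """Return the longest sequence length that `generate_fn` reversed correctly."""
    rng = random.Random(seed)  # seeded so repeated runs use the same sequences
    sequence = initial_sequence
    longest_ok = 0
    while len(sequence) <= max_sequence_length:
        reply = generate_fn(f"Respond with the following sequence in reverse: {sequence}")
        # Ignore whitespace in the reply, as the notebook loop does.
        if sequence[::-1] not in "".join(reply.split()):
            break
        longest_ok = len(sequence)
        sequence += rng.choice(string.ascii_letters + string.digits)
    return longest_ok


def fake_generate(prompt, limit=5):
    """Stand-in for a model: reverses correctly up to `limit` characters."""
    sequence = prompt.rsplit(": ", 1)[1]
    return sequence[::-1] if len(sequence) <= limit else "I cannot do that."


print(reverse_test(fake_generate))  # prints 5
```

To use it with a loaded model, pass something like `lambda p: generate(model, tokenizer, p)` as `generate_fn`. Seeding the random generator also means every model is tested on the same sequences, which makes the comparison fairer.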

## 2. Code Generation

In [None]:
n = 10
prompt = f'Respond directly with a snippet of python code that prints the first {n} numbers in the Fibonacci series.'

# Run the same prompt through every model loaded earlier
for key in sorted(models_dict.keys()):
    model_name = models_dict[key]
    display(HTML(f"<br><b>{model_name}:</b><br>"))
    result = generate(initialized_models[f"model_{key}"], initialized_tokenizers[f"tokenizer_{key}"], prompt)
    print(result)
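
Model replies often wrap code in a fenced block with surrounding chatter, so comparing outputs is easier after pulling the code out. The sketch below is one possible approach, not part of the original notebook: `extract_code` is a hypothetical helper and `sample_reply` is a made-up response. It only extracts the code; running model-generated code should be done with care.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

# Match an optional "python" tag after the opening fence, then capture up to the closing fence.
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply):
    """Return the first fenced code block in `reply`, or the whole reply if none is found."""
    match = CODE_BLOCK.search(reply)
    return match.group(1).strip() if match else reply.strip()


# Made-up reply, for illustration only.
sample_reply = (
    "Sure! Here is the code:\n"
    f"{FENCE}python\n"
    "a, b = 0, 1\n"
    "for _ in range(10):\n"
    "    print(a)\n"
    "    a, b = b, a + b\n"
    f"{FENCE}\n"
    "Hope that helps."
)

print(extract_code(sample_reply))
```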

## 3. Passkey Retrieval

In [None]:
passkey = 'passkey-u89dsnakj8'
text_file = 'berkshire23.txt'
n = 5
len_limit = 1000*4 # roughly 1,000 tokens of context, assuming about 4 characters per token

# Read the text from the file
with open(text_file, 'r') as file:
    text = file.read()

# Split the text into words
words = text.split()

# Insert the passkey after the nth word
words.insert(n, passkey)

# Join back into a string
modified_text = ' '.join(words)

# Truncate to 'len_limit' characters
modified_text = modified_text[:len_limit]

# Test every loaded model with the same prompt
prompt = f'{modified_text}\n\nRespond with the passkey contained within the above text.'

for key in sorted(models_dict.keys()):
    model_name = models_dict[key]
    display(HTML(f"<br><b>{model_name}:</b><br>"))
    result = generate(initialized_models[f"model_{key}"], initialized_tokenizers[f"tokenizer_{key}"], prompt)
    print(result)
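
The passkey test can also be scored automatically rather than by eye. One caveat: if the decoded output echoes the prompt, the passkey will always appear in it, so the check should look only at the text after the prompt. The sketch below assumes the Llama-style `[/INST]` marker survives decoding as plain text; `build_passkey_prompt` and `passkey_retrieved` are illustrative helpers, and the filler text stands in for the contents of the text file.

```python
def build_passkey_prompt(text, passkey, n=5, len_limit=4000):
    """Insert `passkey` after the nth word, truncate, and append the question."""
    words = text.split()
    words.insert(n, passkey)
    context = " ".join(words)[:len_limit]
    return f"{context}\n\nRespond with the passkey contained within the above text."


def passkey_retrieved(reply, passkey, end_marker="[/INST]"):
    """True if `passkey` appears in the part of `reply` after the last `end_marker`."""
    answer = reply.rsplit(end_marker, 1)[-1]
    return passkey in answer


passkey = "passkey-u89dsnakj8"
prompt = build_passkey_prompt("word " * 50, passkey)  # filler text for illustration

# Made-up replies that echo the prompt, as a decoded generation might.
echo_only = f"[INST] {prompt} [/INST]\n\nI could not find it."
good = f"[INST] {prompt} [/INST]\n\nThe passkey is {passkey}."

print(passkey_retrieved(echo_only, passkey))  # prints False
print(passkey_retrieved(good, passkey))       # prints True
```

Keeping `n` small relative to `len_limit` matters here: if the passkey were inserted beyond the truncation point, it would be cut from the context and no model could retrieve it.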