# *LLM Performance Comparison*
---
**Running in Google Colab**

- Runtime -> Change runtime type.
- Choose a GPU runtime (with at least a T4 GPU, probably an A100 if you're comparing two models or more).
- You'll need 60 GB of disk space to download the full weights of Llama 7B, 13B and Mistral 7B.
- Run all cells.

---
Prepared by Trelis Research.

Find Trelis on [Github](https://github.com/TrelisResearch), [HuggingFace](https://huggingface.co/Trelis) and [YouTube](https://www.youtube.com/@TrelisResearch).



#### HuggingFace Login (optional)
- You don't need this if you are using Trelis Function Calling Llama 2 7B, which is public.
- You do need this to access private/gated repositories.

In [1]:
!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()

[0m

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

#### Google Drive Mounting (optional, but recommended)
This saves you time the next time you load the model.

If you don't use it, remove cache_dir from the model and tokeniser below.

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')


In [2]:
# import os

# # This is the path to the Google Drive folder.
# drive_path = "/content/drive"

# # This is the path where you want to store your cache.
# cache_dir_path = os.path.join(drive_path, "My Drive/huggingface_cache")

# # Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# # If not, set cache_dir to None to use the default Hugging Face cache location.
# if os.path.exists(drive_path):
#     cache_dir = cache_dir_path
#     os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
# else:
#     cache_dir = None

In [3]:
# # https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
# import locale
# def getpreferredencoding(do_setlocale = True):
#     return "UTF-8"
# locale.getpreferredencoding = getpreferredencoding

### Jupyter Mounting (not Google Colab)

In [1]:
cache_dir=''

# Setup and Install
- It's best to run Llama models on a GPU, which you can do using a free Colab notebook.
- Check the Google Colab runtime to the top right corner.
- Or, go to the menu -> Runtime -> Change Runtime Type.
- Select GPU (T4).

In [2]:
### DEFINE THE HUGGING SPACE MODEL

## Model A
# model_name_A = "Trelis/Llama-2-7b-chat-hf-32k"
# model_name_A = "Yukang/Llama-2-7b-longlora-100k-ft"
model_name_A = "Yukang/LongAlpaca-13B"

# ## Model B
model_name_B = "Trelis/Llama-2-7b-chat-hf-32k"
# model_name_B = "Yukang/Llama-2-13b-chat-longlora-32k-sft"

# ## Model C
# model_name_C = "mistralai/Mistral-7B-Instruct-v0.1"

# # 1.1B model
# model_id = "PY007/TinyLlama-1.1B-Chat-v0.1"

### Install

In [7]:
!python -m pip install --upgrade pip
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U torch
!pip install -q -U xformers
!pip install -q -U scipy

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 2.1.0 which is incompatible.
torchvision 0.15.2+cu118 requires torch==2.0.1, but you have torch 2.1.0 which is incompatible.
xformers 0.0.22 requires torch==2.0.1, but you have torch 2.1.0 which is incompatible.[0m[31m
[0m

### Import

In [4]:
import transformers
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer

## Load Model
This can take 5 mins, which is why connecting Google Drive for caching is recommended. The next time you run, it will be much faster because your model will only need to load checkpoint shards rather than the full model from HuggingFace.

In [5]:
# Load the model in 4-bit to allow it to fit in a free Google Colab runtime with a CPU and T4 GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True, #adds speed with minimal loss of quality.
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

## Model A
model_A = AutoModelForCausalLM.from_pretrained(
    model_name_A,
    quantization_config=bnb_config,
    device_map='auto',
    trust_remote_code=True,
    cache_dir=cache_dir)

# model_A.config.rope_scaling = {"type": "linear", "factor": 8}

# ## Model B
# model_B = AutoModelForCausalLM.from_pretrained(
#     model_name_B,
#     quantization_config=bnb_config,
#     device_map='auto',
#     trust_remote_code=True,
#     cache_dir=cache_dir)

# ## Model C
# model_C = AutoModelForCausalLM.from_pretrained(
#     model_name_C,
#     quantization_config=bnb_config,
#     device_map='auto',
#     trust_remote_code=True,
#     cache_dir=cache_dir)

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [6]:
# !pip install -q -U git+https://github.com/huggingface/peft.git

# from peft import PeftModel

# adapter_model = "Trelis/Llama-2-7b-chat-hf-function-calling-adapters-v2"

# # load perf model with new adapters
# model = PeftModel.from_pretrained(
#     model,
#     adapter_model,
# )

## Set up the Tokenizers

In [15]:
# Model A
tokenizer_A = AutoTokenizer.from_pretrained(model_name_A, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available

# # Model B
# tokenizer_B = AutoTokenizer.from_pretrained(model_name_B, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available

# # Model C
# tokenizer_C = AutoTokenizer.from_pretrained(model_name_C, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available

ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

# Inference (Simple stream)



In [9]:
from IPython.display import display, HTML

# Define a stream *without* function calling capabilities
def generate(model, tokenizer, user_prompt):
    system_prompt = ''

    # # Guanaco style for TinyLlama
    # B_INST, E_INST = "### Human:", "### Assistant:"
    # B_SYS, E_SYS = "\n", "\n\n"

    # # Llama style (with no system message)
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "", ""
    
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"

    inputs = tokenizer([prompt], return_tensors="pt").to('cuda')
    shape = inputs.input_ids.shape
    print(f"Length of input is {shape[1]}")
    result = model.generate(**inputs, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    
    # Decode the generated text back to readable string
    result_str = tokenizer.decode(result[0], skip_special_tokens=True)
    
    return result_str

In [10]:
prompt = 'List the planets in our solar system. Respond only with the list of planets.'

display(HTML(f"<b>{model_name_A}:</b><br>"))
result = generate(model_A,tokenizer_A,prompt)
print(result)

# display(HTML(f"<br><b>{model_name_B}:</b><br>"))
# result = generate(model_B,tokenizer_B,prompt)
# print(result)

# display(HTML(f"<br><b>{model_name_C}:</b><br>"))
# result = generate(model_C,tokenizer_C,prompt)
# print(result)

Length of input is 29
[INST] List the planets in our solar system. Respond only with the list of planets. [/INST]

Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune.


# Evaluation
Evaluation is done with three questions:
1. Return a sequence of letters in reverse
2. Passkey Retrieval
3. Code generation

## 1. Return a sequence in reverse

In [25]:
import random
import string

max_sequence_length = 20
initial_sequence = 'ab'

for model_name, model, tokenizer in [(model_name_A, model_A, tokenizer_A), 
                                     # (model_name_B, model_B, tokenizer_B),
                                     # (model_name_C, model_C, tokenizer_C)
                                    ]:
    display(HTML(f"<b>{model_name}:</b><br>"))
    sequence = str(initial_sequence)  # Explicitly cast to string
    for i in range(max_sequence_length - len(str(initial_sequence)) + 1):  # Explicitly cast to string
        prompt = f'Respond with the following sequence in reverse: {sequence}'
        result = generate(model, tokenizer, prompt)

        joined_result = ''.join(result.split())
        
        # Check if the result contains the reversed sequence
        if str(sequence)[::-1] in joined_result:  # Explicitly cast to string
            print(f"{result}\nSuccess for sequence of length {i+2}: {sequence}.\n\n")
        else:
            print(f"{result}\nFailure for sequence of length {i+2}: {sequence}.\n\n")
            break
        
        # Extend the sequence by adding a random alphanumeric character
        random_char = random.choice(string.ascii_letters + string.digits)
        sequence = str(sequence) + random_char  # Explicitly cast to string

Length of input is 20
[INST] Respond with the following sequence in reverse: ab [/INST]

The reverse of "ab" is "ba".
Success for sequence of length 2: ab.


Length of input is 21
[INST] Respond with the following sequence in reverse: ab1 [/INST]

The reverse of "ab1" is "ba1".
Failure for sequence of length 3: ab1.




## 2. Code Generation

In [26]:
display(HTML(f"<b>{model_name_A}:</b><br>"))
n = 10
prompt = f'Respond directly with a snippet of python code that prints the first {n} numbers in the Fibonacci series.'

result = generate(model_A,tokenizer_A,prompt)
print(result)

# display(HTML(f"<br><b>{model_name_B}:</b><br>"))
# result = generate(model_B,tokenizer_B,prompt)
# print(result)

# display(HTML(f"<br><b>{model_name_C}:</b><br>"))
# result = generate(model_C,tokenizer_C,prompt)
# print(result)

Length of input is 35
[INST] Respond directly with a snippet of python code that prints the first 10 numbers in the Fibonacci series. [/INST]

```
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))
```
This code defines a function `fibonacci` that takes an integer `n` as input and returns the `n`th number in the Fibonacci series. The function uses recursion to calculate the Fibonacci numbers, with the base case being `n <= 1`, in which case the function returns the value of `n` directly. Otherwise, the function calls itself with `n-1` and `n-


In [27]:
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

print(fibonacci(10))

55


## 3. Passkey Retrieval

In [14]:
passkey = 'passkey-u89dsnakj8'
text_file = 'berkshire23.txt'
n = 5
len_limit = int(16000*0.5)

# Read the text from the file
with open(text_file, 'r') as file:
    text = file.read()

# Split the text into words
words = text.split()

# Insert the passkey after the nth word
words.insert(n, passkey)

# Join back into a string
modified_text = ' '.join(words)

# Truncate to 'len_limit' characters
modified_text = modified_text[:len_limit]

# Test with Model A
display(HTML(f"<b>{model_name_A}:</b><br>"))
prompt = f'{modified_text}\n\nRespond with the passkey contained within the above text.'
# prompt = f'{modified_text}\n\nRespond only with a concise summary of the above text.'

result = generate(model_A, tokenizer_A, prompt)  # Replace with your actual generate function
print(result)

# # Test with Model B
# display(HTML(f"<br><b>{model_name_B}:</b><br>"))
# result = generate(model_B, tokenizer_B, prompt)  # Replace with your actual generate function
# print(result)

# # Test with Model C
# display(HTML(f"<br><b>{model_name_C}:</b><br>"))
# result = generate(model_C, tokenizer_C, prompt)  # Replace with your actual generate function
# print(result)

Length of input is 1915
[INST] we are here live in passkey-u89dsnakj8 Omaha Nebraska good morning everybody I'm Becky quick along with Mike santoli and in just 30 minutes time Berkshire Hathaway chairman and CEO Warren Buffett's going to be taking the stage with his vice chair Charlie Munger the legendary duo will also be joined by berkshire's two other Vice chairs Greg Abel who manages the non-insurance operations for the company and Ajit Jain who runs all of the insurance businesses and as always it's pretty big crowd here lots and lots of people and a few people you might notice too Tim Cook is here Apple of course is still berkshire's largest holding big big part of its portfolio there you see him backstage getting ready to go out and take his seat he gets to sit down in the special seats by the way that's Debbie pasonic Warren's assistant who's standing by just went bite beside him also in the crowd Bill Murray he has been here for a couple of days been hanging around you can chec

In [None]:
print(model_A.config)