# On-device optimisation
Hello! I'm trying out optimising LLMs on my MacBook Pro M1.

The steps are:
1. Using Hugging Face `transformers`, find the largest Qwen2.5 model that runs on my MacBook without quantisation, timing each model size on the same standard prompt.
2. Optimise the deployment.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

import time

In [2]:
prompt = '''
 Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option.

A) 5 B) 11 C) 0 D) 20
'''

In [3]:
model_name = "Qwen/Qwen2.5-7B-Instruct"

In [4]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="mps"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Loading checkpoint shards: 100%|██████████| 4/4 [00:19<00:00,  4.75s/it]


In [5]:
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

In [6]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

In [7]:
time_to_run_list = []

# Time five full generations; perf_counter() is a monotonic clock meant for measuring intervals.
for _ in range(5):
    model_start_time = time.perf_counter()

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )

    model_end_time = time.perf_counter()

    time_to_run = model_end_time - model_start_time
    time_to_run_list.append(time_to_run)


KeyboardInterrupt: 

In [None]:
print(f"LLM text generation took an average of {sum(time_to_run_list)/len(time_to_run_list):.2f} seconds")

LLM text generation took an average of 73.11 seconds
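
An average of wall-clock seconds can hide two things: the first run can include one-off warm-up cost, and runs may stop after different numbers of tokens. Below is a minimal sketch of a harness that discards warm-up runs and also reports tokens per second. The names `benchmark`, `summarise` and `fake_generate` are hypothetical, and `fake_generate` is only a stand-in so the sketch runs without loading a model.

```python
import time
from statistics import mean


def benchmark(generate_fn, n_runs=5, warmup=1):
    """Time generate_fn and return (seconds, new_tokens) for each measured run.

    generate_fn is a hypothetical zero-argument callable that runs one
    generation and returns the number of newly generated tokens.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up runs are executed but not recorded

    results = []
    for _ in range(n_runs):
        start = time.perf_counter()
        new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        results.append((elapsed, new_tokens))
    return results


def summarise(results):
    """Return (mean seconds per run, overall tokens per second)."""
    total_seconds = sum(seconds for seconds, _ in results)
    total_tokens = sum(tokens for _, tokens in results)
    return mean(seconds for seconds, _ in results), total_tokens / total_seconds


def fake_generate():
    """Stand-in for model.generate so the sketch runs without a model."""
    time.sleep(0.01)
    return 32


results = benchmark(fake_generate, n_runs=3, warmup=1)
mean_seconds, tokens_per_second = summarise(results)
print(f"{mean_seconds:.3f} s per run, {tokens_per_second:.0f} tokens/s")
```

With the objects in this notebook, `generate_fn` could wrap `model.generate(**model_inputs, max_new_tokens=512)` and return `generated_ids.shape[1] - model_inputs.input_ids.shape[1]` as the new-token count.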


In [None]:
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
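
The list comprehension above relies on `generate` returning each prompt's tokens followed by the newly generated ones, so slicing off the first `len(input_ids)` entries leaves only the response. A toy illustration with plain lists (the token IDs are invented):

```python
# Toy token IDs, invented for illustration; real values come from the tokenizer.
prompt_batch = [[101, 7, 8, 9]]            # one prompt of four tokens
output_batch = [[101, 7, 8, 9, 42, 43]]    # same prompt followed by two new tokens

# Same slicing pattern as the notebook cell, applied row by row.
new_tokens_only = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(prompt_batch, output_batch)
]
print(new_tokens_only)  # [[42, 43]]
```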

In [None]:
response

"To solve this problem, we need to calculate the total number of ice cubes placed in the pan over the three minutes and use the given average number of ice cubes per minute.\n\nHere's the breakdown:\n- In the first minute, 4 ice cubes were placed.\n- In the second minute, 5 ice cubes were placed.\n- In the third minute, an unknown number of ice cubes were placed, let's call this number \\( x \\).\n- In the fourth minute, no ice cubes were placed.\n\nThe average number of ice cubes per minute over the three minutes is given as 5. The formula for the average is:\n\n\\[\n\\text{Average} = \\frac{\\text{Total number of ice cubes}}{\\text{Number of minutes}}\n\\]\n\nGiven that the average is 5 over 3 minutes:\n\n\\[\n5 = \\frac{4 + 5 + x}{3}\n\\]\n\nSolving for \\( x \\):\n\n\\[\n15 = 4 + 5 + x\n\\]\n\\[\n15 = 9 + x\n\\]\n\\[\nx = 6\n\\]\n\nSo, 6 ice cubes were placed in the third minute. \n\nNow, let's sum up the total number of ice cubes in the pan at the end of the third minute:\n\n\\[\n

## Log
Timings from the loop above, by model size:
- 0.5B: 22.11s
- 1.5B: 44.59s
- 3B: 73.11s
- 7B: still running after 38 minutes of the 5-run loop, so I cancelled it.

I'll stick with 3B as the model to optimise.
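
A possible reason the 7B run stalled is memory pressure, since the weights alone need roughly parameters × bytes per parameter. The sketch below is a back-of-envelope estimate only. It assumes 16-bit weights (2 bytes per parameter) and uses the nominal sizes from the model names; the real parameter counts differ slightly, and activations and the KV cache add more on top. The machine's RAM isn't recorded here, so the numbers are meant for comparison against whatever your own machine has.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights (bf16/fp16)

# Nominal sizes taken from the model names, in billions of parameters.
nominal_params_b = {"0.5B": 0.5, "1.5B": 1.5, "3B": 3.0, "7B": 7.0}


def weight_memory_gib(params_billion, bytes_per_param=BYTES_PER_PARAM):
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


estimates = {name: weight_memory_gib(size) for name, size in nominal_params_b.items()}
for name, gib in estimates.items():
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

Under these assumptions the 7B weights come to about 13 GiB, against roughly 5.6 GiB for 3B.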