# Llama-2-7B-fp16 on HF Transformers: Benchmark Timing on MPS

512-token prompt, 128-token generation

Needs a lot of RAM (fp16). Tested on an M2 Max with 96 GB of RAM, where it used up to 87 GB. On the M2 Max, a run takes approx. 2 minutes!
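As a rough sanity check on the memory requirement, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. The sketch below assumes a nominal 7 billion parameters (taken from the "7B" in the model name, not an exact count) and covers weights only: activations, the KV cache, and framework overhead are ignored, so it is a lower bound and not a prediction of the observed peak. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Estimate weight storage in GiB (weights only, no activations or KV cache)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


N_PARAMS = 7e9  # assumption: nominal "7B"; the exact parameter count differs slightly

for dtype in ("float16", "float32"):
    print("{}: {:.1f} GiB".format(dtype, weight_memory_gib(N_PARAMS, dtype)))
```

Under these assumptions the weights alone come to about 13 GiB in fp16 and about 26 GiB in fp32, so most of a much larger peak would have to come from something other than a single copy of the weights.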

**Setup:**

In this notebook, we run the Llama-2-7B Chat model (fp16). [More details here.](https://open.substack.com/pub/kaitchup/p/run-llama-2-chat-models-on-your-computer?r=2kp66c&utm_campaign=post&utm_medium=web)


We need the following libraries installed:

*   Hugging Face Transformers
*   Accelerate (for device_map)

Run the following cell ONLY ONCE, then restart the kernel to free the memory used by code that is no longer needed.


In [None]:
%pip install transformers accelerate

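import torch
# is_built() only reports whether this PyTorch build includes MPS support;
# is_available() also checks that this machine can actually use it.
print("MPS is built: {}".format(torch.backends.mps.is_built()))
print("MPS is available: {}".format(torch.backends.mps.is_available()))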

Note that to run the following code, you must have been granted access to the Llama 2 weights and have an access token from Hugging Face. You can find instructions on the model card on the Hugging Face Hub: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf


In [1]:
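import time
import os
# Let PyTorch fall back to the CPU for operators that are not implemented on MPS.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
mps_device = "mps"
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "../models.llama.cpp/Llama-2-7b-chat-hf"   # local model
#prompt = "Tell me about gravity"
# Build a long prompt by repeating " hello"; the resulting token count is printed below.
prompt = " hello"
for _ in range(512 - 3):
    prompt += " hello"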

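# Time the loading of model and tokenizer.
# No torch_dtype is passed, so the weights load in the library's default dtype
# (float32 in Transformers 4.x); pass torch_dtype=torch.float16 to force fp16.
time_start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True, device_map=mps_device)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
time_loading = time.time() - time_start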

print("Generating ...")
time_start=time.time()
model_inputs = tokenizer(prompt, return_tensors="pt").to(mps_device)
output = model.generate(**model_inputs, max_new_tokens=128)
time_inference=time.time()-time_start

print(tokenizer.decode(output[0], skip_special_tokens=True))

print()
print("Loading time: {:.2f} s".format(time_loading))
print("Inference time: {:.2f} s".format(time_inference))
print("Tokens prompt: {}".format(len(model_inputs.input_ids[0])))
print("Tokens output (incl. prompt): {}".format(len(output[0])))
# Note: this rate counts prompt and generated tokens together.
print("Tokens per second: {:.2f}".format(len(output[0])/time_inference))
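
The tokens-per-second figure above divides prompt plus generated tokens by the total time, so it blends prompt processing and generation into one number. A minimal sketch of reporting the two counts separately follows. `throughput_report` is a hypothetical helper, and the values in the usage lines are made up for illustration; they are not measurements from this notebook.

```python
def throughput_report(n_prompt: int, n_total: int, seconds: float) -> dict:
    """Split a generate() run into prompt and newly generated token counts.

    n_prompt: tokens in the input, e.g. len(model_inputs.input_ids[0])
    n_total:  tokens in the output incl. prompt, e.g. len(output[0])
    seconds:  wall-clock time of the timed region
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    n_new = n_total - n_prompt
    return {
        "new_tokens": n_new,
        "total_tokens_per_s": n_total / seconds,
        "new_tokens_per_s": n_new / seconds,
    }


# Illustrative values only (not measured results).
report = throughput_report(n_prompt=512, n_total=640, seconds=64.0)
print(report)
```

Because the timed region also covers tokenization and prompt processing, `new_tokens_per_s` understates pure decoding speed. Timing a separate run with `max_new_tokens=1` would be one way to approximate the prompt phase on its own.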

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.86s/it]


Generating ...
 hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello hello 