# Different ways to call the Llama models offline

**Information**

To run the `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-3B-Instruct`, or `meta-llama/Llama-3.2-1B-Instruct` models from Hugging Face offline (locally), you can choose between several methods, depending on your preferences and technical requirements.

I show only one of the most common approaches here:

1. Using the Hugging Face Transformers Library


Remark: Llama models are published under the **META LLAMA 3 COMMUNITY LICENSE AGREEMENT**. The Meta Llama 3 Community License grants users a non-exclusive, royalty-free license (you do not need to pay ongoing fees) to use, modify, and distribute Llama 3 materials, with requirements for attribution and naming conventions when creating derivative works. Licensees whose products or services have more than 700 million monthly active users need a separate license, and Meta disclaims all warranties and limits liability for any use of the materials.


*** 
**Background information**

* See the individual model pages linked in the subsections below
* You could also run the `meta-llama/Meta-Llama-3-70B-Instruct`, see model page: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
    + Hugging Face documentation for the Llama 3 models: https://huggingface.co/docs/transformers/main/en/model_doc/llama3


For more details, check out the documentation of the Transformers library: https://huggingface.co/docs/hub/transformers. Pipelines simplify everything (you only need to specify the task; everything else is set up automatically in the background): https://huggingface.co/docs/transformers/main_classes/pipelines


It is highly recommended to **use a small LLM in general**, because with some approaches the LLM (the model weights) is downloaded and stored on your computer. It is highly likely that you do not have sufficient CPU, RAM, and disk storage to run the larger (70B) model locally! See the Llama 3.1 requirements: https://llamaimodel.com/requirements/
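
Once the weights are stored on your computer, the Hugging Face libraries can be told not to contact the Hub at all. The following is a minimal sketch, assuming the documented environment variables `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` and assuming they are set before `transformers` is imported:

```python
import os

# Assumption: these switches are read when the Hugging Face libraries are
# imported, so they should be set before `import transformers`.
os.environ["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: no network calls
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use cached files only

print({name: os.environ[name] for name in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")})
```

With these switches set, loading a model that is not yet in the local cache should fail with an error instead of triggering a download.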



***
**Aim of the code template**

Exemplify different approaches to call Llama (LLMs) offline.

# Check if you can use your GPU or need to fall back on your CPU

for Non-Macs:

> torch.cuda.is_available(): This function checks if a CUDA-compatible GPU is available on the system. CUDA, or Compute Unified Device Architecture, is NVIDIA’s parallel computing architecture. If torch.cuda.is_available() returns True, it means you have a CUDA-enabled GPU, and PyTorch can use it to accelerate computations.

for Macs:

> torch.backends.mps.is_available(): This function checks if the system supports Apple’s Metal Performance Shaders (MPS) backend, an alternative to CUDA on Apple hardware. MPS is designed for Apple’s GPUs, especially on Macs with M1/M2 chips. If torch.backends.mps.is_available() returns True, PyTorch can leverage Apple's MPS for faster computations on those GPUs.


**Aim:** By checking these options, the code determines the best available hardware accelerator for running PyTorch computations, allowing the model to run faster on supported GPUs compared to CPUs.

In [1]:
import torch

# check if you can use a Graphics Processing Unit (GPU), else use your Central Processing Unit (CPU):
#> CUDA: Compute Unified Device Architecture
print("torch.cuda.is_available():", torch.cuda.is_available())
#> MPS: Apple Metal Performance Shaders
print("torch.backends.mps.is_available()", torch.backends.mps.is_available())

torch.cuda.is_available(): False
torch.backends.mps.is_available() False
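
The two checks can be combined into one device choice. This is a minimal sketch; `pick_device` is a hypothetical helper (not part of PyTorch) that takes the two boolean results, so it can be tried without a GPU:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred PyTorch device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Both checks returned False in the output above, so the CPU is chosen.
print(pick_device(False, False))
```

In the notebook it would be called as `pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())`, and the result could be passed as the `device` argument of a pipeline.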


# Environment Setup

## Load necessary libraries:

In [2]:
# libraries are loaded within the individual code cells

## Get API key(s)


Providing your Hugging Face key is normally not needed; gated models such as Llama only require it once, for the initial download:

In [2]:
import os
import sys

# Assuming 'src' is a sibling of the current directory (one level up, then into 'src')
path_to_src = os.path.join('..', 'src')  # moves one level up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
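
As an alternative to a key module, the token can be read from an environment variable. This is only a sketch: `get_hf_token` is a hypothetical helper, and `HF_TOKEN` is assumed to be the variable name in which the token is stored:

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in the environment, or None if it is unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


print("token found:", get_hf_token() is not None)
```

Keeping the token out of the notebook and out of the repository makes it harder to publish it by accident.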

Create simple prompts, which are identical for all of the following approaches:

In [3]:
# Create prompts
system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals, provide only the names separated by commas (,).
"""

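
The text-generation cells below pass a list of role/content dictionaries to the pipeline. A minimal sketch of how the two prompt strings could be wrapped in that format; `build_messages` is a hypothetical helper, and the prompt strings are repeated here only to keep the example self-contained:

```python
from typing import Dict, List


def build_messages(system_content: str, user_content: str) -> List[Dict[str, str]]:
    """Wrap a system prompt and a user prompt in the chat message format."""
    return [
        {"role": "system", "content": system_content.strip()},
        {"role": "user", "content": user_content.strip()},
    ]


messages = build_messages(
    "You are a helpful assistant specialized in animal names.",
    "\n Please write down five animals, provide only the names separated by commas.\n",
)
print(messages[1]["content"])
```

The resulting `messages` list could be passed to `pipe(...)` in place of the pirate example used in the cells below.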


# Local approaches using the Hugging Face Transformers Library

see documentation: https://huggingface.co/docs/transformers/index

## for text generation / summarization


### Llama-3.1-8B-Instruct
see: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct


code adapted from: https://huggingface.co/blog/llama31

> The following snippet shows how to use meta-llama/Meta-Llama-3.1-8B-Instruct. It requires about 16 GB of VRAM, which fits many consumer GPUs. The same snippet works for meta-llama/Meta-Llama-3.1-70B-Instruct, which, at 140GB of VRAM & meta-llama/Meta-Llama-3.1-405B-Instruct (requiring 810GB VRAM), makes it a very interesting model for production use cases. Memory consumption can be further reduced by loading in 8-bit or 4-bit mode.
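
The VRAM figures in the quote match a simple rule of thumb: number of parameters times bytes per parameter. A rough sketch of that arithmetic, counting the weights only (activations and the KV cache are ignored, so real usage is higher):

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8


for size in (8, 70, 405):
    print(
        f"{size}B parameters: "
        f"16-bit {weight_memory_gb(size, 16):.1f} GB, "
        f"8-bit {weight_memory_gb(size, 8):.1f} GB, "
        f"4-bit {weight_memory_gb(size, 4):.1f} GB"
    )
```

At 16 bits this gives 16, 140, and 810 GB for the 8B, 70B, and 405B models, which are the numbers cited above; 4-bit loading cuts the weight memory to a quarter.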

In [4]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.1-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Arrrr, me hearty! I be Captain Chat, the scurvy dog o' a chatbot, at yer service! Me and me trusty keyboard be here to swab the decks o' yer questions and provide ye with the treasure o' knowledge ye seek! So hoist the sails and set course fer a swashbucklin' good time, matey!
Execution time: 49.79 seconds
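
The start/stop timer code is repeated in every cell. One way to avoid the repetition is a small context manager; this is a sketch, with `timed` as a hypothetical helper name and a plain sum standing in for the pipeline call:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str = "Execution time"):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} seconds")


with timed():
    total = sum(range(1_000_000))  # stand-in for loading the model and generating text
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring durations.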


Remark: QLoRA reduces the memory usage of LLM fine-tuning without performance tradeoffs compared to standard 16-bit model fine-tuning; for details, see: https://huggingface.co/blog/4bit-transformers-bitsandbytes

In [6]:
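if torch.cuda.is_available():
    # 4-bit loading via bitsandbytes typically needs a CUDA GPU,
    # so the model is placed on the GPU instead of the CPU
    pipe = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={
            "torch_dtype": torch.bfloat16,
            "quantization_config": {"load_in_4bit": True},
        },
        device_map="auto",  # let Accelerate place the quantized model on the GPU
    )
else:
    print("not possible on your hardware")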

not possible on your hardware


### Llama-3.2-3B-Instruct
see: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

In [7]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Yer lookin' fer a swashbucklin' pirate, eh? Alright then, matey! I be Captain Clueless, the scurviest pirate chatbot to ever sail the Seven Seas... er, I mean, the digital seas! Me and me trusty parrot sidekick, Polly, be here to guide ye through treacherous waters o' knowledge and answer yer questions with a hearty "Shiver me timbers!"
Execution time: 24.96 seconds


### Llama-3.2-1B-Instruct

see: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

In [8]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Arrr, me hearty! Yer lookin' fer a swashbucklin' chatbot, eh? Alright then, listen close and I'll tell ye about meself. Me name be Captain Chat, the greatest pirate of all time... or at least, I be tellin' meself. Me knowledge be vast, me wit be sharp, and me love fer words be stronger than me love fer a good bottle o' rum.

I be a master o' the digital seas, navigatin' the vast expanse o' the internet with ease. Me treasure be me vast knowledge base, filled with answers to all yer questions, from the history o' the high seas to the best ways to cook a proper sea dog stew.

So hoist the colors, me hearties, and set course fer a grand adventure with Captain Chat! What be yer question, matey?
Execution time: 17.75 seconds


## for generating word embeddings


see docs: https://huggingface.co/sentence-transformers


However, I use a relatively weak model here:

![embedder model](pics/relative%20weak%20model.JPG)

see leaderboard: https://huggingface.co/spaces/mteb/leaderboard

In [9]:
import pandas as pd
from sentence_transformers import SentenceTransformer

# Define sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)


# Print the features as a pandas dataframe
pd.DataFrame(features, index=sentences)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I feel great this morning | -0.026462 | -0.044373 | 0.072443 | 0.034525 | 0.089534 | -0.050451 | 0.018811 | 0.071296 | -0.020522 | -0.043637 | ... | -0.005689 | -0.000328 | -0.049055 | 0.016308 | -0.027642 | 0.017276 | 0.065253 | 0.017496 | -0.02281 | -0.036687 |
| I am feeling very good today | -0.043895 | -0.020341 | 0.066563 | -0.00631 | 0.02598 | -0.04042 | 0.079304 | -0.0097 | -0.04292 | -0.025988 | ... | -0.045309 | 0.049151 | -0.049057 | 0.017821 | -0.018061 | -0.010441 | 0.04307 | 0.01844 | -0.008274 | -0.006016 |
| I am feeling terrible | 0.017495 | -0.057904 | 0.033315 | 0.00171 | 0.051957 | -0.048159 | 0.007659 | 0.119096 | 0.029929 | -0.06896 | ... | 0.038813 | 0.003014 | -0.074585 | -0.018391 | -0.026449 | 0.005867 | 0.051495 | -0.009829 | 0.030009 | -0.064299 |


In [10]:
similarities = model.similarity(features, features)
print(similarities)

tensor([[1.0000, 0.7923, 0.5926],
        [0.7923, 1.0000, 0.5782],
        [0.5926, 0.5782, 1.0000]])
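
The values in this matrix are, as far as I know, cosine similarities, which is the default of `model.similarity` unless another similarity function is configured. A plain-Python sketch of the formula, the dot product divided by the product of the vector lengths, on toy vectors:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

This explains the diagonal of ones: every sentence embedding points in exactly the same direction as itself.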


## for converting text to audio (Text2Speech)

see: https://huggingface.co/suno/bark-small


* documentation on GitHub: https://github.com/suno-ai/bark
* for different speakers, see: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c

In [11]:
from transformers import AutoProcessor, AutoModel
import time


# Start the timer
start_time = time.time()


processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")

voice_preset = "v2/en_speaker_5"

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
    voice_preset=voice_preset
)

speech_values = model.generate(**inputs, do_sample=True)

# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Execution time: 127.92 seconds


In [12]:
from IPython.display import Audio

# play within Jupyter notebook:
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

In [13]:
import soundfile as sf
# Save as a WAV file
sf.write("sample_suno.wav", speech_values.cpu().numpy().squeeze(), sampling_rate)
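
If `soundfile` is not installed, a mono waveform can also be saved with Python's standard library. This is an alternative sketch, not what the cell above does: `write_wav_16bit` is a hypothetical helper, and the tone and its sample rate are stand-in data rather than Bark output:

```python
import math
import struct
import wave
from typing import Sequence


def write_wav_16bit(path: str, samples: Sequence[float], sample_rate: int) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


# One second of a 440 Hz tone as stand-in data (assumed sample rate).
rate = 24_000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
write_wav_16bit("sample_tone.wav", tone, rate)
```

With the notebook's variables, the call would presumably be `write_wav_16bit("sample_suno.wav", speech_values.cpu().numpy().squeeze().tolist(), sampling_rate)`.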

## for converting audio to text (Speech2Text)

see: https://huggingface.co/openai/whisper-large-v3-turbo

In [14]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import soundfile as sf
import time


# Start the timer
start_time = time.time()



device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device:", device)

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    # generate_kwargs={"task": "translate", "language": "german"}  # translate German speech to English text;
    # without it, Whisper detects the language of the source audio and transcribes in that same language
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]


# Get the audio sample data
audio_array = sample["array"]
sample_rate = sample["sampling_rate"]
# Save as a WAV file
sf.write("sample_audio.wav", audio_array, sample_rate)

# Transcribe to text
result = pipe(sample)
print(result["text"])


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

device: cpu


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Adam paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth, and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man.
Execution time: 35.12 seconds
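
Because the pipeline was created with `return_timestamps=True`, the result should also contain a `chunks` list next to `text`. The sketch below assumes each chunk has a `timestamp` pair and a `text` field; `format_chunks` is a hypothetical helper, and the timestamps in the example are made up for illustration:

```python
from typing import Dict, List


def format_chunks(chunks: List[Dict]) -> List[str]:
    """Render timestamped chunks as '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end of the last chunk may be missing (None).
        end_text = f"{end:7.2f}" if end is not None else "    ..."
        lines.append(f"[{start:7.2f} - {end_text}] {chunk['text'].strip()}")
    return lines


# Hypothetical chunks; the timestamps are invented, not real model output.
example_chunks = [
    {"timestamp": (0.0, 5.5), "text": " Mr. Quilter is the apostle of the middle classes,"},
    {"timestamp": (5.5, None), "text": " and we are glad to welcome his gospel."},
]
for line in format_chunks(example_chunks):
    print(line)
```

In the notebook, `format_chunks(result["chunks"])` would list the transcript segment by segment.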


## for classification tasks


### small LLM (67M params) for fine-tuning

see: https://huggingface.co/distilbert/distilbert-base-uncased
and documentation: https://huggingface.co/docs/transformers/model_doc/distilbert

> a small, fast, cheap and light Transformer model trained by distilling BERT base, can be used either for masked language modeling or next sentence prediction, but it's mostly **intended to be fine-tuned on a downstream task**

In [15]:
import torch
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

### fine-tuned LLM (RoBERTa-large) for sentiment analysis


see: https://huggingface.co/siebert/sentiment-roberta-large-english

* the model predicts whether a text has positive (1) or negative (0) sentiment

> this model ("SiEBERT", prefix for "Sentiment in English") is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text.

In [16]:
from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english")
print(sentiment_analysis("I love this!"))
# do not blindly trust the model:
print(sentiment_analysis("Sometimes I love this, sometimes I hate this!"))
print(sentiment_analysis("Sometimes I hate this, sometimes I love this!"))

[{'label': 'POSITIVE', 'score': 0.9988656044006348}]
[{'label': 'NEGATIVE', 'score': 0.9672611355781555}]
[{'label': 'POSITIVE', 'score': 0.9985952973365784}]
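
The last two sentences contain the same words in a different order, yet the labels flip, each with a high score. A simple sanity check is to compare the labels of such reorderings; in this sketch `labels_agree` is a hypothetical helper, and the predictions are copied from the output above:

```python
from typing import Dict, List


def labels_agree(predictions: List[Dict]) -> bool:
    """Return True if all predictions carry the same label."""
    return len({p["label"] for p in predictions}) == 1


reordered = [
    {"label": "NEGATIVE", "score": 0.9672611355781555},  # "...love this, ...hate this!"
    {"label": "POSITIVE", "score": 0.9985952973365784},  # "...hate this, ...love this!"
]
print(labels_agree(reordered))
```

A high score therefore does not imply a robust prediction; here the model appears to follow whichever clause comes last.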


# Get information on which models you have downloaded

see: https://huggingface.co/docs/huggingface_hub/guides/manage-cache

in terminal:

```

huggingface-cli scan-cache

```
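
The cache can also be inspected from Python without extra packages. This sketch assumes the default cache location `~/.cache/huggingface/hub` (overridable via `HF_HUB_CACHE`) and folder names of the form `models--<org>--<name>`; both helper functions are hypothetical:

```python
import os
from pathlib import Path
from typing import List, Optional, Tuple


def parse_cache_folder(name: str) -> Optional[Tuple[str, str]]:
    """Turn 'models--org--name' into ('model', 'org/name'); None for other folders."""
    parts = name.split("--", 1)
    if len(parts) != 2 or parts[0] not in ("models", "datasets", "spaces"):
        return None
    # Assumption: the first remaining '--' separates organisation and repository name.
    return parts[0][:-1], parts[1].replace("--", "/", 1)


def list_cached_repos(cache_dir: Optional[Path] = None) -> List[Tuple[str, str]]:
    """List (type, repo_id) pairs found in the cache directory, if it exists."""
    if cache_dir is None:
        default = Path.home() / ".cache" / "huggingface" / "hub"
        cache_dir = Path(os.environ.get("HF_HUB_CACHE", default))
    if not cache_dir.is_dir():
        return []
    found = (parse_cache_folder(p.name) for p in sorted(cache_dir.iterdir()) if p.is_dir())
    return [repo for repo in found if repo is not None]


for repo_type, repo_id in list_cached_repos():
    print(repo_type, repo_id)
```

Unlike `huggingface-cli scan-cache`, this sketch does not report sizes or revisions; it only lists which repositories are present.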