# Different ways to call the Llama models offline

**Information**

To call the `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-3B-Instruct` or `meta-llama/Llama-3.2-1B-Instruct` models from Hugging Face offline, you can use several different methods depending on your preferences and technical requirements.

Here I show only one of the most common approaches:

1. Using the Hugging Face Transformers Library


Remark: Llama models are published under the **META LLAMA 3 COMMUNITY LICENSE AGREEMENT** (the 3.1 and 3.2 models ship with their own versions of this license; see the respective model pages). The license grants users a non-exclusive, royalty-free license (you do not need to pay ongoing fees) to use, modify, and distribute the Llama materials, with requirements for attribution and naming conventions when creating derivative works. Licensees whose products or services have more than 700 million monthly active users need a separate license from Meta, and Meta disclaims all warranties and limits its liability for any use of the materials.


*** 
**Background information**

* See the individual model pages in the subsections below
* You could also run `meta-llama/Meta-Llama-3-70B-Instruct`, see the model page: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
    + Hugging Face documentation for the Llama 3 models: https://huggingface.co/docs/transformers/main/en/model_doc/llama3


For more details, check out the documentation of the transformers library: https://huggingface.co/docs/hub/transformers. Pipelines simplify everything (you only need to specify the task; everything else is configured automatically in the background): https://huggingface.co/docs/transformers/main_classes/pipelines


It is highly recommended to **use a small LLM in general**, because with some approaches the model weights are downloaded and stored on your computer. It is highly likely that you do not have sufficient CPU, RAM, and disk storage to run the larger (70B) model locally! See Llama 3.1 requirements: https://llamaimodel.com/requirements/
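
As a rough back-of-the-envelope check of why the larger models rarely fit on a personal computer, the memory needed for the weights alone can be estimated as the number of parameters times the bytes per parameter. The sketch below assumes this simple rule and the nominal parameter counts (1B, 3B, 8B, 70B); the helper name `weight_memory_gb` is hypothetical. It ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return n_params_billion * 1e9 * bytes_per_param / 1e9


# Nominal model sizes (assumption: the real parameter counts differ slightly)
for name, size in [("1B", 1), ("3B", 3), ("8B", 8), ("70B", 70)]:
    print(
        f"{name}: ~{weight_memory_gb(size):.0f} GB in 16-bit, "
        f"~{weight_memory_gb(size, 4):.1f} GB in 4-bit"
    )
```

Under these assumptions, an 8B model in 16-bit precision needs roughly 16 GB for its weights, while a 70B model needs roughly 140 GB.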



***
**Aim of the code template**

Exemplify different approaches to call Llama (LLMs) offline.

# Check whether you can use your GPU or need to fall back on your CPU

For non-Macs:

> torch.cuda.is_available(): This function checks if a CUDA-compatible GPU is available on the system. CUDA, or Compute Unified Device Architecture, is NVIDIA’s parallel computing architecture. If torch.cuda.is_available() returns True, it means you have a CUDA-enabled GPU, and PyTorch can use it to accelerate computations.

For Macs:

> torch.backends.mps.is_available(): This function checks if the system supports Apple’s Metal Performance Shaders (MPS) backend, an alternative to CUDA on Apple hardware. MPS is designed for Apple’s GPUs, especially on Macs with M1/M2 chips. If torch.backends.mps.is_available() returns True, PyTorch can leverage Apple's MPS for faster computations on those GPUs.


**Aim:** By checking these options, the code determines the best available hardware accelerator for running PyTorch computations, allowing the model to run faster on supported GPUs compared to CPUs.

In [None]:
import torch

# Check whether you can use a Graphics Processing Unit (GPU); otherwise fall back on your Central Processing Unit (CPU):
#> CUDA: Compute Unified Device Architecture (NVIDIA GPUs)
print("torch.cuda.is_available():", torch.cuda.is_available())
#> MPS: Apple Metal Performance Shaders
print("torch.backends.mps.is_available():", torch.backends.mps.is_available())
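
The two checks can be combined into a single device choice. The following is a minimal sketch, assuming the preference order CUDA, then MPS, then CPU; the helper name `choose_device` is hypothetical, and the code falls back to the CPU if PyTorch is not installed.

```python
def choose_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Pick the fastest available backend: CUDA first, then Apple MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


try:
    import torch

    device = choose_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # Assumption for this sketch: without PyTorch we simply report the CPU
    device = "cpu"

print("selected device:", device)
```

The resulting string could then be passed as the `device` argument of the pipelines used later in this notebook instead of the hard-coded `"cpu"`.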

# Environment Setup

## Load necessary libraries:

In [None]:
# loaded within the single code chunks

## Get API key(s)


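Normally you do not need to provide your Hugging Face key. Note, however, that the `meta-llama` repositories are gated: the first download requires that you accept the license on the model page and authenticate with an access token.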

In [None]:
import os
import sys

# Assuming 'src' is a sibling of the current working directory
path_to_src = os.path.join('..','src')  # Moves one level up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
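
An alternative to a local key module is to keep the token in an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; the helper name `read_hf_token` is hypothetical and the code only reads the variable, it does not contact Hugging Face.

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = read_hf_token()
if token is None:
    print("No token found in the HF_TOKEN environment variable.")
else:
    print("Token found in the HF_TOKEN environment variable.")
```

If a token is found, it could be passed to `huggingface_hub.login(token=token)` before the first download of a gated model.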

Create a simple prompt that is identical for all of the following approaches:

In [None]:
# Create prompts
system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals, provide only the names separated by commas (,).
"""

# Local approaches using the Hugging Face Transformers Library

see documentation: https://huggingface.co/docs/transformers/index
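
The text-generation examples in this section pass the prompt as a list of chat messages, each a dictionary with a `role` and a `content` key. As a minimal sketch, the `system_content` and `user_content` strings defined earlier could be wrapped like this; the helper `build_messages` is a hypothetical name, and the two strings are repeated here only to keep the example self-contained.

```python
def build_messages(system_content: str, user_content: str) -> list:
    """Wrap a system prompt and a user prompt in the chat-message format."""
    return [
        {"role": "system", "content": system_content.strip()},
        {"role": "user", "content": user_content.strip()},
    ]


# Repeated from the prompt cell above so that this sketch runs on its own
system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals, provide only the names separated by commas (,).
"""

messages = build_messages(system_content, user_content)
print(messages)
```

The resulting `messages` list could replace the pirate example in the cells below.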

## for text generation / summarization


### Llama-3.1-8B-Instruct
see: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct


code adapted from: https://huggingface.co/blog/llama31

> The following snippet shows how to use meta-llama/Meta-Llama-3.1-8B-Instruct. It requires about 16 GB of VRAM, which fits many consumer GPUs. The same snippet works for meta-llama/Meta-Llama-3.1-70B-Instruct, which, at 140GB of VRAM & meta-llama/Meta-Llama-3.1-405B-Instruct (requiring 810GB VRAM), makes it a very interesting model for production use cases. Memory consumption can be further reduced by loading in 8-bit or 4-bit mode.

In [None]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.1-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")
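
The timer above measures model loading and text generation together. To see how much time each step takes, the two could be timed separately. The following is a minimal sketch using a small context manager; the name `timed` is hypothetical, and `time.sleep` stands in for the real loading or generation step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Measure the wall-clock time of the enclosed block and store it under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.2f} seconds")


timings = {}
with timed("placeholder step", timings):
    time.sleep(0.1)  # stand-in for pipeline(...) or pipe(messages, ...)
```

In the notebook, one `with timed(...)` block would wrap the `pipeline(...)` call and a second one the `pipe(...)` call.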

Remark: QLoRA reduces the memory usage of LLM fine-tuning without performance trade-offs compared to standard 16-bit fine-tuning; for details see: https://huggingface.co/blog/4bit-transformers-bitsandbytes. The next cell uses the same kind of 4-bit quantization to load the model for inference.

In [None]:
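# 4-bit loading relies on bitsandbytes, which is typically run on a CUDA GPU
if torch.cuda.is_available():
    pipe = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={
            "torch_dtype": torch.bfloat16,
            "quantization_config": {"load_in_4bit": True},
        },
        device_map="auto",  # place the quantized model on the GPU instead of the CPU
    )
else:
    print("not possible on your hardware")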

### Llama-3.2-3B-Instruct
see: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

In [None]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

In [11]:
from transformers import pipeline
import torch

# Define the local model function
def local_model_call(prompt_template, context, question, model_id="meta-llama/Llama-3.2-3B-Instruct", max_tokens=256):
    # Combine the prompt_template, context, and question
    combined_prompt = prompt_template.format(context=context, question=question)
    
    # Initialize the model pipeline
    pipe = pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device="cpu"  # Use CPU
    )

    # Call the model with the combined prompt
    outputs = pipe(
        combined_prompt,
        max_new_tokens=max_tokens,
        do_sample=False,
    )
    
    # Return the raw pipeline output: a list of dicts, each with a "generated_text" key
    # (for a plain string prompt, "generated_text" contains the prompt plus the completion)
    response = outputs
    return response

# Your custom PROMPT_TEMPLATE
PROMPT_TEMPLATE = """
Write a concise, well-structured scientific paragraph based strictly on the following guidelines:

Structure:
1. **Topic Sentence (Beginning)** – Clearly introduce the main idea or core message of the paragraph concisely and scientifically. It must directly relate to the question and the context.
2. **Body (Middle)** – Develop the idea with logical reasoning, relevant evidence, explanations, and examples drawn from the provided context. Ensure a coherent flow of sentences.
3. **Conclusion (End)** – Summarize or reinforce the main point, aligning it with the topic sentence without introducing new ideas.

Additional instructions:
- **Only use the provided context** to generate the response. No external knowledge should be included.
- The paragraph must **directly answer the research question** using evidence and information from the context.
- Maintain a **formal, precise, and scientific** tone.
- The paragraph must focus on the research question and **follow a logical flow**.

- **Do not** explicitly mention structural terms like "topic sentence," "body," or "conclusion."
- **Do not** include introductory phrases such as "Based on the provided context" or "The literature suggests." 

- **Return only the scientific paragraph** with no additional commentary.
- **Do not repeat the structural instructions** or the **additional instructions** in the response.

---

**Context:**  
{context}

**Research Question:**  
{question}

---

**Scientific Paragraph Response:**  
"""

# Example context and question
context = """
Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-correction. AI has applications in various fields, such as robotics, machine learning, and natural language processing.
"""
question = "What is Artificial Intelligence?"

# Call the local model function
response = local_model_call(PROMPT_TEMPLATE, context, question)

# Print the result
print("Model Response:")
print(response)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Model Response:
[{'generated_text': '\nWrite a concise, well-structured scientific paragraph based strictly on the following guidelines:\n\nStructure:\n1. **Topic Sentence (Beginning)** – Clearly introduce the main idea or core message of the paragraph concisely and scientifically. It must directly relate to the question and the context.\n2. **Body (Middle)** – Develop the idea with logical reasoning, relevant evidence, explanations, and examples drawn from the provided context. Ensure a coherent flow of sentences.\n3. **Conclusion (End)** – Summarize or reinforce the main point, aligning it with the topic sentence without introducing new ideas.\n\nAdditional instructions:\n- **Only use the provided context** to generate the response. No external knowledge should be included.\n- The paragraph must **directly answer the research question** using evidence and information from the context.\n- Maintain a **formal, precise, and scientific** tone.\n- The paragraph must focus on the research 

In [15]:
response[0]["generated_text"]

'\nWrite a concise, well-structured scientific paragraph based strictly on the following guidelines:\n\nStructure:\n1. **Topic Sentence (Beginning)** – Clearly introduce the main idea or core message of the paragraph concisely and scientifically. It must directly relate to the question and the context.\n2. **Body (Middle)** – Develop the idea with logical reasoning, relevant evidence, explanations, and examples drawn from the provided context. Ensure a coherent flow of sentences.\n3. **Conclusion (End)** – Summarize or reinforce the main point, aligning it with the topic sentence without introducing new ideas.\n\nAdditional instructions:\n- **Only use the provided context** to generate the response. No external knowledge should be included.\n- The paragraph must **directly answer the research question** using evidence and information from the context.\n- Maintain a **formal, precise, and scientific** tone.\n- The paragraph must focus on the research question and **follow a logical flow
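
As the output above shows, when the pipeline receives a plain string, `generated_text` contains the prompt followed by the completion. A minimal sketch for keeping only the completion is given below. It assumes that the returned text starts with the exact prompt; the helper `strip_prompt` is a hypothetical name and the example strings are made up.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Made-up stand-in for the pipeline output structure shown above
prompt = "Question: What is AI?\nAnswer:"
outputs = [{"generated_text": prompt + " The simulation of human intelligence by machines."}]

print(strip_prompt(outputs[0]["generated_text"], prompt))
```

Alternatively, the text-generation pipeline accepts `return_full_text=False` in the call, which returns only the newly generated text.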

### Llama-3.2-1B-Instruct

see: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

In [None]:
from transformers import pipeline
import torch
import time


# Start the timer
start_time = time.time()

model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cpu" # Use CPU
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

## for generating word embeddings


see docs: https://huggingface.co/sentence-transformers


However, I use a relatively weak model here:

![embedder model](pics/relative%20weak%20model.JPG)

see leaderboard: https://huggingface.co/spaces/mteb/leaderboard

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer

# Define sentences
sentences = [
    "I feel great this morning",
    "I am feeling very good today",
    "I am feeling terrible"
]

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract features
features = model.encode(sentences)


# Print the features as a pandas dataframe
pd.DataFrame(features, index=sentences)

In [None]:
similarities = model.similarity(features, features)
print(similarities)
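
To my understanding, `model.similarity` uses cosine similarity by default. To make explicit what is being computed, the following sketch calculates a cosine similarity matrix in plain Python on made-up two-dimensional vectors; the function names are hypothetical and the toy vectors are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Made-up toy vectors standing in for sentence embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in similarity_matrix(toy_vectors):
    print([round(value, 3) for value in row])
```

Values close to 1 indicate vectors pointing in a similar direction, values close to 0 indicate unrelated directions.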

## for converting text to audio (Text2Speech)

see: https://huggingface.co/suno/bark-small


* documentation on GitHub: https://github.com/suno-ai/bark
* for different speakers, see: https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c

In [None]:
from transformers import AutoProcessor, AutoModel
import time


# Start the timer
start_time = time.time()


processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")

voice_preset = "v2/en_speaker_5"

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
    voice_preset=voice_preset
)

speech_values = model.generate(**inputs, do_sample=True)

# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")

In [None]:
from IPython.display import Audio

# play within Jupyter notebook:
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

In [None]:
import soundfile as sf
# Save as a WAV file
sf.write("sample_suno.wav", speech_values.cpu().numpy().squeeze(), sampling_rate)

## for converting audio to text (Speech2Text)

see: https://huggingface.co/openai/whisper-large-v3-turbo

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import soundfile as sf
import time


# Start the timer
start_time = time.time()



device = "cuda:0" if torch.cuda.is_available() else "cpu"
print("device:", device)

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    # Optional: translate the audio file instead of transcribing it. Otherwise Whisper predicts the
    # language of the source audio automatically and the output text is in that same language.
    # generate_kwargs={"task": "translate", "language": "german"}
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]


# Get the audio sample data
audio_array = sample["array"]
sample_rate = sample["sampling_rate"]
# Save as a WAV file
sf.write("sample_audio.wav", audio_array, sample_rate)

# Transcribe to text
result = pipe(sample)
print(result["text"])


# End the timer and print the runtime
end_time = time.time()
print(f"Execution time: {end_time - start_time:.2f} seconds")
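
Because the pipeline is created with `return_timestamps=True`, the result should contain, besides `"text"`, a `"chunks"` list in which each entry holds a `"timestamp"` pair (start, end) and the corresponding `"text"`. The following sketch formats such chunks for reading; the helper `format_chunks` is a hypothetical name and the sample data are made up.

```python
def format_chunks(chunks):
    """Turn timestamped chunks into readable '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end time can be missing, for example for the final chunk
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f}s - {end_label}s] {chunk['text'].strip()}")
    return lines


# Made-up example in the same structure as result["chunks"]
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, None), "text": " How are you?"},
]
for line in format_chunks(sample_chunks):
    print(line)
```

In the notebook, `format_chunks(result["chunks"])` would be called on the output of the cell above.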

## for classification tasks


### small LLM (67M params) for fine-tuning

see: https://huggingface.co/distilbert/distilbert-base-uncased
and documentation: https://huggingface.co/docs/transformers/model_doc/distilbert

> a small, fast, cheap and light Transformer model trained by distilling BERT base, can be used either for masked language modeling or next sentence prediction, but it's mostly **intended to be fine-tuned on a downstream task**

In [None]:
import torch
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

### fine-tuned LLM (RoBERTa-large) for sentiment analysis


see: https://huggingface.co/siebert/sentiment-roberta-large-english

* the model predicts whether a text has positive (1) or negative (0) sentiment

> this model ("SiEBERT", prefix for "Sentiment in English") is a fine-tuned checkpoint of RoBERTa-large (Liu et al. 2019). It enables reliable binary sentiment analysis for various types of English-language text.

In [None]:
from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english")
print(sentiment_analysis("I love this!"))
# do not blindly trust the model:
print(sentiment_analysis("Sometimes I love this, sometimes I hate this!"))
print(sentiment_analysis("Sometimes I hate this, sometimes I love this!"))
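
The pipeline returns a list of dictionaries with a `label` and a `score`. One simple way to act on the advice not to trust the model blindly is to flag predictions whose score falls below a threshold. The sketch below assumes a threshold of 0.9, which is an arbitrary choice; the helper `flag_uncertain` is a hypothetical name and the example scores are made up, not real model output.

```python
def flag_uncertain(predictions, threshold=0.9):
    """Add an 'uncertain' flag to every prediction whose score is below the threshold."""
    return [{**p, "uncertain": p["score"] < threshold} for p in predictions]


# Made-up predictions in the same structure as the pipeline output
example_predictions = [
    {"label": "POSITIVE", "score": 0.998},
    {"label": "NEGATIVE", "score": 0.62},
]
for prediction in flag_uncertain(example_predictions):
    print(prediction)
```

A high score does not guarantee a correct label, so ambiguous sentences like the two above still deserve a manual check.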

# Get information about which models you have downloaded

see: https://huggingface.co/docs/huggingface_hub/guides/manage-cache

In the terminal:

```

huggingface-cli scan-cache

```
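
The same information can also be read from Python. The following is a minimal sketch that uses `scan_cache_dir` from `huggingface_hub` and lists the cached repositories by size; the helper `summarize_cache` is a hypothetical name, and the code falls back to an empty list if the package or the cache directory is missing.

```python
def summarize_cache(repos):
    """Format (repo_id, size_in_bytes) pairs as lines, largest repository first."""
    ordered = sorted(repos, key=lambda item: item[1], reverse=True)
    return [f"{repo_id}: {size / 1e9:.2f} GB" for repo_id, size in ordered]


try:
    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir()
    repos = [(repo.repo_id, repo.size_on_disk) for repo in cache_info.repos]
except Exception:
    # Assumption for this sketch: treat a missing package or cache as "nothing cached"
    repos = []

for line in summarize_cache(repos):
    print(line)
```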