# Different ways to call the Llama models offline

**Information**

To call the `meta-llama/Llama-3.1-8B-Instruct` model from Hugging Face offline, you can use several different methods depending on your preferences and technical requirements. 

Here are the most common approaches:

1. Using the Hugging Face Transformers Library



Remark: Llama models are published under the **META LLAMA 3 COMMUNITY LICENSE AGREEMENT**. The Meta Llama 3 Community License grants users a non-exclusive, royalty-free license (you do not need to pay ongoing fees) to use, modify, and distribute the Llama 3 materials, with requirements for attribution and naming conventions when creating derivative works. Licensees whose products or services have more than 700 million monthly active users need a separate license from Meta, and Meta disclaims all warranties and limits its liability for any use of the materials.


*** 
**Background information**

* ...


***
**Coding sources**

* You can run `meta-llama/Llama-3.1-8B-Instruct`; see the model page: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
* You could also run `meta-llama/Meta-Llama-3-70B-Instruct`; see the model page: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
    + Hugging Face documentation: https://huggingface.co/docs/transformers/main/en/model_doc/llama3



***
**Aim of the code template**

Exemplify different approaches to call Llama (LLMs) offline.

# Environment Setup

## Define model

It is highly recommended to use a small LLM, because with some approaches the model weights are downloaded and stored on your computer. It is very likely that you do not have sufficient CPU, RAM, and disk storage to run the larger (70B) model locally! See the Llama 3.1 requirements: https://llamaimodel.com/requirements/
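The sizing concern above can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes nominal parameter counts of 8 and 70 billion and counts only the weights (no activations, KV cache, or framework overhead); the helper name `estimate_weight_memory_gb` is hypothetical and not part of any library.

```python
# Bytes needed to store one parameter for a few common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(n_params: float, dtype: str = "float16") -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts (assumption: 8e9 and 70e9, rounded)
for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    size_gb = estimate_weight_memory_gb(n_params, "float16")
    print(f"{name}: roughly {size_gb:.0f} GB of weights in float16")
```

Under these assumptions the 8B model needs about 16 GB for its float16 weights and the 70B model about 140 GB, which is why the smaller model is the practical choice on a typical workstation.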

In [1]:
model_name = "meta-llama/Llama-3.1-8B-Instruct" # meta-llama/Meta-Llama-3-70B-Instruct

## Load necessary libraries:

In [None]:
# loaded within the single code chunks

## Get API key(s)

In [3]:
import os
import sys

# Assuming 'src' sits next to the current directory (one level up, then into 'src')
path_to_src = os.path.join('..', 'src')  # Moves one level up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
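
As an alternative to a key module, the token could be read from an environment variable, which keeps secrets out of the repository. This is only a sketch: the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption about how the environment is configured.

```python
import os
from typing import Optional


def get_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in the given environment variable, or None if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = get_hf_token()
print("token found" if token else "no token set")
```

The returned value could then be passed wherever the notebook uses `key.hugging_api_key_pro2`.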

Create a simple pair of prompts, which is identical for all of the following approaches:

In [4]:
# Create prompts
system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals, provide only the names separated by commas (,).
"""
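
Chat-style interfaces expect the two prompts as a list of role/content dictionaries. A minimal sketch of that conversion follows; the helper name `build_messages` is hypothetical, and the prompt strings are repeated so the block runs on its own.

```python
from typing import Dict, List


def build_messages(system_content: str, user_content: str) -> List[Dict[str, str]]:
    """Wrap a system prompt and a user prompt in the role/content chat format."""
    return [
        {"role": "system", "content": system_content.strip()},
        {"role": "user", "content": user_content.strip()},
    ]


system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals, provide only the names separated by commas (,).
"""

messages = build_messages(system_content, user_content)
print(messages)
```

Stripping the surrounding whitespace avoids sending the leading newline of the triple-quoted string to the model.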

# Local approaches 

## Using the Hugging Face Transformers Library

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


# Load model and tokenizer; weights that do not fit on GPU/CPU can be offloaded to disk.
# Note: `disk_offload` is not a valid `from_pretrained` argument (it raises a TypeError);
# `offload_folder` names the directory for offloaded weights (folder name chosen here).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             torch_dtype=torch.float16,
                                             offload_folder="offload")

# Generate text
input_text = "Once upon a time,"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # follow the model's device
output = model.generate(**inputs, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

# Generate text
input_text = "Once upon a time,"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # works with or without a GPU
output = model.generate(**inputs, max_length=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.

In [14]:
from transformers import AutoTokenizer

# Try a different model
tokenizer = AutoTokenizer.from_pretrained("gpt2")




In [12]:
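from huggingface_hub import scan_cache_dir

# `transformers` does not export `cached_file` (the import raises an ImportError);
# the local Hub cache is inspected and cleaned through `huggingface_hub` instead.
cache_info = scan_cache_dir()
revisions = [rev.commit_hash for repo in cache_info.repos if repo.repo_id == model_name for rev in repo.revisions]
if revisions:
    cache_info.delete_revisions(*revisions).execute()  # removes the cached files of this model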

In [1]:
from transformers import file_utils
print(file_utils.default_cache_path)

C:\Users\fenn\.cache\huggingface\hub
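
Before going offline it helps to know whether the model is already in this cache. The sketch below relies on the Hub cache layout, where each model repository is stored in a folder named `models--<org>--<name>` containing a `snapshots` subfolder. The helper names are hypothetical, and the check is only a heuristic: a snapshot folder can exist even if a download was incomplete.

```python
from pathlib import Path

# Default Hub cache root; assumed to match the path printed above
DEFAULT_CACHE_ROOT = Path.home() / ".cache" / "huggingface" / "hub"


def hub_cache_dir_for(model_id: str, cache_root: Path = DEFAULT_CACHE_ROOT) -> Path:
    """Return the cache folder of a model repo, e.g. models--org--name."""
    return cache_root / ("models--" + model_id.replace("/", "--"))


def is_model_cached(model_id: str, cache_root: Path = DEFAULT_CACHE_ROOT) -> bool:
    """True if at least one snapshot folder exists for the model (heuristic only)."""
    snapshots = hub_cache_dir_for(model_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


model_id = "meta-llama/Llama-3.1-8B-Instruct"
print(hub_cache_dir_for(model_id))
print("cached:", is_model_cached(model_id))
```

If the model is cached, passing `local_files_only=True` to `from_pretrained` should keep the load from contacting the Hub.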


In [13]:
from transformers import AutoTokenizer

# LlamaTokenizer fails here with "TypeError: not a string": the checkpoint ships a
# 'PreTrainedTokenizerFast' tokenizer. AutoTokenizer selects the matching class instead.
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", force_download=True, token=key.hugging_api_key_pro2)

In [20]:
from transformers import AutoTokenizer

try:
    # AutoTokenizer instead of LlamaTokenizer, which raises "TypeError: not a string" for this checkpoint
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", token=key.hugging_api_key_pro2)
except OSError as e:
    print("Error loading tokenizer:", e)

In [4]:
# Use a pipeline as a high-level helper


from transformers import pipeline

messages = [{"role": "user", "content": "Who are you?"}]
pipe = pipeline("text-generation", model=model_name, token=key.hugging_api_key_pro)
response = pipe("Who are you?", max_length=50)
print(response[0]["generated_text"])
print(pipe(messages))


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


KeyboardInterrupt: 

In [26]:
from transformers import pipeline

# Simple test to ensure the pipeline works
pipe = pipeline("text-generation", model="gpt2")
result = pipe("Hello, I'm a model,", max_length=50)
print(result)


RuntimeError: Failed to import transformers.models.gpt2.modeling_gpt2 because of the following error (look up to see its traceback):
cannot import name 'is_torchao_available' from 'transformers.utils' (c:\Users\fenn\AppData\Local\R-MINI~1\Lib\site-packages\transformers\utils\__init__.py)

In [25]:
# A relative import (`from .utils import ...`) only works inside a package and raises
# "ImportError: attempted relative import with no known parent package" in a notebook.
# The absolute import below tests the same name; if it fails, the installed
# `transformers` package may be inconsistent and may need to be reinstalled.
import transformers
print(transformers.__version__)

from transformers.utils import is_torchao_available

In [25]:
from huggingface_hub import InferenceClient
# additional imports
import textwrap


# Initialize client
# Note: InferenceClient sends the request to a hosted inference service, so this call is not offline.
client = InferenceClient(token=key.hugging_api_key_pro2)

# Create prompts
system_content = "You are a helpful assistant."
user_content = """
   Please name five animals living in the deep sea. Only name animals separated by commas.
"""

# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=500,
    stream=False,
    temperature=0.0
)

# Accessing the text in the output object
text = output.choices[0].message.content

# Printing the output in a more readable format
print('\n'.join(textwrap.wrap(text, 100)))

Anglerfish, vampire squid, frilled shark, giant isopod, deep-sea jellyfish, colosal squid.
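
Because the prompt asks for a comma-separated list, the reply can be turned into a Python list and checked against the requested count. The following is a minimal sketch with a hypothetical helper `parse_comma_list`; the reply string is copied from the output above.

```python
from typing import List


def parse_comma_list(text: str) -> List[str]:
    """Split a comma-separated reply into clean items, dropping a trailing period."""
    items = (part.strip().rstrip(".").strip() for part in text.split(","))
    return [item for item in items if item]


reply = "Anglerfish, vampire squid, frilled shark, giant isopod, deep-sea jellyfish, colosal squid."
animals = parse_comma_list(reply)

print(animals)
print("requested 5, received", len(animals))
```

A count check like this flags that the reply above contains six names although five were requested.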
