This notebook loads the Granite 3.1 model and its tokenizer, tokenizes an input query, and moves the tokens to the same device as the model (MPS or CPU). It then generates a response, decodes the output into human-readable text, and prints the result. Device placement is managed explicitly so that a large model stays within the available memory.

- Load the 1 billion parameter model and tokenizer from Hugging Face using AutoModelForCausalLM and AutoTokenizer.
- Automatically select the device (MPS for Apple Silicon or CPU) based on availability.
- Set memory limits for both MPS and CPU (4GB each in the code below) so that total usage stays within 8GB of system memory.
- Use accelerate's infer_auto_device_map and dispatch_model to optimize model layer distribution across devices.
- Prevent layer splitting for specific module classes like GPTNeoXLayer to improve performance and stability.
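
To see why a per-device memory limit matters, the weight footprint can be estimated from the parameter count and the data type. The following is a rough sketch, not part of the original notebook: `estimate_weight_memory_gb` and `BYTES_PER_PARAM` are illustrative names, the parameter count is assumed to be the round 1 billion quoted above, and activations, caches, and framework overhead are ignored.

```python
# Bytes per parameter for common dtypes (standard element sizes)
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the weights in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    num_params = 1_000_000_000  # assumption: the round figure from the text
    for dtype in ("float32", "float16"):
        size = estimate_weight_memory_gb(num_params, dtype)
        print(f"{dtype}: ~{size:.1f} GB of weights")
```

Under these assumptions, float32 weights alone take about 4 GB, which would fill one 4GB budget entirely. That is the motivation for letting layers spill over from MPS to CPU rather than placing everything on a single device.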


In [1]:
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import infer_auto_device_map, dispatch_model

# Define the model path once; it is used to load both the model weights and the tokenizer
# "ibm-granite/granite-3.1-1b-a400m-base" is a model with 1 billion parameters
model_path = "ibm-granite/granite-3.1-1b-a400m-base"

# Load the pre-trained model weights using the Hugging Face Transformers library
model = AutoModelForCausalLM.from_pretrained(model_path)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Check if the MPS (Metal Performance Shaders) backend is available, useful for Apple Silicon devices (like M1, M2 chips)
# If MPS is available, use it, otherwise, fall back to CPU 
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Move the model to the selected device (MPS or CPU) based on availability
model.to(device)

# Define memory limits for MPS and CPU (4GB each, 8GB in total) to avoid exhausting system memory
# Even a model with 1 billion parameters requires careful memory management at this budget
max_memory = {
    "mps": "4GB",  # Allocate up to 4GB for MPS (GPU)
    "cpu": "4GB"   # Allocate up to 4GB for CPU
}

# Generate a device map for layer distribution across available devices (MPS and CPU)
# The model's layers will be split across devices so that memory usage stays within the defined limits


Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.07s/it]
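
The idea of distributing layers under `max_memory` can be illustrated with a simplified greedy placement. This is a conceptual sketch only, not accelerate's actual algorithm: `place_layers`, the layer names, and the layer sizes are hypothetical, and the real `infer_auto_device_map` handles further details such as modules that must not be split.

```python
from typing import Dict, List, Tuple


def place_layers(
    layers: List[Tuple[str, float]], budgets: Dict[str, float]
) -> Dict[str, str]:
    """Assign layers, in order, to devices, moving on when a budget is full.

    layers:  (name, size_in_gb) pairs in execution order.
    budgets: device name -> capacity in GB, listed in priority order.
    """
    devices = list(budgets)
    remaining = dict(budgets)
    placement: Dict[str, str] = {}
    idx = 0  # index of the device currently being filled
    for name, size in layers:
        # Skip ahead to the next device until the whole layer fits
        while idx < len(devices) and remaining[devices[idx]] < size:
            idx += 1
        if idx == len(devices):
            raise MemoryError(f"layer {name!r} does not fit in any remaining budget")
        remaining[devices[idx]] -= size
        placement[name] = devices[idx]
    return placement


if __name__ == "__main__":
    # Hypothetical model: four layers of 1.5 GB each
    layers = [(f"layer_{i}", 1.5) for i in range(4)]
    print(place_layers(layers, {"mps": 4.0, "cpu": 4.0}))
```

In this toy setup the first two layers land on `mps` and the remaining two on `cpu`. A layer is never cut in half: if it does not fit in what is left of one device, it moves to the next one as a whole, which is the same intent as preventing splits for specific module classes.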


Example: model inference with tokenization and text generation

- Tokenize the input query using the model's tokenizer.
- Move the input tokens to the same device as the model to ensure compatibility.
- Generate output tokens with the model, capping the total sequence length.
- Decode the output tokens into human-readable text and print the result.


In [2]:
# Import the tokenizer class from Hugging Face Transformers
from transformers import AutoTokenizer

# Load the tokenizer for the pre-trained model (model_path is defined in the previous cell)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Input text for the model
input_text = "What's the capital of India"

# Tokenize input text and return as PyTorch tensors
input_tokens = tokenizer(input_text, return_tensors="pt")

# Ensure input tokens are moved to the same device as the model
input_tokens = {key: value.to(next(model.parameters()).device) for key, value in input_tokens.items()}

# Generate output tokens; max_length=50 caps the total sequence length (prompt tokens plus generated tokens)
output_tokens = model.generate(**input_tokens, max_length=50)

# Decode the output tokens into readable text, skipping special tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

# Print the generated output
print(output_text)


What's the capital of India?

The capital of India is New Delhi.

What is the capital of the United States?

The capital of the United States is Washington, D.C.

What
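
The output above stops mid-sentence at "What", most likely because generation ran out of its length budget. In `generate`, `max_length` counts prompt tokens plus generated tokens, whereas `max_new_tokens` counts only generated tokens. The helper below is a hypothetical illustration of that arithmetic: `new_token_budget` is not a Transformers function, and the prompt length of 7 is an assumed example value, not one measured with the tokenizer.

```python
from typing import Optional


def new_token_budget(
    prompt_len: int,
    max_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> int:
    """Return how many tokens may be generated after a prompt of prompt_len tokens."""
    if max_new_tokens is not None:
        # max_new_tokens does not depend on the prompt length
        return max_new_tokens
    if max_length is None:
        raise ValueError("set either max_length or max_new_tokens")
    # max_length caps prompt + generated tokens together
    return max(max_length - prompt_len, 0)


if __name__ == "__main__":
    prompt_len = 7  # assumed example value
    print(new_token_budget(prompt_len, max_length=50))
    print(new_token_budget(prompt_len, max_new_tokens=50))
```

With a longer prompt, the same `max_length=50` leaves fewer tokens for the answer, so passing `max_new_tokens` to `generate` is often the more predictable choice when the goal is to bound the length of the response itself.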
