In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]:
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


In [4]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)

In [6]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

You are not running the flash-attention implementation, expect numerical differences.


 Why did the chicken join the band? Because it had the drumsticks!


In [7]:
print(generator.model.device)


cuda:0


In [8]:
import torch

print(torch.cuda.is_available())  # Returns True if a GPU is available
print(torch.cuda.current_device())  # Index of the current GPU (if available)
print(torch.cuda.get_device_name(0))  # Name of the GPU (if available)

True
0
NVIDIA GeForce RTX 4090


In [11]:
# Get the number of parameters of the model loaded earlier
num_params = model.num_parameters()
print(f"Number of parameters: {num_params/1e9:.2f}B")

Number of parameters: 3.82B


In [16]:
# Assuming 2 bytes per parameter (FP16; BF16 has the same size)
dtype_size_in_bytes = 2
total_memory_in_bytes = num_params * dtype_size_in_bytes
total_memory_in_MB = int(total_memory_in_bytes / (1024 ** 2))
print(f"{total_memory_in_MB:,} MB")

7,288 MB
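
The same arithmetic can be applied to other precisions. The following is a minimal sketch, not part of the original notebook: the helper name `estimate_weight_memory_mib` and the `BYTES_PER_PARAM` table are illustrative assumptions, and the parameter count is the rounded 3.82B figure printed above, so the results are approximate. It counts only the weights, not activations or framework overhead.

```python
# Bytes per parameter for some common dtypes (illustrative table)
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_memory_mib(num_params, dtype="float16"):
    """Return the approximate size of the weights in MiB (1 MiB = 1024**2 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


approx_num_params = 3.82e9  # rounded value from the output above
for dtype in BYTES_PER_PARAM:
    size = estimate_weight_memory_mib(approx_num_params, dtype)
    print(f"{dtype:>8}: {size:,.0f} MiB")
```

Halving the bytes per parameter halves the estimate, which is why lower-precision formats are a common way to fit a model on a smaller GPU.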


# GPU memory
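The GPU memory that PyTorch holds is the reserved memory, as reported by `torch.cuda.memory_reserved()`. Tools such as `nvidia-smi` typically show a somewhat higher figure because they also count memory outside PyTorch's allocator, such as the CUDA context. Reserved memory includes both:

- Allocated Memory: The memory currently occupied by live tensors (e.g., model parameters, activations, and, during training, gradients). This is what `torch.cuda.memory_allocated()` reports.
- Unused Reserved Memory: The portion of reserved memory that no tensor currently occupies but that PyTorch has already obtained from the CUDA driver and keeps for future allocations.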

## Why Does This Happen?
Frameworks like PyTorch use a caching allocator: they request memory from the CUDA driver in larger blocks and keep freed blocks for reuse instead of returning them immediately. This serves to:

- Avoid frequent allocations and deallocations through the driver, which are slow and can lead to fragmentation.
- Serve later allocations quickly from a pool of cached blocks.
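
To make the difference between allocated and reserved memory concrete, here is a toy simulation of a caching allocator. It is a deliberately simplified sketch and not PyTorch's actual implementation: the class `ToyCachingAllocator` is hypothetical, sizes are plain integers, and a cached block is handed out whole rather than split.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (not PyTorch's real implementation)."""

    def __init__(self):
        self.reserved = 0      # bytes obtained from the (simulated) driver
        self.allocated = 0     # bytes currently handed out to tensors
        self.free_blocks = []  # sizes of cached blocks available for reuse

    def malloc(self, size):
        # Reuse the smallest cached block that is large enough, if any.
        candidates = [b for b in self.free_blocks if b >= size]
        if candidates:
            block = min(candidates)
            self.free_blocks.remove(block)
        else:
            # No suitable cached block: ask the driver for new memory.
            block = size
            self.reserved += block
        self.allocated += block
        return block

    def free(self, block):
        # Keep the block in the cache; reserved memory does not shrink.
        self.allocated -= block
        self.free_blocks.append(block)


alloc = ToyCachingAllocator()
weights = alloc.malloc(1000)  # long-lived, like model weights
scratch = alloc.malloc(200)   # short-lived, like a temporary buffer
alloc.free(scratch)
print(alloc.allocated, alloc.reserved)  # 1000 1200
reused = alloc.malloc(150)    # served from the cached 200-byte block
print(alloc.allocated, alloc.reserved)  # 1200 1200
```

After `scratch` is freed, allocated memory drops while reserved memory stays the same, and the next request is served from the cache without growing the reservation.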

## Examples
If you're using a transformer model for text generation:

- Allocated Memory:
    - Model weights for transformer layers.
    - Activations that are currently alive (e.g., attention outputs, hidden states).
- Unused Reserved Memory:
    - Cached blocks that earlier held temporary buffers, such as intermediate results of the matrix multiplications in attention.
    - Cached blocks left over from operations like softmax and layer normalization once their outputs have been freed.


In [19]:
# Check the allocated and reserved memory (bytes converted with 1024 ** 2)
print(f"Allocated memory: {torch.cuda.memory_allocated() / (1024 ** 2):.1f} MB")
print(f"Reserved memory: {torch.cuda.memory_reserved() / (1024 ** 2):.1f} MB")

Allocated memory: 7296.5 MB
Reserved memory: 7334.0 MB
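
The gap between the two figures is the unused reserved memory described earlier. The sketch below computes it from the values printed above; the helper name `unused_reserved_mib` is an illustrative assumption, and the inputs are the rounded numbers from this run rather than live measurements.

```python
def unused_reserved_mib(allocated_bytes, reserved_bytes):
    """Return reserved-but-unallocated memory, in units of 1024 ** 2 bytes."""
    return (reserved_bytes - allocated_bytes) / (1024 ** 2)


# Values taken from the output above, converted back to bytes
allocated = 7296.5 * 1024 ** 2
reserved = 7334.0 * 1024 ** 2
print(f"Unused reserved memory: {unused_reserved_mib(allocated, reserved):.1f} MB")
# Unused reserved memory: 37.5 MB
```

In a live session, the same helper could be called as `unused_reserved_mib(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())`.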
