# About this notebook


In this notebook you will load `tiiuae/falcon-7b` from the Hugging Face Hub with 4-bit quantization and measure how much GPU memory the loaded model requires.


# Imports

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [5]:

model_id = "tiiuae/falcon-7b"

# Ensure CUDA is available
if torch.cuda.is_available():
    # Reset peak memory statistics
    torch.cuda.reset_peak_memory_stats()

    # Capture initial GPU memory usage
    device = torch.device("cuda")
    initial_memory = torch.cuda.memory_allocated(device)

    # BitsAndBytes configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,  # To load in 8-bit instead, set this to True and load_in_4bit to False
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="fp4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    # Load tokenizer and model with BnB configuration
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

    # Capture GPU memory usage after loading the model
    final_memory = torch.cuda.memory_allocated(device)  # Bytes currently allocated
    peak_memory = torch.cuda.max_memory_allocated(device)  # Peak bytes allocated during loading

    # Calculate the difference (all values are still in bytes, so the units match)
    memory_difference = final_memory - initial_memory

    # Convert bytes to GB (1024**3 bytes) only when printing
    bytes_per_gb = 1024**3
    print(f"Initial GPU Memory Usage: {initial_memory / bytes_per_gb} GB")
    print(f"Final GPU Memory Usage: {final_memory / bytes_per_gb} GB")
    print(f"Memory Difference (Model Load Impact): {memory_difference / bytes_per_gb} GB")
    print(f"Peak GPU Memory Usage: {peak_memory / bytes_per_gb} GB")
else:
    print("CUDA is not available. Please check your PyTorch and GPU setup.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


pytorch_model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

Initial GPU Memory Usage: 0.0 GB
Final GPU Memory Usage: 4.094881057739258 GB
Memory Difference (Model Load Impact): 4.094881057739258 GB
Peak GPU Memory Usage: 4.630022048950195 GB
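
As a sanity check on the numbers above, the following is a rough back-of-envelope sketch of how much memory the weights alone would need at different precisions. The helper `estimate_weight_memory_gib` and the constant `NOMINAL_PARAMS` are hypothetical names introduced here for illustration, and the parameter count is an assumption taken from the "7b" in the model name rather than the exact count. The estimate ignores quantization metadata and any layers that stay unquantized, so it should be read as a lower bound, not a prediction.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Rough lower bound on weight storage in GiB (1024**3 bytes).

    Ignores quantization constants, unquantized layers, and activations.
    """
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: a nominal 7e9 parameters, read off the "7b" in the model name
NOMINAL_PARAMS = 7e9

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    estimate = estimate_weight_memory_gib(NOMINAL_PARAMS, bits)
    print(f"{label:>6}: {estimate:.2f} GiB")

# Values copied from the cell output above
measured_final_gib = 4.094881057739258
measured_peak_gib = 4.630022048950195

# Extra memory that was needed only temporarily while loading
load_overhead_gib = measured_peak_gib - measured_final_gib
print(f"Transient load overhead (peak - final): {load_overhead_gib:.2f} GiB")
```

Under these assumptions the 4-bit estimate comes out somewhat below the measured 4.09 GB. A plausible explanation is that some parts of the model are typically kept in higher precision and that quantization adds its own metadata, but this notebook does not verify that breakdown.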
