bitsandbytes (bnb) is a CUDA-backed library that provides k-bit quantization (8-bit, and 4-bit in NF4 or FP4 format) and memory-efficient 8-bit optimizers, so large models can be run or fine-tuned with far less GPU memory.
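
To make "k-bit quantization" concrete, here is a minimal pure-Python sketch of absmax quantization to 8-bit integers. It only illustrates the basic idea (scale by the largest magnitude, round, store small integers); bitsandbytes itself works on blocks of a tensor and adds further refinements, and the helper names below are hypothetical rather than part of its API.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using one absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0  # avoid dividing by zero
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.02, -0.75, 0.31, 1.27, -0.08]  # toy weights, not from a real model
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max abs error:", max_error)
```

Each weight is now stored as one byte plus a shared scale, instead of two or four bytes per value.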

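Quantization reduces a model's memory footprint, so you can run larger models on smaller GPUs or fine-tune big models without holding full-precision weights and optimizer state in memory. Inference can also get faster when memory bandwidth is the bottleneck, although that depends on the hardware and kernels. For 4-bit, NF4 is often recommended because its quantization levels are designed for normally distributed weights, which tends to preserve quality better than FP4.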
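
A quick back-of-the-envelope calculation shows the scale of the savings. This sketch assumes a nominal 7 billion parameters and counts weights only; it ignores quantization constants, activations, and the KV cache, so real usage will be somewhat higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: nominal size of a "7B" model

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the bits per weight halves the weight memory, which is what lets a 7B model fit on a single consumer-grade GPU.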

In [3]:
!pip install --upgrade pip
!pip install bitsandbytes transformers accelerate

Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.48.2-py3-none-manylinux_2_24_x86_64.whl (59.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.48.2


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [5]:
from huggingface_hub import notebook_login
notebook_login()

In [6]:
MODEL = "meta-llama/Llama-2-7b-chat-hf"

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, load_in_8bit=True, device_map="auto")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


In [8]:
prompt = "Write a concise explanation of reinforcement learning."

# Tokenize the prompt and move the tensors to the device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

In [9]:
# Generate at most 150 new tokens; longer answers are cut off mid-sentence
out = model.generate(**inputs, max_new_tokens=150)

In [10]:
print(tokenizer.decode(out[0], skip_special_tokens=True))

Write a concise explanation of reinforcement learning.
Reinforcement learning is a subfield of machine learning that focuses on training agents to make decisions in complex, uncertain environments. In reinforcement learning, an agent learns to make decisions by interacting with its environment and receiving feedback in the form of rewards or penalties. The goal of the agent is to learn a policy that maximizes the cumulative reward over time.
Reinforcement learning algorithms typically use trial and error to learn from experience, and they can be applied to a wide range of problems, including robotics, game playing, and autonomous driving. Some of the key challenges in reinforcement learning include dealing with partial observability (the agent only has access to a partial
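
The 8-bit load above passes `load_in_8bit=True` directly, which triggers the deprecation warning shown in the cell output. The fragment below is a sketch of the replacement the warning asks for: the same request expressed through a `BitsAndBytesConfig` object passed as `quantization_config`. The variable name `bnb_8bit_config` is only illustrative, and the fragment needs a CUDA GPU and access to the gated checkpoint to actually run.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # same checkpoint as in the cells above

# Equivalent of load_in_8bit=True, expressed through quantization_config
bnb_8bit_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=bnb_8bit_config,
    device_map="auto",
)
```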


## **`4-bit Compression`**

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

In [12]:
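bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",             # "nf4" or "fp4"
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the per-block quantization constants
)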
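
To give some intuition for the `"nf4"` setting, the sketch below quantizes a block of weights against a 16-level codebook whose levels sit at evenly spaced quantiles of a standard normal distribution. This only mirrors the idea behind NF4; the actual NF4 table in bitsandbytes is constructed differently in detail (for instance, it includes an exact zero), and all function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_codebook(num_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled into [-1, 1]."""
    nd = NormalDist()
    # Offset by 0.5 so the 0 and 1 quantiles (which are infinite) are never requested.
    probs = [(i + 0.5) / num_levels for i in range(num_levels)]
    levels = [nd.inv_cdf(p) for p in probs]
    top = max(abs(level) for level in levels)
    return [level / top for level in levels]


def quantize_block(block, codebook):
    """Return 4-bit indices plus the block's absmax constant."""
    absmax = max(abs(v) for v in block) or 1.0  # fall back to 1.0 for an all-zero block
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v / absmax))
        for v in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Look up each index in the codebook and rescale by the absmax constant."""
    return [codebook[i] * absmax for i in indices]


codebook = normal_quantile_codebook()
block = [0.1, -0.4, 0.25, 0.8]  # toy weights, not from a real model
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)
print("indices:", indices)
print("restored:", restored)
```

Because the levels are denser near zero, where most normally distributed weights fall, small weights are represented more precisely than they would be with evenly spaced levels.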

In [13]:
bnb_config

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
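
The `bnb_4bit_use_double_quant` flag in this config matters because every block of 4-bit weights also needs a quantization constant, and those constants take memory too. The arithmetic below estimates that overhead with and without double quantization. The block sizes and constant widths are the figures described in the QLoRA paper and are assumed here, not read from the installed library.

```python
WEIGHT_BLOCK = 64     # assumed: weights sharing one quantization constant
CONSTANT_BLOCK = 256  # assumed: first-level constants sharing one second-level constant


def overhead_bits_per_param(double_quant):
    """Extra bits per weight spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights
        return 32 / WEIGHT_BLOCK
    # 8-bit first-level constants, plus a 32-bit constant per group of them
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


plain = overhead_bits_per_param(False)
nested = overhead_bits_per_param(True)

NUM_PARAMS = 7_000_000_000  # assumption: nominal size of a "7B" model
saved_gib = (plain - nested) * NUM_PARAMS / 8 / 1024**3

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {nested:.3f} bits/param")
print(f"approximate saving for 7B parameters: {saved_gib:.2f} GiB")
```

Under these assumptions the saving is a few hundred megabytes on a 7B model, which is small next to the weights themselves but can matter on a GPU that is already nearly full.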

In [14]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", quantization_config=bnb_config)

In [15]:
prompt = "Write a concise explanation of reinforcement learning."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

In [16]:
out = model.generate(**inputs, max_new_tokens=150)

In [17]:
print(tokenizer.decode(out[0], skip_special_tokens=True))

Write a concise explanation of reinforcement learning. Hinweis: Please provide a clear and concise explanation of reinforcement learning, including its key components and the main difference between reinforcement learning and other machine learning paradigms.
Reinforcement learning (RL) is a subfield of machine learning that focuses on training agents to make decisions in complex, uncertain environments. Unlike other machine learning paradigms, such as supervised and unsupervised learning, RL involves learning from feedback received through trial and error.
The key components of RL are:
Agent: The RL agent is the decision-making entity that interacts with the environment.
Environment: The environment is the external world that the agent interacts with.
Actions: The agent
