# Models

Looking at the lower level API of Transformers - the models that wrap PyTorch code for the transformers themselves.

This notebook can run on a low-cost or free T4 runtime.


## One more reminder

**Pro-tip:**

In the middle of running a Colab, you might get an error like this:

> Runtime error: CUDA is required but not available for bitsandbytes. Please consider installing [...]

This is a super-misleading error message! Please don't try changing versions of packages...

This actually happens because Google has switched out your Colab runtime, perhaps because Google Colab was too busy. The solution is:

1. Kernel menu >> Disconnect and delete runtime
2. Reload the colab from fresh and Edit menu >> Clear All Outputs
3. Connect to a new T4 using the button at the top right
4. Select "View resources" from the menu on the top right to confirm you have a GPU
5. Rerun the cells in the colab, from the top down, starting with the pip installs

And all should work great - otherwise, ask me!
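Step 4 can also be checked from a code cell. The sketch below is one way to do it, assuming the `nvidia-smi` command-line tool is on the path whenever a GPU runtime is attached; the helper name `gpu_summary` is made up for this example.

```python
import shutil
import subprocess

def gpu_summary():
    """Return a short description of the visible NVIDIA GPU(s), or None if there is none."""
    if shutil.which("nvidia-smi") is None:
        return None  # No NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # The tool exists but cannot see a usable GPU
    return result.stdout.strip() or None

summary = gpu_summary()
print(summary if summary else "No GPU visible - reconnect to a T4 runtime")
```

If this prints the "No GPU visible" message, the runtime has most likely been switched out and the steps above apply.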

In [None]:
# !pip install -q --upgrade bitsandbytes accelerate

In [21]:
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

# Sign in to Hugging Face

1. If you haven't already done so, create a free HuggingFace account at https://huggingface.co and navigate to Settings, then Create a new API token, giving yourself write permissions by clicking on the WRITE tab

2. Press the "key" icon on the side panel to the left, and add a new secret:
`HF_TOKEN = your_token`

3. Execute the cell below to log in.

In [4]:
import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv(override=True)
hf_token = os.getenv('HF_TOKEN')

if hf_token and hf_token.startswith("hf_"):
  print("HF key looks good so far")
else:
  print("HF key is not set - please click the key in the left sidebar")
login(hf_token, add_to_git_credential=False)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


HF key looks good so far


### Accessing Llama

Yesterday you should have received approval to use this model:

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

You can either use that today, or, for faster results, get approval for the smaller Llama 3.2 family too.

https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

Select this link to see if you need to request approval too. Pick the version of Llama that you want below by commenting out one of these! Or skip Llama altogether.

In [13]:
# instruct models and 1 reasoning model

# Llama 3.1 is larger and you should already be approved
# see here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

# LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Llama 3.2 is smaller but you might need to request access again
# see here: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

LLAMA = "meta-llama/Llama-3.2-3B-Instruct"

PHI = "microsoft/Phi-4-mini-instruct"
GEMMA = "google/gemma-3-270m-it"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

In [14]:
messages = [
    {"role": "user", "content": "Tell a joke for a room of Data Scientists"}
  ]

# Accessing Llama 3.1 from Meta

In order to use the fantastic Llama 3.1, Meta does require you to sign their terms of service.

Visit their model instructions page in Hugging Face:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

At the top of the page are instructions on how to agree to their terms. If possible, you should use the same email as your huggingface account.

In my experience approval comes in a couple of minutes. Once you've been approved for any 3.1 model, it applies to the whole family of models.

If you have any problems accessing Llama, please see this colab, including some suggestions if you don't get approved by Meta for any reason.

https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8

In [17]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

The code above creates a configuration object for **4-bit quantization** using the **`bitsandbytes`** library, which lets you load and run large language models (LLMs) with significantly reduced memory use.

This kind of quantization is especially useful on GPUs with limited VRAM (for example, 16 GB or less), because it makes it possible to run models such as Llama-3-8B or Mistral-7B on limited hardware, or even Llama-3-70B on far less hardware than its 16-bit version would need (about 35 GB of weights instead of about 140 GB).

---

## Code breakdown

```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
```

### 1. `load_in_4bit=True`
- **What it does**: Loads the model with **4-bit quantization**, so each weight is stored in 4 bits instead of 16 or 32.
- **Why use it**: It cuts memory use drastically (about 75% less than FP16). For example, the weights of a 7B-parameter model take about 14 GB in FP16 but only about 3.5 GB at 4 bits, plus a small overhead for quantization constants and any layers left unquantized.
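The arithmetic behind these figures is simple enough to check by hand. Here is a minimal sketch that counts the weights alone, ignoring quantization constants, activations and any layers that stay in 16-bit; `weight_memory_gb` is a made-up helper name.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a 7B-parameter model
for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

Going from 16 bits to 4 bits divides the weight storage by four, which is where the "about 75% less" figure comes from.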

---

### 2. `bnb_4bit_use_double_quant=True`
- **What it does**: Applies **double quantization**:
  - First, the model weights are quantized to 4 bits.
  - Then the **scaling factors** (quantization constants) used in that step are themselves quantized to **8 bits**.
- **Why use it**: Each block of weights carries a scaling factor that is needed to dequantize it during inference, and these factors are normally stored at higher precision (32-bit in the QLoRA paper). Quantizing them to 8 bits saves roughly 0.4 bits per parameter according to that paper (about 0.4 GB for an 8B model, about 3 GB for a 65B model), with **minimal loss of precision**.
- **Impact**: Better memory efficiency without significantly degrading model quality.
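To get a feel for the size of this saving, here is a rough sketch of the overhead arithmetic. The block sizes and bit widths are assumptions taken from the QLoRA paper (one 32-bit constant per 64 weights, then 8-bit constants with one 32-bit constant per 256 of them); they are not read from `bitsandbytes` itself, and `constant_overhead_bits` is a made-up helper.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per weight spent on quantization constants (QLoRA paper settings assumed)."""
    if not double_quant:
        return 32 / block_size  # one 32-bit constant per block of weights
    # 8-bit constant per block, plus one 32-bit constant per group of second_block_size constants
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
saving_bits = plain - double
print(f"Overhead: {plain:.3f} -> {double:.3f} bits per weight")
for n_params in (8e9, 65e9):
    saved_gb = saving_bits * n_params / 8 / 1e9
    print(f"{n_params / 1e9:.0f}B parameters: about {saved_gb:.2f} GB saved")
```

Under these assumptions the saving scales linearly with parameter count, so it matters most for the largest models.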

---

### 3. `bnb_4bit_compute_dtype=torch.bfloat16`
- **What it does**: Sets the **data type used for computation** (matmuls, activations, and so on) **after the weights are temporarily dequantized**.
- **Common options**:
  - `torch.float32`: Highest precision, but slow and memory-hungry.
  - `torch.float16`: Well supported on NVIDIA GPUs (Volta and later), but its narrow range can cause overflow or underflow in large models.
  - `torch.bfloat16` (**recommended on A100/H100/L4**): Same range as FP32 but with less precision; well suited to language models, avoids underflow, and is fast on modern GPUs.
- **Why use it**: It keeps inference **numerically stable** while still benefiting from hardware acceleration.  
  → **`bfloat16` is ideal if your GPU supports it** (NVIDIA Ampere or later, such as A100, RTX 3090/4090, L4, etc.).
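The range-versus-precision trade-off between these types follows directly from how many bits each one gives to the exponent and the mantissa. A small sketch in plain Python, assuming the usual layouts (float32: 8 exponent and 23 mantissa bits, float16: 5 and 10, bfloat16: 8 and 7); `float_format_stats` is a made-up helper.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Largest finite value, smallest positive normal value, and spacing just above 1.0."""
    bias = 2 ** (exponent_bits - 1) - 1
    largest = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mantissa_bits
    return largest, smallest_normal, epsilon

formats = {"float32": (8, 23), "float16": (5, 10), "bfloat16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    largest, smallest, eps = float_format_stats(e_bits, m_bits)
    print(f"{name:>8}: max={largest:.3e}  min normal={smallest:.3e}  epsilon={eps:.3e}")
```

bfloat16 reaches almost as far as float32 in both directions but has a coarser step between neighbouring values, while float16 has finer steps and a much narrower range.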

---

### 4. `bnb_4bit_quant_type="nf4"`
- **What it does**: Selects the **4-bit quantization scheme**.
- **Options**:
  - `"fp4"`: 4-bit quantization using a standard floating-point format.
  - `"nf4"` (**4-bit NormalFloat**): Designed specifically for **neural network weights**, which tend to follow a zero-centered normal distribution.
- **Why use `"nf4"`**:
  - Transformer weights are typically distributed approximately as \(\mathcal{N}(0, \sigma^2)\).
  - `"nf4"` places **more quantization levels near 0**, where most weights are concentrated.
  - This lowers the **quantization error** compared with `"fp4"`, improving the accuracy of the quantized model.
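The idea of putting more levels where the weights are can be illustrated with quantiles of a normal distribution. This is a simplified sketch of the principle only, not the actual NF4 code book used by `bitsandbytes`; `normal_quantile_levels` is a made-up helper.

```python
from statistics import NormalDist

def normal_quantile_levels(n_levels=16):
    """Place levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Use bin midpoints so we never ask for the quantile at 0 or 1 (which is infinite)
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    raw = [nd.inv_cdf(p) for p in probs]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

levels = normal_quantile_levels()
inner_gap = levels[8] - levels[7]    # spacing between the two levels closest to zero
outer_gap = levels[15] - levels[14]  # spacing between the two largest levels
print([round(v, 3) for v in levels])
print(f"gap near zero: {inner_gap:.3f}, gap at the edge: {outer_gap:.3f}")
```

The levels are packed tightly around zero and spread out toward the edges, so the many small weights are rounded less coarsely than the few large ones.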

---

## Additional `BitsAndBytesConfig` parameters (less common)

They are not used in this example, but other useful parameters include:

| Parameter | Description |
|--------|------------|
| `load_in_8bit` | Alternative: 8-bit quantization (more precise, but uses more memory). |
| `bnb_4bit_quant_storage` | Data type in which the **quantized weights are stored** (`uint8` by default). Rarely changed. |
| `llm_int8_threshold` | (8-bit only) Outlier threshold: activation values above it are handled in 16-bit rather than int8. Defaults to `6.0`. |
| `llm_int8_has_fp16_weight` | (8-bit only) Keeps 16-bit main weights alongside the int8 computation, which is useful for fine-tuning. |

---

## How is this config used?

Pass it to the model loader:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",  # Places layers on GPU/CPU according to available memory
    # trust_remote_code=True is only needed for models that ship custom code; Llama does not
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
```

---

## Advantages of this configuration

| Benefit | Explanation |
|--------|-----------|
| **Lower VRAM use** | Makes it possible to run large models on consumer GPUs. |
| **Reasonable quality retained** | `"nf4"` plus double quantization minimizes the loss of precision. |
| **Hugging Face compatibility** | Native integration in `transformers`. |
| **Accelerated inference** | Uses optimized `bitsandbytes` CUDA kernels. |

---

## Limitations

- **Not suited to full fine-tuning**: Use it for inference or for techniques such as QLoRA.
- **Requires CUDA**: `bitsandbytes` does not support quantization on CPU (an unquantized model can still be loaded on CPU with `device_map="cpu"`, but it will be slow).
- **Reduced precision**: Not ideal for tasks that are very sensitive to numerical error.

---

## Key references

- Original **QLoRA** paper (Dettmers et al., 2023): introduces `nf4` and double quantization.  
- Hugging Face documentation: [4-bit quantization](https://huggingface.co/docs/transformers/main/en/quantization)
- `bitsandbytes` docs: [https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

If the next cell gives you a 403 permissions error, then please check:
1. Are you logged in to HuggingFace? Try running `login()` to check your key works
2. Did you set up your API key with full read and write permissions?
3. If you visit the Llama3.1 page at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B, does it show that you have access to the model near the top?

And work through my Llama troubleshooting colab:

https://colab.research.google.com/drive/1deJO03YZTXUwcq2vzxWbiBhrRuI29Vo8


In [36]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", return_attention_mask = True).to("cuda")

```python
tokenizer.pad_token = tokenizer.eos_token
```

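Setting the pad token to the EOS token is a **common practice with certain language-model tokenizers**, especially those for models trained **for generation only** (rather than for classification or other tasks that batch sequences of different lengths), such as **Llama or Mistral**. These tokenizers ship without a padding token, so the end-of-sequence token is reused whenever padding is needed.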

---

### What is `eos_token`?

- **`eos_token`** = **End-Of-Sequence token**.
- It is a special token the model learns during training to mark **where a sequence of text ends**.
- For example, the `eos_token` is `</s>` in Llama 2 and `<|endoftext|>` in GPT-2; Llama 3 chat models end each turn with `<|eot_id|>`, as the decoded output later in this notebook shows.
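To see what a pad token is for, here is a minimal sketch of padding a batch of token-id lists to the same length and building the matching attention mask. The token ids are made up and `pad_batch` is a hypothetical helper; Hugging Face tokenizers do the equivalent internally when asked to pad.

```python
def pad_batch(sequences, pad_id):
    """Right-pad lists of token ids to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

EOS_ID = 2  # made-up id standing in for tokenizer.eos_token_id
batch = [[5, 17, 42], [8, 3]]
input_ids, attention_mask = pad_batch(batch, pad_id=EOS_ID)
print(input_ids)       # [[5, 17, 42], [8, 3, 2]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask is what tells the model to ignore the padded positions, which is why reusing the EOS id as padding is harmless as long as the mask is passed along.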

In [37]:
inputs

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,   4723,    220,   2366,     20,    271, 128009, 128006,
            882, 128007,    271,  41551,    264,  22380,    369,    264,   3130,
            315,   2956,  57116, 128009]], device='cuda:0')

In [38]:
# The model
model = AutoModelForCausalLM.from_pretrained(LLAMA,
                                             device_map="auto",
                                             quantization_config=quant_config)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [39]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 2,197.6 MB
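This figure can be reproduced roughly by hand from the layer shapes that the `model` printout further down reports for Llama 3.2 3B. The sketch below assumes that the `Linear4bit` layers cost half a byte per weight, that the embedding and norm weights stay in 16-bit, that the LM head shares the embedding matrix (so it is not counted twice), and that the quantization constants are not included in the reported number.

```python
HIDDEN, KV_DIM, MLP_DIM = 3072, 1024, 8192
VOCAB, N_LAYERS = 128256, 28

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                             # gate_proj, up_proj, down_proj
linear_params = N_LAYERS * (attention + mlp)           # stored as Linear4bit
embedding_params = VOCAB * HIDDEN                      # assumed to stay in 16-bit
norm_params = N_LAYERS * 2 * HIDDEN + HIDDEN           # two RMSNorms per layer plus the final norm

total_bytes = linear_params * 0.5 + (embedding_params + norm_params) * 2
print(f"Estimated footprint: {total_bytes / 1e6:,.1f} MB")  # Estimated footprint: 2,197.6 MB
```

Under these assumptions, roughly a third of the footprint is the unquantized embedding table, which is why the total is well above "number of parameters times half a byte".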


## Looking under the hood at the Transformer model

The next cell prints the HuggingFace `model` object for Llama.

This model object is a Neural Network, implemented with the Python framework PyTorch. The Neural Network uses the architecture invented by Google scientists in 2017: the Transformer architecture.

While we're not going to go deep into the theory, this is an opportunity to get some intuition for what the Transformer actually is.

If you're completely new to Neural Networks, check out my [YouTube intro playlist](https://www.youtube.com/playlist?list=PLWHe-9GP9SMMdl6SLaovUQF2abiLGbMjs) for the foundations.

Now take a look at the layers of the Neural Network that get printed in the next cell. Look out for this:

- It consists of layers
- There's something called "embedding" - this takes tokens and turns them into 3,072 dimensional vectors (4,096 for Llama 3.1 8B). We'll learn more about this in Week 5.
- There are then 28 groups of layers (32 for Llama 3.1 8B) called "Decoder layers". Each Decoder layer contains three types of layer: (a) self-attention layers (b) multi-layer perceptron (MLP) layers (c) normalization layers (RMSNorm).
- There is an LM Head layer at the end; this produces the output

Notice the mention that the model has been quantized to 4 bits.

It's not required to go any deeper into the theory at this point, but if you'd like to, I've asked our mutual friend to take this printout and make a tutorial to walk through each layer. This also looks at the dimensions at each point. If you're interested, work through this tutorial after running the next cell:

https://chatgpt.com/canvas/shared/680cbea6de688191a20f350a2293c76b

In [40]:
# Execute this cell and look at what gets printed; investigate the layers
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,)

### And if you want to go even deeper into Transformers

In addition to looking at each of the layers in the model, you can actually look at the HuggingFace code that implements Llama using PyTorch.

Here is the HuggingFace Transformers repo:  
https://github.com/huggingface/transformers

And within this, here is the code for Llama 4:  
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama4/modeling_llama4.py

Obviously it's not necessary at all to get into this detail - the job of an AI engineer is to select, optimize, fine-tune and apply LLMs rather than to code a transformer in PyTorch. OpenAI, Meta and other frontier labs spent millions building and training these models. But it's a fascinating rabbit hole if you're interested!

In [41]:
# OK, with that, now let's run the model!
model.config.pad_token_id = model.config.eos_token_id
outputs = model.generate(inputs, max_new_tokens=80)
outputs[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


tensor([128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
            25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
           220,   1627,   4723,    220,   2366,     20,    271, 128009, 128006,
           882, 128007,    271,  41551,    264,  22380,    369,    264,   3130,
           315,   2956,  57116, 128009, 128006,  78191, 128007,    271,  10445,
          1550,    279,  10550,    733,    311,  15419,   1980,  18433,    433,
           574,  20558,    311,   1505,   1202,  26670,    449,   8903,     13,
        128009], device='cuda:0')

In [42]:
# Well that doesn't make much sense!
# How about this..

tokenizer.decode(outputs[0])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Nov 2025\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell a joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhy did the dataset go to therapy?\n\nBecause it was struggling to find its correlation with reality.<|eot_id|>'

In [43]:
# Clean up memory
# Thank you Kuan L. for helping me get this to properly free up memory!
# If you select "Show Resources" on the top right to see GPU memory, it might not drop down right away
# But it does seem that the memory is available for use by new models in the later code.

del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.empty_cache()

## A couple of quick notes on the next block of code:

I'm using a HuggingFace utility called TextStreamer so that results stream back.
To stream results, we simply replace:  
`outputs = model.generate(inputs, max_new_tokens=80)`  
With:  
`streamer = TextStreamer(tokenizer)`  
`outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)`

Also I've added the argument `add_generation_prompt=True` to my call to create the Chat template. This ensures that the model generates a response to the question, instead of just predicting how the user prompt continues. Try experimenting with setting this to False to see what happens. You can read about this argument here:

https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts

Thank you to student Piotr B for raising the issue!
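To make the effect of `add_generation_prompt` concrete, here is a simplified imitation of the Llama 3 chat layout seen in the decoded output earlier. It is a sketch only: the real template lives in the tokenizer and also inserts a default system block, and `render_llama3_chat` is a made-up helper.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified imitation of the Llama 3 chat layout (no default system block)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model answers rather than continuing the user's text
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

demo_messages = [{"role": "user", "content": "Tell a joke for a room of Data Scientists"}]
print(render_llama3_chat(demo_messages, add_generation_prompt=False))
print("---")
print(render_llama3_chat(demo_messages, add_generation_prompt=True))
```

With the flag set, the prompt already ends with an open assistant header, so the first generated token is the start of the answer.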

In [46]:
# Wrapping everything in a function - and adding Streaming and generation prompts

def generate(model_name, messages, quant=True, max_new_tokens=80):
  device = "cuda" if torch.cuda.is_available() else "cpu"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  tokenizer.pad_token = tokenizer.eos_token
  input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
  # A single unpadded prompt, so every position is a real token
  attention_mask = torch.ones_like(input_ids, dtype=torch.long, device=device)
  streamer = TextStreamer(tokenizer)
  if quant:
    # device_map places the quantized weights at load time, as in the earlier cell
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quant_config)
  else:
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
  model.config.pad_token_id = model.config.eos_token_id
  # The streamer prints tokens as they are generated, so the return value is not needed
  model.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=max_new_tokens, streamer=streamer)

In [47]:
generate(LLAMA, messages)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Nov 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's one:

Why did the regression model go to therapy?

Because it was struggling to predict its emotions!

(Get it? Predict... like in regression analysis... ahh, data puns)

Or how about this one:

Why did the data scientist break up with his girlfriend?

Because he realized he was in a linear relationship, but the correlation was just not strong enough!

(I know, I know


## Accessing Gemma from Google

A student let me know (thank you, Alex K!) that Google also now requires you to accept their terms in HuggingFace before you use Gemma.

Please visit their model page at this link and confirm you're OK with their terms, so that you're granted access.

https://huggingface.co/google/gemma-3-270m-it

In [None]:
messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]
generate(GEMMA, messages, quant=False)

In [None]:
generate(QWEN, messages)

In [None]:
generate(DEEPSEEK, messages, quant=False, max_new_tokens=500)