## Models

- Use the low-level Transformers API to access model classes (they wrap the underlying PyTorch code).  
- We can run this notebook on a free or low-cost T4 GPU runtime.  


In [1]:
!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0


In [2]:
import gc    # garbage collector
import torch
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig

## 1- Hugging Face API Token

1. Go to https://huggingface.co and **sign up** or log in.  
2. Open **Settings → Access Tokens** and click **Create new token**.  
3. Under **Permissions**, select **Read & Write**, then **Generate** and copy the token.  
4. In Colab, click the key icon in the left side panel and add a secret:  
   ```bash
   HF_TOKEN=<your_token>
   ```


In [3]:
hf_token = userdata.get("HF_TOKEN")
login(hf_token)

In [4]:
# Instruct models
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct"
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

In [5]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of AI Engineer"}
]

## 2- Quantization

- **What?**  
  Reduces the bit-width of model weights (typically 16- or 32-bit floats) to smaller sizes (8-bit, 4-bit).  

  Accuracy does suffer, but usually by less than we might expect.
- **Why?**  
  • Saves GPU/CPU memory  
  • Can speed up inference, depending on the hardware and kernels  
  • Enables running large models on smaller hardware  
- **How?**  
  Load the model with a quantization config instead of full precision.
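
To make the "What?" concrete, here is a minimal sketch of block-wise absmax quantization to 4-bit integer codes in plain Python. It is a toy illustration under stated assumptions, not what bitsandbytes does internally (NF4 uses a non-uniform set of levels); the function names and the random weight block are hypothetical.

```python
import random

def quantize_block(values, bits=4):
    """Map floats to signed integer codes that fit in `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                           # 7 for 4-bit codes
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]           # integers in [-qmax, qmax]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]   # hypothetical weight block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}  max abs error={max_error:.5f}")
print(f"storage: {len(weights) * 32 // 8} bytes as float32 vs "
      f"{len(weights) * 4 // 8} bytes as 4-bit codes (plus one scale)")
```

The rounding error per weight is at most half a quantization step (`scale / 2`), which is why accuracy degrades gradually rather than collapsing.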

  Now, we access Llama 3.1 from Meta.



### Special Tokens: EOS & PAD

- **`eos_token`**  
  Marks the **end of a sequence** (where generation stops).

- **`pad_token`**  
  Fills shorter sequences up to a fixed length for batching.  
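  Setting it to `eos_token` gives the tokenizer a padding token when the model does not define one. Because padding then looks identical to end-of-sequence, pass an `attention_mask` as well so the model can tell the two apart.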
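
As a rough illustration of why the attention mask matters when the pad token equals the EOS token, the sketch below pads a small batch by hand. The helper `pad_batch` and the prompt token IDs are hypothetical; `128001` is the EOS id that appears in the generation warning later in this notebook. Left padding is assumed here because decoder-only models continue from the last position.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

EOS_ID = 128001  # reused as the pad id, as in the notebook

# Hypothetical token ids; the first sequence really does end with EOS
batch = [[11, 12, 13, EOS_ID], [21, 22]]
input_ids, attention_mask = pad_batch(batch, pad_id=EOS_ID)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

In the first row the EOS token is real (mask 1); in the second row the same id is only padding (mask 0). Without the mask, the two cases cannot be distinguished.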


In [6]:
# 1- Quantization config: load the model with 4-bit weights so it takes up less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    # Double-quantize weights for extra memory savings with minimal accuracy loss
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    # Use 4-bit NormalFloat (NF4) quantization, a data type designed for normally distributed weights.
    # "NF" stands for NormalFloat
    bnb_4bit_quant_type="nf4"
)

# 2- Tokenizer

# Create a tokenizer for Llama
tokenizer = AutoTokenizer.from_pretrained(LLAMA)
# Use EOS as pad token to fill prompts and avoid padding warnings
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

# 3- The model

# Loads an autoregressive (causal) LLM that predicts the next token based on previous tokens
model = AutoModelForCausalLM.from_pretrained(
    LLAMA,                            # Pre-trained model ID or path
    device_map="auto",                # Automatically place model layers on available devices (GPU/CPU)
    quantization_config=quant_config  # Apply specified quantization settings for smaller memory footprint
)

# Check how much memory the model uses
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 5,591.5 MB
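
As a sanity check, the reported footprint can be approximately reconstructed from the layer sizes that appear in the printed architecture further down. This is a back-of-the-envelope sketch under several assumptions that the notebook does not confirm: only the `Linear4bit` layers are stored at 4 bits, the embeddings, `lm_head`, and norm weights stay in 16-bit, `lm_head` has the same shape as `embed_tokens`, and the quantization constants are not counted.

```python
HIDDEN = 4096          # model width
VOCAB = 128256         # vocabulary size
INTERMEDIATE = 14336   # MLP inner width
KV_DIM = 1024          # output width of k_proj / v_proj
N_LAYERS = 32

# Parameters in the 4-bit linear layers of one decoder block
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q, o + k, v
mlp_params = 3 * HIDDEN * INTERMEDIATE                    # gate, up, down
linear_4bit_params = N_LAYERS * (attn_params + mlp_params)

# Parameters assumed to stay in 16-bit
embed_params = 2 * VOCAB * HIDDEN            # embed_tokens + lm_head (assumed untied)
norm_params = (2 * N_LAYERS + 1) * HIDDEN    # two norms per block + final norm

total_bytes = linear_4bit_params // 2 + (embed_params + norm_params) * 2
print(f"Estimated footprint: {total_bytes / 1e6:,.1f} MB")
```

Under these assumptions the estimate comes out at about 5,591.5 MB, in line with the figure reported above. Roughly 2.1 GB of that is the 16-bit embedding and output matrices, which 4-bit quantization leaves untouched.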


## 3- Model Architecture Overview

- **Embedding Layer**  
  Converts input token IDs into dense vectors (`embed_tokens`).

- **Decoder Blocks** (repeated N times; N = 32 for Llama 3.1 8B)  
  Each block contains:  
  1. **Self-Attention** (`LlamaAttention`):  
     Each token attends to itself and all previous tokens (causal mask).  
  2. **Feed-Forward Network** (`LlamaMLP`):  
     A gated MLP: three linear layers (`gate_proj`, `up_proj`, `down_proj`) with a SiLU activation on the gate branch.  
  3. **Residual + RMSNorm** (`LlamaRMSNorm`):  
     The input to each sub-layer is normalized, and a skip connection adds the sub-layer's output back, which stabilizes training and preserves signal.

- **Final RMSNorm**  
  One more normalization on the last hidden state.

- **Language Modeling Head** (`lm_head`)  
  Projects hidden states back to vocabulary size to produce logits for next‐token prediction.
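
The data flow through one decoder block can be sketched in plain Python for a single token vector. This is a simplified illustration, not the Transformers implementation: the attention step is replaced by an identity placeholder, the sizes are toy values, and all names (`decoder_block`, `params`, and so on) are hypothetical.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x so its root-mean-square is about 1, then apply a per-channel weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def linear(weight_rows, x):
    """Bias-free linear layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]

def gated_mlp(x, gate_w, up_w, down_w):
    """down_proj(SiLU(gate_proj(x)) * up_proj(x))"""
    gate = [silu(g) for g in linear(gate_w, x)]
    up = linear(up_w, x)
    return linear(down_w, [g * u for g, u in zip(gate, up)])

def decoder_block(x, attention, p):
    # Normalize, apply attention, add the result back (skip connection)
    h = [a + b for a, b in zip(x, attention(rms_norm(x, p["input_norm"])))]
    # Normalize again, apply the gated MLP, add the result back
    m = gated_mlp(rms_norm(h, p["post_attn_norm"]), p["gate"], p["up"], p["down"])
    return [a + b for a, b in zip(h, m)]

hidden, inter = 4, 6  # toy sizes (the real model uses 4096 and 14336)
params = {
    "input_norm": [1.0] * hidden,
    "post_attn_norm": [1.0] * hidden,
    "gate": [[0.1] * hidden for _ in range(inter)],
    "up": [[0.2] * hidden for _ in range(inter)],
    "down": [[0.05] * inter for _ in range(hidden)],
}
x = [0.5, -1.0, 2.0, 0.0]
out = decoder_block(x, attention=lambda h: h, p=params)  # identity stands in for attention
print(len(out), [round(v, 3) for v in out])
```

The output has the same width as the input, which is what allows N such blocks to be stacked.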

> This stack of embedding → N×(attention + MLP + norm) → norm → head defines a causal (autoregressive) LLM, generating each token from all preceding tokens.  
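
The autoregressive loop itself can be sketched as follows. This is a minimal greedy-decoding illustration with a hypothetical toy "model" (`toy_logits`); `model.generate()` may use a different strategy, such as sampling, depending on the model's generation config.

```python
def greedy_generate(next_token_logits, prompt_ids, eos_id, max_new_tokens=80):
    """Repeatedly pick the highest-scoring next token and append it to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)          # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:                    # stop at end-of-sequence
            break
    return ids

def toy_logits(ids):
    """Hypothetical 5-token model that always prefers (last token + 1) mod 5."""
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]

result = greedy_generate(toy_logits, prompt_ids=[0], eos_id=4)
print(result)  # [0, 1, 2, 3, 4]
```

Each step sees the whole sequence so far, which is what "causal" means here: the next token depends only on the tokens before it.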


> **Tip:** Always trace the tensor shapes—ensure your vocab size, embedding dim, hidden dims, and output dim all line up.  

In [7]:
# Displaying the model object prints a description of the underlying deep neural network
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409

In [8]:
# Generate the output: the joke we asked for, aimed at a room of AI engineers
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of AI Engineer<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's one:

Why did the AI model go to therapy?

Because it was struggling to process its emotions, but it kept getting stuck in a loop of self-reflection and couldn't "reboot" its feelings!<|eot_id|>
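
The attention-mask warnings printed above appear because only the input IDs were passed to `generate()`. One possible way to avoid them, sketched here as an assumption rather than as the notebook's own method, is to ask `apply_chat_template` for a dictionary (`return_dict=True`) and pass both `input_ids` and `attention_mask` through. The helper name `generate_reply` is hypothetical.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=80):
    """Generate a reply while passing an explicit attention mask."""
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,   # end the prompt with the assistant header
        return_dict=True,             # return input_ids and attention_mask together
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        **inputs,                     # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

With the `model`, `tokenizer`, and `messages` from the cells above, it would be called as `print(generate_reply(model, tokenizer, messages))`.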


In [9]:
# Clean up
del model, inputs, tokenizer, outputs
gc.collect()
torch.cuda.empty_cache()

## 4- Streaming Results

- **What?**  
  Text streaming sends generated tokens back **as soon as** they’re produced, instead of waiting for the full sequence.

- **Why?**  
  - **Lower latency:** See words appear in real time.  
  - **Interactive feel:** Better for chat UIs or demos.  
  - **Early stopping:** You can stop generation mid-stream if you’ve seen enough.

- **Important Notes**  
  - `TextStreamer` handles one sequence at a time (batch size 1).  
  - You must pass a `streamer` argument to `model.generate()`.  
  - Streaming adds minimal overhead, so throughput stays about the same.  
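
Conceptually, a streamer is an object that the generation loop hands each new chunk to as soon as it exists. The toy sketch below mimics that hand-off with `put()` and `end()` methods, the interface that Transformers streamers expose; `CollectingStreamer` and `fake_generate` are hypothetical stand-ins, not library code.

```python
class CollectingStreamer:
    """Toy streamer: prints each chunk immediately and remembers what it saw."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, chunk):
        self.chunks.append(chunk)
        print(chunk, end="", flush=True)   # show the text right away

    def end(self):
        self.finished = True
        print()                            # finish the line

def fake_generate(tokens, streamer=None):
    """Stand-in for a generation loop that reports each token as it is produced."""
    produced = []
    for token in tokens:
        produced.append(token)
        if streamer is not None:
            streamer.put(token)
    if streamer is not None:
        streamer.end()
    return produced

streamer = CollectingStreamer()
result = fake_generate(["Why ", "did ", "the ", "model ", "stream?"], streamer=streamer)
```

The full result is still returned at the end; streaming only changes when the text becomes visible.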


In [10]:
# Wrap tokenization, model loading, and streamed generation in one function

def generate(llm_model, messages):
    # Step 1 - Create the tokenizer that matches the model we are working with
    tokenizer = AutoTokenizer.from_pretrained(llm_model)
    tokenizer.pad_token = tokenizer.eos_token   # reuse the end-of-sequence token as the padding token
    # Step 2 - Apply the chat template to `messages` (with the generation prompt) and move the token IDs to the GPU
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    # Step 3 - Create a TextStreamer that decodes and prints tokens as they are generated
    streamer = TextStreamer(tokenizer)
    # Step 4 - Load the model with automatic device placement and the quantization settings defined earlier
    model = AutoModelForCausalLM.from_pretrained(llm_model, device_map="auto", quantization_config=quant_config)
    # Step 5 - Generate up to 80 new tokens, streaming each one as it is produced
    outputs = model.generate(inputs, max_new_tokens=80, streamer=streamer)
    # Step 6 - Free GPU memory before the next model is loaded
    del inputs, model, tokenizer, outputs
    gc.collect()
    torch.cuda.empty_cache()

In [11]:
generate(PHI3, messages)

<|system|> You are a helpful assistant<|end|><|user|> Tell a light-hearted joke for a room of AI Engineer<|end|><|assistant|> Why don't AI Engineers ever play hide and seek?

Because good luck hiding when they're always so good at finding you!<|end|>


In [13]:
# NOTE: to access Gemma from Google, first accept its terms on Hugging Face: https://huggingface.co/google/gemma-2-2b-it
# Gemma's chat template does not accept a system role, so only the user message is sent

messages = [
    {"role": "user", "content": "Tell a light-hearted joke for a room of AI Engineer"}
]

generate(GEMMA2, messages)

<bos><start_of_turn>user
Tell a light-hearted joke for a room of AI Engineer<end_of_turn>
<start_of_turn>model


The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


Why was the AI confused at the party? 

Because it couldn't find its *training data*! 😂 

---

Let me know if you'd like to hear another joke! 😊 
<end_of_turn>
