# Hugging Face LLM Course: Chapter 2 Study Guide

Use this notebook as a hands-on companion to [Chapter 2](https://huggingface.co/learn/llm-course/chapter2) of the 🤗 LLM course. Each section mirrors the chapter structure with runnable snippets, checklists, and reflection prompts.

## Learning objectives
- Understand what happens behind the `pipeline()` helper.
- Inspect raw model outputs (hidden states, logits) and turn them into human-readable predictions.
- Master tokenizer APIs: encoding, padding, truncation, attention masks, and decoding.
- Manage checkpoints, configs, and weights for models and tokenizers.
- Handle multi-sentence batches safely and reason about special tokens.
- Explore deployment considerations for optimized inference stacks.

## Table of contents
1. [Environment setup](#env-setup)
2. [Warm-up: sentiment pipeline](#sentiment-pipeline)
3. [Behind the pipeline: tokenizer → model → head](#behind-pipeline)
4. [Model zoo: creating, saving, and loading transformers](#models)
5. [Tokenizer deep dive: encoding, padding, truncation](#tokenizers)
6. [Handling batches, masks, and multiple sequences](#handling-batches)
7. [Special tokens & round-trips](#special-tokens)
8. [Wrapping up: from tokenizer to model](#wrap-up)
9. [Deployment bonus: TGI vs vLLM vs llama.cpp](#deployment)

> 📝 **Tip:** After running a block, jot down how the output ties back to the narrative in Chapter 2. Treat this like a lab notebook.


In [1]:
%mkdir /content/drive/MyDrive/Colab\ Notebooks/LLM-Course
%cd /content/drive/MyDrive/Colab Notebooks/LLM-Course

mkdir: cannot create directory ‘/content/drive/MyDrive/Colab Notebooks/LLM-Course’: File exists
/content/drive/MyDrive/Colab Notebooks/LLM-Course


## 1. Environment setup <a name="env-setup"></a>

These cells mirror the Colab setup used in the official course notebooks. Feel free to skip them locally; they mainly showcase how to organize assets on Google Drive when following along in Colab.


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## 2. Warm-up: `pipeline()` sentiment classifier <a name="sentiment-pipeline"></a>

This mirrors the opening example from the chapter. Observe how a single helper downloads a tokenizer, model, and head, then performs preprocessing → forward pass → postprocessing for you.

> ✅ **Practice:** Swap in your own sentences or another task (e.g., `"text-classification"`, `"question-answering"`) and note how little code changes.


In [3]:
from transformers import AutoTokenizer, AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(f"Tokenizer output = {inputs}")

model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(model_checkpoint)

outputs = model(**inputs)

print(f"\nHidden State dim = {outputs.last_hidden_state.shape}")


Tokenizer output = {'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}




Hidden State dim = torch.Size([2, 16, 768])


## 3. Behind the pipeline: tokenizer → base model → head <a name="behind-pipeline"></a>

Chapter 2 peels back the convenience wrapper. The next blocks walk through:
1. Instantiating the tokenizer and preparing tensors (padding, truncation, `return_tensors`).
2. Inspecting raw hidden states from `AutoModel`.
3. Adding a task-specific head via `AutoModelForSequenceClassification` and converting logits to probabilities.

> 🧪 **Goal:** Match the outputs from the `pipeline()` example by recreating each stage manually.


In [4]:
from transformers import AutoModelForSequenceClassification

head_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
head = AutoModelForSequenceClassification.from_pretrained(head_checkpoint)
outputs = head(**inputs)

print(f"\nHead output shape: {outputs.logits.shape}")
print(f"\nHead output: {outputs.logits}")

import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"\nPredictions: {predictions}")


Head output shape: torch.Size([2, 2])

Head output: tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

Predictions: tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


### 3.1 Inspecting raw hidden states
- `outputs.last_hidden_state` → shape `[batch, sequence_length, hidden_size]`
- Use this to plug custom heads or probe embeddings.

### 3.2 Adding a sequence-classification head
The course highlights how specialized heads (classification, QA, token tagging…) sit atop the shared transformer body. The next cell swaps `AutoModel` for `AutoModelForSequenceClassification` so we can obtain logits directly.


In [5]:
id2label = model.config.id2label
predicted_ids = torch.argmax(predictions, dim=-1)
readable_output = [
    {"label": id2label[idx.item()], "score": predictions[i, idx].item()}
    for i, idx in enumerate(predicted_ids)
]
readable_output

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### 3.3 Postprocessing logits
Softmax converts logits to probabilities, while `model.config.id2label` maps class indices to human-readable labels. Compare the list of dicts the `pipeline` returned with `readable_output`: the labels match and the scores agree to about six decimal places.
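
To make the postprocessing step concrete without any framework, here is a minimal pure-Python sketch that applies softmax to the logits printed earlier and looks up the winning label. The `ID2LABEL` dict and the `postprocess` helper are assumptions for illustration; the real mapping lives in the checkpoint's `config.id2label`.

```python
import math

# Logits copied from the notebook output (one row per sentence).
LOGITS = [[-1.5607, 1.6123], [4.1692, -3.3464]]
# Assumed mapping, mirroring the labels the pipeline printed.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Return pipeline-style dicts: the best label and its probability."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": id2label[best], "score": probs[best]})
    return results


readable = postprocess(LOGITS, ID2LABEL)
for item in readable:
    print(item["label"], round(item["score"], 4))
```

Because the logits above are rounded to four decimals, the scores agree with the pipeline output only to roughly that precision.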


## 4. Model zoo: creating, saving, and loading transformers <a name="models"></a>

These steps echo Section 3 of the chapter. Focus on:
- **Auto classes** (`AutoModel`, `AutoModelForSequenceClassification`, etc.) for quick instantiation.
- **Explicit architecture classes** (`BertModel`) when you need tight control.
- **Config vs weights:** `config.json` captures hyperparameters, while `.safetensors` (or `.bin`) stores learned parameters.
- **Portability:** `save_pretrained()` + `from_pretrained()` make it trivial to checkpoint locally, reload, or share via the Hub.


In [6]:
from transformers import AutoModel, BertModel

# AutoModel: fetch the appropriate model architecture for a given checkpoint.
# It's an "auto" class, meaning it will guess the appropriate model architecture and instantiate the correct model class.
auto_model = AutoModel.from_pretrained("bert-base-cased")

# BertModel: if you know the type of model you want to use, you can use the class that defines its architecture directly
bert_model = BertModel.from_pretrained("bert-base-cased")

# Save the config and weights to ./models
bert_model.save_pretrained("./models")



> 📌 **Checkpoint anatomy reminder**
> - `config.json`: architecture + metadata (hidden size, num layers, label mappings, etc.).
> - `model.safetensors`: weights/state dict. Prefer `.safetensors` for speed + security.
> - `tokenizer.json` / `vocab.txt`: tokenizer assets (see later section).


In [7]:
%ls && cd models/ && ls

[0m[01;34mmodels[0m/  [01;34mTokenizer_weigts[0m/
config.json  model.safetensors


In [8]:
%cd models/
%cat config.json

/content/drive/MyDrive/Colab Notebooks/LLM-Course/models
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
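
The config alone is enough to estimate how large the weights file should be. The sketch below plugs the values printed above into a parameter count for a standard BERT encoder with a pooler; the `count_bert_parameters` helper and the assumed layer layout (biases on every linear layer, one LayerNorm after the embeddings, after attention, and after the feed-forward block) are illustrative assumptions, not code from the library.

```python
# Values copied from the config.json printed above.
config = {
    "vocab_size": 28996,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}


def count_bert_parameters(cfg):
    """Estimate parameters for a BERT encoder plus pooler (assumed layout)."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    embeddings = (
        cfg["vocab_size"] * h                 # word embeddings
        + cfg["max_position_embeddings"] * h  # position embeddings
        + cfg["type_vocab_size"] * h          # segment embeddings
        + 2 * h                               # LayerNorm weight and bias
    )
    attention = 4 * (h * h + h) + 2 * h       # Q, K, V, output projections + LayerNorm
    feed_forward = (h * ff + ff) + (ff * h + h) + 2 * h  # two linear layers + LayerNorm
    pooler = h * h + h
    per_layer = attention + feed_forward
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler


total = count_bert_parameters(config)
print(f"{total:,} parameters, about {total * 4 / 1e6:.0f} MB in float32")
```

At four bytes per float32 parameter, the estimate lands close to the size reported when `model.safetensors` was downloaded, which is a quick sanity check that `config.json` and the weights describe the same architecture.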


In [9]:
# Load
%cd /content/drive/MyDrive/Colab Notebooks/LLM-Course

loaded_model = AutoModel.from_pretrained("models")

/content/drive/MyDrive/Colab Notebooks/LLM-Course


## 5. Tokenizer deep dive: encoding text <a name="tokenizers"></a>

Follow Section 4 of the chapter: tokenizers translate raw strings into tensors the model can digest. The upcoming cells cover encoding, decoding, padding, truncation, and batching behaviors.


In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encode_input = tokenizer("Hello, How are you!")

encode_input

{'input_ids': [101, 8667, 117, 1731, 1132, 1128, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

### 5.1 Single-sentence encoding
- `return_tensors="pt"` vs `"np"` toggles backend outputs.
- Inspect `input_ids`, `token_type_ids`, `attention_mask` as described in the course.


In [11]:
# decode the input IDs
tokenizer.decode(encode_input.input_ids)

'[CLS] Hello, How are you! [SEP]'

### 5.2 Decoding round-trip
Use `tokenizer.decode()` (optionally `skip_special_tokens=True`) to verify reversible transformations.


In [12]:
# encode & decode a sentence pair: two positional strings become one combined sequence
encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)

tokenizer.decode(encoded_input.input_ids)

{'input_ids': [101, 1731, 1132, 1128, 136, 102, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


"[CLS] How are you? [SEP] I ' m fine, thank you! [SEP]"

### 5.3 Encoding multiple sentences
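Passing two strings as separate arguments encodes them as one sentence pair joined by `[SEP]`, with `token_type_ids` marking the second segment. Passing a *list* of strings instead returns one entry per sentence in each field; the course highlights how this maps to batching later.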
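
As a rough illustration of what the tokenizer assembles for a BERT-style sentence pair, the sketch below rebuilds the `input_ids` and `token_type_ids` printed above from the two sentences' token IDs. The `build_pair` helper is hypothetical, and the special-token IDs are simply read off the printed output.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs visible in the printed encoding


def build_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment IDs and an attention mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token IDs copied from the printed encoding, with the special tokens removed.
how_are_you = [1731, 1132, 1128, 136]
im_fine = [146, 112, 182, 2503, 117, 6243, 1128, 106]

pair = build_pair(how_are_you, im_fine)
print(pair["input_ids"])
print(pair["token_type_ids"])
```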


In [13]:
# Padding Inputs
encode_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
encode_input

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  136,  102,    0,    0,    0,    0],
        [ 101,  146,  112,  182, 2503,  117, 6243, 1128,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

### 5.4 Padding inputs
The `padding=True` flag pads every row to the longest sequence in the batch, producing rectangular tensors, and sets `attention_mask` to 0 at the padded positions, exactly as described in Figure 5 of the chapter.


In [14]:
# decode the padded input IDs (skip_special_tokens drops [CLS], [SEP], and [PAD])
decoded_outputs = [tokenizer.decode(ids, skip_special_tokens=True) for ids in encode_input.input_ids]
print(decoded_outputs)

['How are you?', "I ' m fine, thank you!"]


In [15]:
# Truncating inputs
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])

[101, 1188, 1110, 170, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1263, 5650, 119, 102]


### 5.5 Truncation controls
Use `truncation=True` (plus optional `max_length`) to respect model constraints (e.g., 512 tokens for BERT-family checkpoints).


In [16]:
# Combining padding and truncation with max_length caps every row at max_length
encode_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
# Both sentences exceed 5 tokens, so each row is cut to 5 and input_ids has shape [2, 5]
print(encode_input)


### 5.6 Exact-shape tensors
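Combining `padding`, `truncation`, and `max_length` caps every row at `max_length` and pads shorter rows to match the longest one. Use `padding="max_length"` when every batch must have exactly the same width, which is handy when exporting to ONNX/TorchScript or batching across devices.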
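
The interplay between truncation, special tokens, and padding can be sketched in a few lines of plain Python. This is an illustrative approximation of the behavior for a single BERT-style sequence, not the tokenizer's actual implementation; `encode_fixed` is a hypothetical helper and the IDs are taken from the outputs in this section.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # BERT-style IDs seen in the outputs above


def encode_fixed(token_ids, max_length):
    """Truncate, add [CLS]/[SEP], then pad so the row is exactly max_length long."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    padding = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * padding
    return input_ids + [PAD_ID] * padding, attention_mask


how_are_you = [1731, 1132, 1128, 136]  # "How are you?" without special tokens
short_ids, short_mask = encode_fixed(how_are_you, max_length=5)
long_ids, long_mask = encode_fixed(how_are_you, max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=5` the question mark is dropped to make room for `[SEP]`; with `max_length=8` nothing is cut and two padding positions are masked out.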


In [17]:
encoded_sequences = [
    [
        101,
        1045,
        1005,
        2310,
        2042,
        3403,
        2005,
        1037,
        17662,
        12172,
        2607,
        2026,
        2878,
        2166,
        1012,
        102,
    ],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
import torch

# model_inputs = torch.tensor(encoded_sequences)

# Uncommenting the line above raises an error because the rows have different lengths;
# tensors must be rectangular, so pad or truncate first

### 5.7 Tokenizer algorithms in practice

Sections 4 and 5 of the chapter survey word, character, and subword strategies. Use the following snippets to compare behaviors and vocabulary sizes.


In [18]:
# word-based
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


- **Word-based tokenizers** explode vocabulary size and struggle with out-of-vocabulary (OOV) tokens.
- **Character-based tokenizers** shrink the vocab but lengthen sequences.
- **Subword tokenizers** (WordPiece/BPE/Unigram) strike a balance by keeping frequent words intact and splitting rare ones into morphemes.
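
The out-of-vocabulary trade-off in the first two bullets can be seen with a toy experiment. The sketch below builds a word-level and a character-level vocabulary from one training sentence and encodes an unseen sentence with each; the `encode` helper, the `[UNK]` placeholder, and both sentences are assumptions chosen for illustration.

```python
UNK = "[UNK]"


def encode(tokens, vocab):
    """Replace every token missing from the vocabulary with the unknown marker."""
    return [tok if tok in vocab else UNK for tok in tokens]


train_text = "Jim Henson was a puppeteer"
new_text = "Jim was a puppet"  # "puppet" never appears as a whole word in training

word_vocab = set(train_text.split())  # one entry per distinct word
char_vocab = set(train_text)          # one entry per distinct character

word_tokens = encode(new_text.split(), word_vocab)
char_tokens = encode(list(new_text), char_vocab)

print(word_tokens)       # the unseen word collapses to [UNK]
print(len(char_tokens))  # no unknowns, but a much longer sequence
```

The word-level encoding loses "puppet" entirely, while the character-level encoding keeps every symbol at the cost of four times as many tokens. Subword tokenizers aim for the middle ground.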


In [19]:
# character-based tokenizer

# subword tokenization

### 5.8 Loading & saving tokenizers
Mirror how the chapter reuses `from_pretrained()`/`save_pretrained()` for tokenizer assets. Compare `AutoTokenizer` with an architecture-specific class such as `BertTokenizer`, which pins the tokenizer class explicitly instead of inferring it from the checkpoint.


In [20]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

auto_tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [21]:
bert_tokenizer("Using a Transformer network is simple") == auto_tokenizer("Using a Transformer network is simple")

True

In [22]:
%cd /content/drive/MyDrive/Colab Notebooks/LLM-Course
auto_tokenizer.save_pretrained("Tokenizer_weigts")

/content/drive/MyDrive/Colab Notebooks/LLM-Course


('Tokenizer_weigts/tokenizer_config.json',
 'Tokenizer_weigts/special_tokens_map.json',
 'Tokenizer_weigts/vocab.txt',
 'Tokenizer_weigts/added_tokens.json',
 'Tokenizer_weigts/tokenizer.json')

In [23]:
# Tokenization
sequence = "Using a Transformer network is simple"
tokens = auto_tokenizer.tokenize(sequence)

print(f"Tokens = {tokens}")

# From tokens to input ID
ids = auto_tokenizer.convert_tokens_to_ids(tokens=tokens)
print(f"IDS = {ids}")

Tokens = ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
IDS = [7993, 170, 13809, 23763, 2443, 1110, 3014]


> 🔁 **Two-stage encoding recap**
> 1. `tokenize()` → list of subword strings.
> 2. `convert_tokens_to_ids()` → vocabulary indices ready for tensors.
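
To see why `Transformer` splits into `Trans` and `##former`, here is a toy greedy longest-match-first splitter in the spirit of WordPiece, followed by the ID lookup. It is a simplified sketch: `TOY_VOCAB` holds only the handful of entries needed for this sentence (IDs copied from the output above, with the `[UNK]` ID assumed), and the `toy_*` functions are hypothetical stand-ins, not the library's implementation.

```python
# Toy vocabulary; real checkpoints ship tens of thousands of entries.
TOY_VOCAB = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
    "[UNK]": 100,  # assumed ID for the unknown token
}


def toy_wordpiece(word, vocab):
    """Greedily take the longest vocabulary match, marking continuations with ##."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Stage 1: split on whitespace, then split each word into subwords."""
    return [piece for word in text.split() for piece in toy_wordpiece(word, vocab)]


def toy_convert_tokens_to_ids(tokens, vocab):
    """Stage 2: look each subword up in the vocabulary."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = toy_tokenize("Using a Transformer network is simple", TOY_VOCAB)
ids = toy_convert_tokens_to_ids(tokens, TOY_VOCAB)
print(tokens)
print(ids)
```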


### Decoding

In [24]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple


## 6. Handling multiple sequences, padding, and attention masks <a name="handling-batches"></a>

Sections 5 and 6 of the chapter emphasize that tensors must be rectangular and that padding tokens should be masked out. The next cells intentionally contrast per-sequence processing vs. properly padded batches.


In [25]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)  # a single flat list of IDs; the model expects a batch (a list of lists)

input_ids = torch.tensor([ids])

print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


### 6.1 Manual batching pitfalls
Creating tensors from raw ID lists fails when rows differ in length, which is precisely why tokenizers add padding for you.


### Padding & attention masks

In [26]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(tokenizer.pad_token_id)
print("---------------Sequence by Sequence-----------")
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

# Note: without a mask, the second row of the padded batch differs from the logits of sequence 2 on its own
print("---------------Padding Batch-----------")

print(model(torch.tensor(batched_ids)).logits)

# An attention mask removes the mismatch by telling the model to ignore the padding token
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]
print("---------------Padding Batch & Attention masks-----------")
print(model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


0
---------------Sequence by Sequence-----------
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
---------------Padding Batch-----------
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)
---------------Padding Batch & Attention masks-----------
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


### 6.2 Attention masks save the day
Padding tokens should contribute `0` to the mask so the self-attention mechanism ignores them. Compare logits with and without masks to see the impact highlighted in the course.
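
A small numeric sketch shows what "ignoring" means inside attention: masked positions are pushed to negative infinity before the softmax, so they receive exactly zero weight. The scores below are hypothetical values for a single query, and `masked_softmax` is an illustrative helper; real implementations typically add a large negative number to the scores inside the model rather than using `-inf` in Python lists.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over attention scores, excluding positions where mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # assumes at least one position is unmasked
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]  # hypothetical raw attention scores; the last slot is padding
weights_unmasked = masked_softmax(scores, [1, 1, 1])
weights_masked = masked_softmax(scores, [1, 1, 0])

print([round(w, 3) for w in weights_unmasked])
print([round(w, 3) for w in weights_masked])
```

Without the mask, the padding position soaks up part of the attention weight and changes every downstream value, which is why the unmasked batch produced different logits.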


In [27]:
sequence1 = "I've been waiting for a HuggingFace course my whole life."
sequence2 = "I hate this so much!"

print("--- Step 1: Processing sequences individually ---")
ids1 = tokenizer.encode(sequence1)
input_ids1 = torch.tensor([ids1])
output1 = model(input_ids1)
print(f"Logits (Sentence 1): {output1.logits}")
ids2 = tokenizer.encode(sequence2)
input_ids2 = torch.tensor([ids2])
output2 = model(input_ids2)
print(f"Logits (Sentence 2): {output2.logits}")

print("\n--- Step 2: Processing sequences as a batch ---")

max_length = max(len(ids1), len(ids2))
pad_token_id = tokenizer.pad_token_id

padded_ids2 = ids2 + [pad_token_id] * (max_length - len(ids2))
batched_ids = torch.tensor([ids1, padded_ids2])
print(f"Batched Input IDs shape: {batched_ids.shape}")

mask1 = [1] * len(ids1)
mask2 = [1] * len(ids2) + [0] * (max_length - len(ids2))

attention_mask = torch.tensor([mask1, mask2])
print(f"Attention Mask shape: {attention_mask.shape}")
batched_outputs = model(batched_ids, attention_mask=attention_mask)

print(f"\nLogits (Batched):\n{batched_outputs.logits}")

--- Step 1: Processing sequences individually ---
Logits (Sentence 1): tensor([[-1.5607,  1.6123]], grad_fn=<AddmmBackward0>)
Logits (Sentence 2): tensor([[ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

--- Step 2: Processing sequences as a batch ---
Batched Input IDs shape: torch.Size([2, 16])
Attention Mask shape: torch.Size([2, 16])

Logits (Batched):
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### 6.3 Step-by-step batching checklist
1. Tokenize each sequence.
2. Pad shorter ones with `tokenizer.pad_token_id`.
3. Build `attention_mask` with 1s for real tokens and 0s for padding.
4. Feed both `input_ids` and `attention_mask` to the model.
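
Steps 2 and 3 of the checklist generalize to any number of sequences. The helper below is a minimal sketch of that idea in plain Python; `pad_batch` is a hypothetical name, and the IDs are the two tokenized sentences from earlier in this notebook, with `0` as the pad ID reported by `tokenizer.pad_token_id`.

```python
def pad_batch(sequences, pad_token_id):
    """Pad every sequence to the longest one and build the matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real token, 0 = pad
    return input_ids, attention_mask


# Token IDs copied from the tokenizer output shown earlier (special tokens included).
ids1 = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
        2607, 2026, 2878, 2166, 1012, 102]
ids2 = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch_ids, batch_mask = pad_batch([ids1, ids2], pad_token_id=0)
print(len(batch_ids[0]), len(batch_ids[1]))
print(batch_mask[1])
```

Wrapping both lists in `torch.tensor(...)` and passing them as `input_ids` and `attention_mask` completes step 4.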


## 7. Special tokens, padding strategies, and putting it all together <a name="special-tokens"></a>

The tokenizer exposes multiple padding/truncation policies plus auto-added special tokens (`[CLS]`, `[SEP]`, `[PAD]`). Experiment with each mode and observe how `input_ids`, `attention_mask`, and `token_type_ids` change.


In [28]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
print("\n-------Padding Longest-------\n")
print(model_inputs)

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print("\n-------Padding Max-Length-------\n")
print(model_inputs)

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
print("\n-------Padding  Custom Length-------\n")
print(model_inputs)

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print("\n-------Truncate Max Length-------\n")
print(model_inputs)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
print("\n-------Truncate  Custom Length-------\n")
print(model_inputs)

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
print("\n-------Pytorch Tensor-------\n")
print(model_inputs)

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print("\n-------NumPy Arrays-------\n")
print(model_inputs)


-------Padding Longest-------

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

-------Padding Max-Length-------

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102, 0, 0, 0, ... (output truncated; each row is padded with 0s up to the model maximum length of 512)

### Special tokens


In [29]:
model_inputs = tokenizer(sequence)
print("\n-------Auto Tokenizer Encode------\n")
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print("\n-------Manual Tokenizer Encode------\n")
print(ids)

print("\n-------Auto Tokenizer Decode------\n")
print(tokenizer.decode(model_inputs["input_ids"]))
print("\n-------Manual Tokenizer Decode------\n")
print(tokenizer.decode(ids))


-------Auto Tokenizer Encode------

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

-------Manual Tokenizer Encode------

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

-------Auto Tokenizer Decode------

[CLS] i've been waiting for a huggingface course my whole life. [SEP]

-------Manual Tokenizer Decode------

i've been waiting for a huggingface course my whole life.


## 8. Wrapping up: From tokenizer to model <a name="wrap-up"></a>

This final block mirrors the chapter finale: combine everything into a concise helper that tokenizes, batches, and feeds directly into a task head. Use it as a sanity check after tinkering with individual pieces.


In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

## 9. Deployment bonus: TGI vs vLLM vs llama.cpp <a name="deployment"></a>

Use this section to connect the course material with production-ready inference stacks. The cells here run a small GGUF checkpoint through the llama.cpp Python bindings; record your own benchmarking notes as you experiment.


In [31]:
!pip install llama-cpp-python


In [32]:
%cd /content/drive/MyDrive/Colab\ Notebooks/LLM-Course
%mkdir models
%cd models

/content/drive/MyDrive/Colab Notebooks/LLM-Course
mkdir: cannot create directory ‘models’: File exists
/content/drive/MyDrive/Colab Notebooks/LLM-Course/models


In [33]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lmstudio-community/SmolLM2-1.7B-Instruct-GGUF",
    filename="SmolLM2-1.7B-Instruct-Q3_K_L.gguf",
    n_ctx=4096,  # Context window size
    n_threads=8,  # CPU threads
    n_gpu_layers=0,  # GPU layers (0 = CPU only)
)


# Format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""

# Generate response with precise parameter control
output = llm(
    prompt,
    max_tokens=200,
    temperature=0.8,
    top_p=0.95,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    stop=["<|im_end|>"],
)

print(output["choices"][0]["text"])


llama_model_loader: loaded meta data with 34 key-value pairs and 218 tensors from /root/.cache/huggingface/hub/models--lmstudio-community--SmolLM2-1.7B-Instruct-GGUF/snapshots/e54b91ee756cb2d5fd8615757239acc89f88c7e0/./SmolLM2-1.7B-Instruct-Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Smollm2 1.7B 8k Mix7 Ep2 v2
llama_model_loader: - kv   3:                            general.version str              = v2
llama_model_loader: - kv   4:                       general.organization str              = Loubnabnl
llama_model_loader: - kv   5:                           general.finetune str              = 8k-mix7-ep2
llama

In the heart of a mystical forest, there lived a young girl named Lila. Lila was known for her extraordinary abilities – she could communicate with animals, and her touch could heal even the most grievous wounds. Despite her gifts, Lila was often lonely, as she preferred to spend time in solitude, surrounded by nature's beauty rather than human company.

One day, while wandering through the forest, Lila stumbled upon an ancient tree unlike any she had ever seen before. Its trunk was twisted and gnarled, its branches reaching towards the sky like a grand cathedral. As soon as Lila touched the tree, she felt an energy coursing through her veins, and suddenly, she was flooded with visions of the world's history.

Lila witnessed the rise and fall of civilizations, the birth of new worlds, and the eternal struggle between light and darkness. The tree spoke to her in whispers, its wisdom echoing deep within Lila's mind. As the days
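
The hand-written prompt in the llama.cpp cell follows a ChatML-style layout, which becomes easier to maintain once it is generated from a list of messages. The helper below is a hypothetical sketch that reproduces the exact string used above; the checkpoint's own chat template may place whitespace differently, so treat the layout as an assumption to verify against the model card.

```python
def build_chatml_prompt(messages):
    """Render role/content pairs in the layout used by the llama.cpp cell."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
        )
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [
        {"role": "system", "content": "You are a creative storyteller."},
        {"role": "user", "content": "Write a creative story"},
    ]
)
print(prompt)
```

Passing `stop=["<|im_end|>"]` to the generation call, as the cell above does, ends generation when the model closes its own turn.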


In [34]:
output

{'id': 'cmpl-bcf5dd5c-fc7b-4f05-be7a-84468f4a5c20',
 'object': 'text_completion',
 'created': 1763215117,
 'model': '/root/.cache/huggingface/hub/models--lmstudio-community--SmolLM2-1.7B-Instruct-GGUF/snapshots/e54b91ee756cb2d5fd8615757239acc89f88c7e0/./SmolLM2-1.7B-Instruct-Q3_K_L.gguf',
 'choices': [{'text': "In the heart of a mystical forest, there lived a young girl named Lila. Lila was known for her extraordinary abilities – she could communicate with animals, and her touch could heal even the most grievous wounds. Despite her gifts, Lila was often lonely, as she preferred to spend time in solitude, surrounded by nature's beauty rather than human company.\n\nOne day, while wandering through the forest, Lila stumbled upon an ancient tree unlike any she had ever seen before. Its trunk was twisted and gnarled, its branches reaching towards the sky like a grand cathedral. As soon as Lila touched the tree, she felt an energy coursing through her veins, and suddenly, she was flooded w

### Reflection prompts
- Which stage of the pipeline (tokenizer, base model, head) felt most opaque before this chapter? Note a key insight after running each block.
- Can you explain why attention masks are necessary using your own words?
- Swap the checkpoint (e.g., `distilroberta-base`) and document what changes in the config, tokenizer vocab, and special tokens.
- Summarize the trade-offs between TGI, vLLM, and llama.cpp for your deployment context.
