# Tokenization → Token IDs → Embeddings (DistilBERT)

This tutorial walks through, step by step, how an open-source Transformer language model turns text into vectors:

**Text → Tokens → Token IDs → Embeddings → Transformer**

We use `distilbert-base-uncased` because it is small and quick to load. "Uncased" means the tokenizer lowercases the input before splitting it.

## 1. Install dependencies (run once)

The notebook needs PyTorch and Hugging Face Transformers, which can be installed with, for example, `pip install torch transformers`.

## 2. Load tokenizer and model

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L… (output truncated)

## 3. Example sentence

In [3]:
sentence = 'Turbulence modeling is hard'
sentence

'Turbulence modeling is hard'

## 4. Tokenization (text → tokens)

In [4]:
tokens = tokenizer.tokenize(sentence)
tokens

['turbulence', 'modeling', 'is', 'hard']
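
All four words in this sentence come back as whole tokens. BERT-family tokenizers use WordPiece, which splits a word into subword pieces when the whole word is not in the vocabulary. The following is a minimal sketch of the greedy longest-match-first idea, not the library's implementation; `toy_vocab`, `wordpiece_tokenize`, and `toy_tokenize` are hypothetical names, and the vocabulary is made up for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real 30522-entry one.
toy_vocab = {"turbulence", "modeling", "is", "hard", "turbo", "##fan", "##s"}


def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


example_tokens = toy_tokenize("Turbulence modeling is hard", toy_vocab)
subword_tokens = toy_tokenize("turbofans", toy_vocab)
unknown_tokens = toy_tokenize("vorticity", toy_vocab)

print(example_tokens)  # ['turbulence', 'modeling', 'is', 'hard']
print(subword_tokens)  # ['turbo', '##fan', '##s']
print(unknown_tokens)  # ['[UNK]']
```

The real tokenizer also handles punctuation, accents, and other normalization that this sketch leaves out.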

## 5. Token IDs (tokens → numbers)

In [5]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[29083, 11643, 2003, 2524]
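
Conceptually, this step is a dictionary lookup from token string to integer. The sketch below mimics it with a toy dictionary; the four word IDs are copied from the output above, while `[UNK]` = 100 is an assumption about this vocabulary, and the helper names `toy_tokens_to_ids` and `toy_ids_to_tokens` are hypothetical.

```python
# Word IDs copied from the tutorial output; [UNK] = 100 is an assumption.
toy_vocab = {
    "[UNK]": 100,
    "turbulence": 29083,
    "modeling": 11643,
    "is": 2003,
    "hard": 2524,
}


def toy_tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def toy_ids_to_tokens(ids, vocab):
    """Invert the vocabulary to map IDs back to token strings."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return [id_to_token[token_id] for token_id in ids]


ids = toy_tokens_to_ids(["turbulence", "modeling", "is", "hard"], toy_vocab)
round_trip = toy_ids_to_tokens(ids, toy_vocab)
with_unknown = toy_tokens_to_ids(["turbulence", "vorticity"], toy_vocab)

print(ids)           # [29083, 11643, 2003, 2524]
print(round_trip)    # ['turbulence', 'modeling', 'is', 'hard']
print(with_unknown)  # [29083, 100]
```

The mapping is fixed when the tokenizer is trained, so the same token always maps to the same ID regardless of context.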

## 6. Full tokenizer output (what the model really sees)

In [6]:
encoded = tokenizer(sentence, return_tensors='pt')
encoded

{'input_ids': tensor([[  101, 29083, 11643,  2003,  2524,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
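
Compared with step 5, the full tokenizer call wraps the IDs in two special tokens, 101 (`[CLS]`) and 102 (`[SEP]`), and adds an attention mask. The sketch below reproduces that with plain lists and also shows what padding to a fixed length would look like. `build_inputs` is a hypothetical helper, and `PAD_ID = 0` is inferred from `padding_idx=0` in the model printout.

```python
CLS_ID, SEP_ID = 101, 102  # special tokens seen in the output above
PAD_ID = 0                 # inferred from padding_idx=0 in the model printout


def build_inputs(token_ids, max_length=None):
    """Wrap IDs in [CLS] ... [SEP] and build the matching attention mask."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    if max_length is not None and len(input_ids) < max_length:
        pad = max_length - len(input_ids)
        input_ids += [PAD_ID] * pad
        attention_mask += [0] * pad        # 0 tells the model to ignore padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


single = build_inputs([29083, 11643, 2003, 2524])
padded = build_inputs([29083, 11643, 2003, 2524], max_length=8)

print(single["input_ids"])       # [101, 29083, 11643, 2003, 2524, 102]
print(single["attention_mask"])  # [1, 1, 1, 1, 1, 1]
print(padded["input_ids"])       # [101, 29083, 11643, 2003, 2524, 102, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

With a single sentence there is nothing to pad, which is why the mask above is all ones. Padding only matters when sentences of different lengths are batched together.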

## 7. Embedding + Transformer output

In [7]:
with torch.no_grad():
    outputs = model(**encoded)

last_hidden_state = outputs.last_hidden_state
last_hidden_state.shape

torch.Size([1, 6, 768])

The shape reads as `[batch_size, sequence_length, hidden_size]`: one sentence, six tokens (the four words plus `[CLS]` and `[SEP]`), and a 768-dimensional vector for each token.

## 8. Inspect the contextual embedding for a token

In [9]:
# Index 0 is [CLS], so index 1 corresponds to 'turbulence'
turbulence_embedding = last_hidden_state[0, 1]
turbulence_embedding.shape

torch.Size([768])

## Key Takeaways

- The model **never sees text or words**
- It only sees **token IDs (integers)**
- Token IDs index rows of an **embedding matrix** (here 30522 rows × 768 columns)
- After embedding lookup, **only vectors exist**
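
The embedding lookup itself can be sketched in plain Python. The model printout in step 2 shows both a word-embedding table and a position-embedding table; the sketch below sums one row from each per token. It uses tiny random matrices with hypothetical sizes and skips the LayerNorm and dropout that the real embedding layer applies afterwards.

```python
import random

random.seed(0)

# Toy sizes for readability; the real model uses 30522, 512, and 768.
VOCAB_SIZE, MAX_POSITIONS, HIDDEN = 10, 8, 4


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols table of random floats."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


word_embeddings = random_matrix(VOCAB_SIZE, HIDDEN)
position_embeddings = random_matrix(MAX_POSITIONS, HIDDEN)


def embed(input_ids):
    """Look up one row per token ID and add the row for its position."""
    vectors = []
    for position, token_id in enumerate(input_ids):
        word_vec = word_embeddings[token_id]      # row selected by token ID
        pos_vec = position_embeddings[position]   # row selected by position
        vectors.append([w + p for w, p in zip(word_vec, pos_vec)])
    return vectors


toy_input_ids = [1, 7, 3, 7]  # hypothetical IDs; token 7 appears twice
embedded = embed(toy_input_ids)

print(len(embedded), len(embedded[0]))  # 4 4
print(embedded[1] == embedded[3])       # False: same token, different position
```

From this point on the Transformer layers operate only on these vectors, which is what the last takeaway refers to.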

**Tokenization is a compiler front-end for neural networks.**