# **Token Embedding in Transformers**


Token embeddings map discrete text tokens (subwords, characters, bytes) into continuous vectors so attention and neural layers can operate in $\mathbb{R}^d$. They form the model’s primary lexical interface: input embeddings are consumed by the Transformer body, and (often) output logits are produced by projecting Transformer outputs back to token logits.

## **Types & tokenization (concise)**

- `One-hot + embedding matrix (canonical)`
  
- `Subword tokenization`
  
- `Byte-level / character-level`
  
- `Hybrid / morpheme-aware`
  
- `Contextual embeddings (layered)`

## **Mathematical representation (essential formulas)**

- Embedding lookup (index $t$):
$$e_t = E_{i_t} \quad\text{or}\quad e_t = E[i_t]$$
where $i_t$ is the token id at position $t$ and $E \in \mathbb{R}^{V\times D}$ is the embedding matrix ($V$ = vocabulary size, $D$ = embedding dimension).


- Input to Transformer (with positional term):
$$x_t = e_t + PE_t$$
or, more generally, $X = E[\text{tokens}] + PE$.


- Output logits (untied), for a final hidden state $h \in \mathbb{R}^D$:
$$z = W^\top h + b,\quad W \in \mathbb{R}^{D\times V},\ b \in \mathbb{R}^{V}$$


- Weight tying / shared input-output (common):
$$W = E^\top \quad\Rightarrow\quad z = E h$$
(bias omitted; tying removes the separate $D\times V$ output matrix and often improves calibration).


- Softmax probability:
$$p = \mathrm{softmax}(z)$$
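
The formulas above can be traced end to end in a few lines of PyTorch. This is a minimal sketch, not a model: the sizes `V`, `D`, `T`, the random positional table `PE`, and the stand-in hidden state `h` are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D, T = 50, 8, 4                 # toy vocabulary size, embedding dim, sequence length (assumed)

E = torch.randn(V, D)              # embedding matrix, one row per token id
PE = torch.randn(T, D)             # stand-in positional term (assumed to be a learned table)
token_ids = torch.tensor([3, 17, 42, 3])

e = E[token_ids]                   # lookup: e_t = E[i_t], shape (T, D)
x = e + PE                         # Transformer input: x_t = e_t + PE_t

h = torch.randn(D)                 # stand-in for one Transformer output vector

W = torch.randn(D, V)              # untied output projection
b = torch.zeros(V)
z_untied = W.T @ h + b             # z = W^T h + b, shape (V,)

z_tied = E @ h                     # tied case: W = E^T, so z = E h (bias omitted)
p = F.softmax(z_tied, dim=-1)      # probabilities over the vocabulary

print(x.shape, z_untied.shape, z_tied.shape)
```

Note that the two occurrences of token id `3` get the same row of `E` but different inputs `x`, because the positional term differs.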

### **One-hot + embedding matrix (canonical)**

- Each token is an integer id. Multiplying a one-hot vector by $E \in \mathbb{R}^{V\times D}$ selects one row of $E$, so in practice the product is implemented as an index lookup. Often used with subword vocabularies.

- The most direct way to show token embeddings:

In [3]:
import torch
import torch.nn as nn

vocab_size = 10000
embedding_dim = 128

embedding = nn.Embedding(vocab_size, embedding_dim)

# Token IDs (batch of 3 tokens)
token_ids = torch.tensor([1, 42, 999])
embeds = embedding(token_ids)
print(embeds.shape)  # (3, embedding_dim)

torch.Size([3, 128])


### **Subword tokenization** 

- `BPE`, `WordPiece`, and `Unigram` trade vocabulary size against the ability to represent out-of-vocabulary (`OOV`) words, which they split into known pieces.
- Most Transformers today use subword tokenization (e.g., BPE in GPT, WordPiece in BERT).

In [2]:
from transformers import AutoTokenizer, AutoModel
import torch

# Example: BERT tokenizer (WordPiece)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers are powerful."
tokens = tokenizer(text, return_tensors="pt")

# Token IDs
print(tokens["input_ids"])

# Embeddings lookup (subwords -> vectors)
with torch.no_grad():
    embeddings = model.get_input_embeddings()(tokens["input_ids"])
print(embeddings.shape)  # (batch, seq_len, hidden_dim)

tensor([[  101, 19081,  2024,  3928,  1012,   102]])
torch.Size([1, 6, 768])


In [4]:
print(embeddings)

tensor([[[ 0.0136, -0.0265, -0.0235,  ...,  0.0087,  0.0071,  0.0151],
         [ 0.0189, -0.0289, -0.0768,  ...,  0.0116, -0.0212,  0.0171],
         [-0.0134, -0.0135,  0.0250,  ...,  0.0013, -0.0183,  0.0227],
         [-0.0369, -0.0211, -0.0339,  ..., -0.0305, -0.0492, -0.0583],
         [-0.0207, -0.0020, -0.0118,  ...,  0.0128,  0.0200,  0.0259],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]]])


### **Byte-level / character-level** 

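- Operates on bytes or characters: the vocabulary is small and unknown words cannot occur, but sequences get longer.

* Byte-level Tokenization

Used in `GPT-2/GPT-3/GPT-4` (as byte-level BPE) for robustness (no OOV): text is first converted to UTF-8 bytes, and BPE merges are applied on top of those bytes.

(Behind the scenes: characters, emojis, and punctuation are split into UTF-8 byte sequences, so a single emoji can span several token ids.)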

In [9]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Transformers 🚀"
tokens = tokenizer(text, return_tensors="pt")

print(tokens["input_ids"])  # byte-level BPE token ids

tensor([[41762,   364, 12520,   248,   222]])


* Character-level Tokenization

Character-level embeddings treat each character as a token.

In [10]:
# Character-level embeddings treat each character as a token.
import torch
import torch.nn as nn

# Toy character vocabulary
chars = list("abcdefghijklmnopqrstuvwxyz ")
vocab = {c: i for i, c in enumerate(chars)}
embedding_dim = 16

# Character embedding matrix
char_embedding = nn.Embedding(len(vocab), embedding_dim)

# Encode a string
text = "data"
ids = torch.tensor([vocab[c] for c in text])  # [3, 0, 19, 0]
embeds = char_embedding(ids)
print(embeds.shape)  # (len(text), embedding_dim)

torch.Size([4, 16])


### **Hybrid / morpheme-aware** 

- Linguistically informed segmentation for some languages.

- Morphologically rich languages (e.g., Korean, Turkish, Finnish) can benefit from morpheme segmentation before embedding. This is usually done with an external morphological analyzer; SentencePiece with a Unigram LM is a data-driven alternative whose pieces only approximate morphemes.

In [None]:
import sentencepiece as spm

# Example: train a Unigram tokenizer.
# Assumes a plain-text corpus exists at data.txt (one sentence per line).
spm.SentencePieceTrainer.train(input='data.txt', model_prefix='morph', vocab_size=8000, model_type='unigram')

sp = spm.SentencePieceProcessor(model_file="morph.model")
text = "transformers are powerful"
print(sp.encode(text, out_type=int))  # subword piece ids (data-driven pieces, not guaranteed morphemes)

### **Contextual embeddings (layered)** 

- initial token embeddings are static vectors; the Transformer produces contextualized vectors after self-attention.

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained Transformer (BERT in this example)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "The bank will not lend money near the river bank."
tokens = tokenizer(text, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    outputs = model(**tokens, output_hidden_states=True)

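# Input-layer embeddings: hidden_states[0] is the output of BERT's embedding module
# (token + position + segment embeddings, after LayerNorm), before any self-attention
input_embeddings = outputs.hidden_states[0]   # shape: (batch, seq_len, hidden_dim)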

# Contextual embeddings from final Transformer layer
contextual_embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_dim)

# Contextual embeddings from all layers
all_layer_embeddings = outputs.hidden_states  # tuple of length num_layers+1

print("Input embeddings:", input_embeddings.shape)
print("Contextual embeddings (last layer):", contextual_embeddings.shape)
print("Number of layers (including embedding layer):", len(all_layer_embeddings))


Input embeddings: torch.Size([1, 13, 768])
Contextual embeddings (last layer): torch.Size([1, 13, 768])
Number of layers (including embedding layer): 13


## **Key properties and desiderata**

- `Dimensionality` ($D$) — determines representational capacity and parameter cost; larger $D$ often helps scale but increases compute.

- `Vocabulary size` ($V$) — trade-off: large $V$ reduces tokenization splits but increases embedding parameters; subword tokenizers balance this.

- `Sparsity & frequency bias` — embeddings reflect training-token frequency; rare tokens can have poorly learned vectors.

- `Semantic structure` — well-trained embeddings encode lexical similarity, morphology, subword composition.

- `Index stability` — token ids must be consistent across training/serving; vocabulary changes break embedding alignment.
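
To make the $V$ versus $D$ trade-off concrete, the embedding table costs $V \times D$ parameters. The sketch below just does that arithmetic; the first configuration uses the `bert-base-uncased` sizes seen earlier ($V = 30522$, $D = 768$), while the byte-level row is a hypothetical comparison point, not a real model.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of parameters in a V x D embedding matrix."""
    return vocab_size * dim

# (V, D) pairs; the second entry is a made-up byte-level configuration.
configs = {
    "bert-base-uncased": (30522, 768),
    "byte-level (hypothetical)": (256, 768),
}

for name, (v, d) in configs.items():
    print(f"{name}: {embedding_params(v, d):,} embedding parameters")
# bert-base-uncased: 23,440,896 embedding parameters
# byte-level (hypothetical): 196,608 embedding parameters
```

The byte-level table is far smaller, but the same text then occupies many more positions in the sequence.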

## **Training strategies & initialization**

- `Random init then train` — most common; embeddings learned end-to-end.

- `Pretrained embeddings` — initialize from external vectors (word2vec, fastText) then fine-tune; less common for large LMs but useful in low-resource setups.

- `Freezing / partial freezing` — freeze embedding rows to regularize or save compute; sometimes freeze rare-token embeddings.

- `Tying / factorization` — tie input and output matrices ($W = E^\top$) or factorize $E = A B$ (low-rank) to reduce params.

- `Adaptive softmax / adaptive input` — partition vocabulary into frequency bands and allocate different embedding sizes per band to save parameters.
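
Two of the strategies above, pretrained initialization and partial freezing, can be sketched as follows. The "pretrained" matrix here is random stand-in data and the choice of rows 7 to 9 as "rare" tokens is arbitrary; masking gradients with a hook is one possible way to freeze rows, not the only one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
V, D = 10, 4
pretrained = torch.randn(V, D)     # stand-in for external vectors (hypothetical data)

# Initialize from existing vectors but keep them trainable.
emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

# Partial freezing: zero the gradient of selected rows (ids 7..9 play the "rare" tokens).
frozen_rows = torch.tensor([7, 8, 9])
grad_mask = torch.ones(V, 1)
grad_mask[frozen_rows] = 0.0
emb.weight.register_hook(lambda grad: grad * grad_mask)

# Plain SGD without weight decay: a zero gradient means the row is not updated.
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
ids = torch.tensor([1, 7, 9])
loss = emb(ids).pow(2).sum()
loss.backward()
opt.step()

print(torch.equal(emb.weight[7], pretrained[7]))  # True: frozen row unchanged
print(torch.equal(emb.weight[1], pretrained[1]))  # False: trainable row updated
```

With weight decay or optimizers that keep momentum state, a zeroed gradient alone may not keep a row fixed, so this approach should be checked against the optimizer actually used.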

## **Regularization & optimization considerations**

- `Embedding dropout` — drop entire token embeddings (or mask positions) to prevent overfitting.

- `Norm constraints` — clip embedding norms or apply layer norm after embeddings to stabilize training.

- `Label smoothing & temperature` — influence gradient signal back to embedding rows through output softmax.

- `Learning rate scheduling` — embeddings usually follow the same schedule as the rest of the model; a separate, lower learning rate is sometimes used when they are initialized from pretrained vectors.
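
Embedding dropout, in the "drop entire token embeddings" sense, can be sketched by masking whole rows of the embedding matrix before the lookup. The helper name `embedding_dropout` and its signature are assumptions for this example; it is not a built-in PyTorch function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_dropout(emb: nn.Embedding, token_ids: torch.Tensor,
                      p: float = 0.1, training: bool = True) -> torch.Tensor:
    """Drop whole rows of the embedding matrix (one mask entry per vocabulary item)."""
    if not training or p == 0.0:
        return emb(token_ids)
    keep_prob = 1.0 - p
    # One Bernoulli draw per token type, broadcast across the embedding dimension.
    row_mask = torch.bernoulli(
        torch.full((emb.num_embeddings, 1), keep_prob, device=emb.weight.device)
    )
    masked_weight = emb.weight * row_mask / keep_prob   # rescale to keep the expectation
    return F.embedding(token_ids, masked_weight, emb.padding_idx)

torch.manual_seed(0)
emb = nn.Embedding(100, 8)
ids = torch.tensor([[5, 9, 5, 42]])
out = embedding_dropout(emb, ids, p=0.5)
print(out.shape)  # torch.Size([1, 4, 8])
```

Because the mask is drawn per vocabulary row, both occurrences of token `5` are either kept or dropped together, which differs from ordinary element-wise dropout applied after the lookup.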

## **Handling rare tokens & OOV**

- `Subwords` — reduces OOVs by decomposing unknown words into known subwords.

- `Byte-level` — truly no-OOV (works with arbitrary input) at cost of longer sequences.

- `Fallback / UNK token` — maps unknowns to a single vector (lossy).

- `Compositional approximations` — build token vectors from character/subword composition using summation, CNNs, or small encoders.
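
The difference between an `UNK` fallback and a byte-level scheme shows up on any word outside the vocabulary. The toy vocabulary and helper names below are hypothetical, chosen only to illustrate the contrast.

```python
# Hypothetical toy word-level vocabulary with an explicit unknown token.
word_vocab = {"<unk>": 0, "transformers": 1, "are": 2, "powerful": 3}

def encode_words(text: str) -> list[int]:
    """Word-level ids; anything outside the vocabulary collapses to <unk>."""
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.lower().split()]

def encode_bytes(text: str) -> list[int]:
    """Byte-level ids in [0, 255]; every string is representable."""
    return list(text.encode("utf-8"))

unseen = "na\u00efve"  # "naive" written with a diaeresis on the i

print(encode_words(f"Transformers are {unseen}"))  # [1, 2, 0]
print(encode_bytes(unseen))                        # [110, 97, 195, 175, 118, 101]
```

The word-level encoder loses the identity of the unseen word, while the byte-level encoder keeps it at the cost of six positions for a five-character word.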

## **Compression & efficiency techniques (practical)**

- `Parameter sharing / tying` — share input & output embeddings.

- `Low-rank factorization` — represent $E \approx A B$ with $A\in\mathbb{R}^{V\times r}$, $B\in\mathbb{R}^{r\times D}$.

- `Quantization` — store (and optionally compute with) embeddings in 8-bit or 4-bit formats, usually with only a small accuracy loss.

- `Product quantization / vector quantization` — compress embeddings to codebooks.

- `Pruning & sparse embeddings` — zero-out low-importance entries or use hashed embeddings.

- `Adaptive input size` — smaller embeddings for rare tokens (adaptive softmax/input).
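
As an illustration of the low-rank factorization $E \approx A B$ listed above, the sketch below pairs a narrow lookup table with a linear projection and compares parameter counts. The class name `FactorizedEmbedding` and the sizes `V`, `r`, `D` are assumptions for this example.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """E ~= A B, with A a (V x r) lookup table and B an (r x D) projection."""

    def __init__(self, vocab_size: int, rank: int, dim: int):
        super().__init__()
        self.A = nn.Embedding(vocab_size, rank)
        self.B = nn.Linear(rank, dim, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(token_ids))

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

V, r, D = 30000, 128, 768          # assumed sizes
full = nn.Embedding(V, D)
fact = FactorizedEmbedding(V, r, D)

print(count_params(full))  # 23040000  (V * D)
print(count_params(fact))  # 3938304   (V * r + r * D)
print(fact(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 768])
```

The saving grows with $V$, since the cost becomes $V r + r D$ instead of $V D$; the price is that all token vectors lie in an $r$-dimensional subspace.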

## **Output-layer design and weight tying (practical impact)**

- `Untied output` — separate output projection $W$ allows different geometry between input and output spaces.

- `Tied weights` ($W = E^\top$) — reduces parameters and empirically improves perplexity and calibration in many setups (forces input and output geometry alignment).

- `Bias term` — adding bias $b$ per token helps model token priors (frequency).
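
One way to see how the bias $b$ models token priors: if $b$ is set to the log of the unigram frequencies and the hidden state carries no signal, the softmax returns exactly those frequencies. The counts below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 5, 8
counts = torch.tensor([50.0, 30.0, 10.0, 7.0, 3.0])   # hypothetical token counts
unigram = counts / counts.sum()

head = nn.Linear(D, V, bias=True)                     # output projection with per-token bias
with torch.no_grad():
    head.bias.copy_(unigram.log())                    # b = log p(token)

h = torch.zeros(D)                                    # no signal from h: logits reduce to b
p = F.softmax(head(h), dim=-1)
print(torch.allclose(p, unigram))                     # True
```

With a non-zero `h`, the projection term shifts the logits away from this prior, so the bias acts as a frequency baseline that the context then adjusts.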

## **Diagnostics and probing**

- `Nearest-neighbor checks` — verify semantically similar tokens are near in embedding space.

- `PCA / t-SNE visualization` — inspect clusters (POS, subword patterns).

- `Frequency vs quality plots` — check embedding norm / gradient magnitude vs token frequency.

- `Probing tasks` — lexical tasks (POS, morphology) to see what lexical info embeddings encode.

- `Embedding collapse detection` — watch for many embedding rows collapsing to similar vectors (often sign of bad LR or tokenization mismatch).
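
Nearest-neighbor checks and collapse detection both reduce to cosine similarities between rows of $E$. The helpers below are a sketch on synthetic matrices; the function names and the near-duplicate planted at row 1 are assumptions for the demonstration. For a real model, a detached copy of the embedding weights (e.g. `embedding.weight.detach()`) would be passed in.

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(E: torch.Tensor, token_id: int, k: int = 5) -> torch.Tensor:
    """Ids of the k rows most cosine-similar to E[token_id], excluding itself."""
    normed = F.normalize(E, dim=-1)
    sims = normed @ normed[token_id]
    sims[token_id] = float("-inf")          # never return the query token
    return sims.topk(k).indices

def mean_pairwise_cosine(E: torch.Tensor) -> float:
    """Average off-diagonal cosine similarity; values near 1 suggest collapse."""
    normed = F.normalize(E, dim=-1)
    sims = normed @ normed.T
    n = E.shape[0]
    return ((sims.sum() - sims.diagonal().sum()) / (n * (n - 1))).item()

torch.manual_seed(0)
E = torch.randn(1000, 64)
E[1] = E[0] + 0.01 * torch.randn(64)        # plant a near-duplicate of row 0

# Synthetic "collapsed" table: every row is one vector plus small noise.
collapsed = torch.randn(1, 64).repeat(1000, 1) + 0.01 * torch.randn(1000, 64)

print(nearest_neighbors(E, 0, k=3))
print(mean_pairwise_cosine(E), mean_pairwise_cosine(collapsed))
```

The full $V \times V$ similarity matrix is only practical for modest vocabularies; for large $V$, sampling a subset of rows is a reasonable substitute.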

## **Practical guidelines & best practices (concise)**

- Use subword tokenization (BPE/WordPiece/Unigram) for most languages; prefer byte-level for robustness to noisy inputs.

- Choose embedding dimension $D$ consistent with model size: in standard multi-head attention $D$ must be divisible by the number of heads, and $D$ is usually scaled up together with depth.

- Tie weights between input and output for constrained budgets and often better calibration.

- For extremely large $V$, adopt adaptive input/softmax or factorization to save memory.

- For long-context extrapolation, keep the tokenizer fixed between training and serving, and remember that the positional strategy interacts with tokenization: byte-level inputs produce longer sequences for the same text, so the positional scheme must cover that length.

- Monitor rare-token gradients and consider upsampling or using compositional encoders if many rare tokens exist.

## **Modern variations & research directions (short list)**

- `Subword regularization / sampling` — robustness by sampling alternate tokenizations during training.

- `Mixture-of-embeddings` — combine multiple embedding sources (lexical + morphological).

- `Explicit lexical priors` — incorporate POS, lemma, or morphological features into embeddings.

- `Cross-lingual shared vocabularies` — multilingual models share embeddings across languages; requires careful tokenization and script handling.

- `Sparse / retrieval-augmented embeddings` — embeddings stored in external index and retrieved when needed.
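
Subword regularization can be illustrated on a toy scale: enumerate every segmentation of a word under a small vocabulary and sample one with probability proportional to $P(\text{segmentation})^{\alpha}$. The piece log-probabilities below are invented, and exhaustive enumeration is only feasible for short strings; real tokenizers use lattice-based sampling instead.

```python
import math
import random

# Hypothetical unigram piece log-probabilities (not from a trained model).
piece_logp = {"un": -2.0, "do": -2.5, "undo": -4.0,
              "u": -5.0, "n": -5.0, "d": -5.0, "o": -5.0}

def segmentations(word: str) -> list[list[str]]:
    """All ways to split `word` into pieces from the vocabulary."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in piece_logp:
            results += [[piece] + rest for rest in segmentations(word[end:])]
    return results

def sample_segmentation(word: str, alpha: float, rng: random.Random) -> list[str]:
    """Sample a segmentation with probability proportional to P(segmentation)^alpha."""
    candidates = segmentations(word)
    weights = [math.exp(alpha * sum(piece_logp[p] for p in seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("undo", alpha=0.5, rng=rng))
```

Smaller `alpha` flattens the distribution and yields more varied segmentations; larger `alpha` concentrates on the most probable one.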

## **Short summary (one-line)**

Token embeddings convert discrete tokens to continuous vectors; design choices (tokenizer, $V$, $D$, tying, compression) crucially trade off expressivity, generalization, and computational cost.