## Lab 4: From Matrices to Tensors
Goal: build intuition for tensors (3D+ arrays), why they matter for AI/ML, and how shapes map to LLM workloads.

### Terminology
- Scalar (0D): single number. Think thermostat reading.
- Vector (1D): ordered list. Think a checklist of counts.
- Matrix (2D): rows x columns. Think spreadsheet table.
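- Tensor (3D+): stacked matrices. Think a shelf of spreadsheets where depth encodes batch or time. (Strictly, scalars, vectors, and matrices are tensors of rank 0, 1, and 2; this lab reserves "tensor" for 3D and up.)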

### What are Embeddings?

**Embeddings** are dense numeric vectors that represent tokens in a continuous space where similar meanings cluster together.

**Analogy:** Think of a city map with coordinates. Each location (token) has GPS coordinates (embedding vector). Nearby places (similar words) have similar coordinates. "King" and "Queen" are close; "King" and "Banana" are far apart.

**How it works:**
1. Each token ID maps to a vector of numbers (typically 768, 1024, or 4096 dimensions).
2. These vectors are learned during training to capture semantic relationships.
3. Similar tokens have similar vectors (measured by cosine similarity or dot product).

**Example:**
```
Token: "security" → Embedding: [0.23, -0.45, 0.87, ..., 0.12]  (768 dims)
Token: "safety"   → Embedding: [0.21, -0.43, 0.89, ..., 0.15]  (close to security!)
Token: "cloud"    → Embedding: [-0.67, 0.34, -0.21, ..., 0.45] (farther away)
```
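
To make "close" and "farther away" concrete, here is a minimal sketch of cosine similarity. It assumes hypothetical 4-dimensional vectors built only from the numbers visible in the example above; real embeddings have hundreds of learned dimensions, and `cosine_similarity` is an illustrative helper, not a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dim toy vectors (assumption): not real learned embeddings.
security = np.array([0.23, -0.45, 0.87, 0.12])
safety = np.array([0.21, -0.43, 0.89, 0.15])
cloud = np.array([-0.67, 0.34, -0.21, 0.45])

sim_close = cosine_similarity(security, safety)
sim_far = cosine_similarity(security, cloud)

print(f"security vs safety: {sim_close:.3f}")
print(f"security vs cloud:  {sim_far:.3f}")
```

The first score should come out near 1 and the second noticeably lower, which is what "similar tokens have similar vectors" means in practice.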

**Why embeddings matter:**
- They capture meaning: "cat" and "kitten" have similar embeddings.
- They enable math on words: `king - man + woman ≈ queen`
- They're the input to attention mechanisms and neural network layers.
- Higher dimensions can encode more nuanced representations, at the cost of more memory and compute.
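
The "math on words" idea can be sketched with hand-crafted vectors. Everything below is an assumption made for illustration: the 2-dimensional vectors are chosen by hand so that one axis loosely means "royalty" and the other "gender", and `nearest` is a hypothetical helper. Learned embeddings only approximate this behavior.

```python
import numpy as np

# Hand-crafted toy vectors (assumption): axis 0 = "royalty", axis 1 = "gender".
toy_embeddings = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "banana": np.array([-1.0, 0.0]),
}

def nearest(query, table, exclude=()):
    """Return the word whose vector has the highest cosine similarity to query."""
    best_word, best_sim = None, -np.inf
    for word, vec in table.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Vector arithmetic: remove "man", add "woman", then look up the closest word.
query = toy_embeddings["king"] - toy_embeddings["man"] + toy_embeddings["woman"]
analogy = nearest(query, toy_embeddings, exclude=("king", "man", "woman"))
print("king - man + woman ≈", analogy)
```

Excluding the three input words from the search is a common convention for analogy lookups, since the query usually stays close to its own inputs.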

### What are Tokens?

**Tokens** are the basic units of text that AI models process. Think of tokenization as breaking sentences into digestible pieces—like cutting a sandwich into bite-sized chunks.

**Analogy:** Imagine you're organizing a library. Instead of processing entire books at once, you catalog them by chapters, pages, or even individual words. Tokens work similarly: they break text into manageable units.

**Examples:**
- **Word-level tokens:** "Hello world" → `["Hello", "world"]`
- **Subword tokens:** "unhappiness" → `["un", "happiness"]` or `["unhap", "piness"]`
- **Character tokens:** "AI" → `["A", "I"]`

Modern LLMs like GPT use subword tokenization (e.g., Byte-Pair Encoding). The sentence "CloudSecurity rocks!" might become `["Cloud", "Security", " rocks", "!"]`. Each token gets a unique ID from the model's vocabulary (dictionary).
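
As a rough sketch of the text-to-IDs step, the snippet below uses greedy longest-match against a tiny hypothetical vocabulary. This is a simplification, not Byte-Pair Encoding itself: real BPE tokenizers learn merge rules from data. The names `toy_vocab`, `greedy_tokenize`, and the `<unk>` fallback are assumptions for illustration.

```python
# Hypothetical subword vocabulary (assumption): token string -> token ID.
toy_vocab = {"Cloud": 0, "Security": 1, " rocks": 2, "!": 3, "<unk>": 4}

def greedy_tokenize(text, vocab, unk="<unk>"):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches here: emit an unknown token for one character.
            tokens.append(unk)
            i += 1
    return tokens

tokens = greedy_tokenize("CloudSecurity rocks!", toy_vocab)
toy_ids = [toy_vocab[t] for t in tokens]
print(tokens)
print(toy_ids)
```

The list of IDs, not the strings, is what gets passed on to the embedding lookup.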

**Why tokens matter:**
- Models don't understand raw text—they need numeric representations.
- Token limits (e.g., "8K context window") refer to the max number of tokens, not words.
- Tokenization affects how models understand rare words, code, or multilingual text.

In [3]:
# Setup
import numpy as np
np.set_printoptions(suppress=True, precision=3)

def describe(name, x):
    print(f"{name}:\n{x}")
    print(f"shape: {x.shape}, dtype: {x.dtype}\n")

print("NumPy version:", np.__version__)

NumPy version: 2.4.0


In [4]:
# Scalars, vectors, matrices, tensors
import numpy as np
scalar = np.array(42)
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
tensor3d = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # shape: batches=2, rows=3, cols=4

describe('scalar', scalar)
describe('vector', vector)
describe('matrix', matrix)
describe('tensor3d', tensor3d)

scalar:
42
shape: (), dtype: int64

vector:
[1 2 3]
shape: (3,), dtype: int64

matrix:
[[1 2 3]
 [4 5 6]]
shape: (2, 3), dtype: int64

tensor3d:
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
shape: (2, 3, 4), dtype: int64



### Tensors in LLMs
- Each token is mapped to an embedding vector: shape `(embed_dim,)`.
- A sequence of tokens becomes a 2D array: `(seq_len, embed_dim)`.
- Multiple sequences in a batch become a 3D tensor: `(batch_size, seq_len, embed_dim)`.
- Attention uses matrices (Q, K, V) derived from this tensor; understanding shapes prevents mismatches and explains limits like context windows.
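
The three shape levels listed above can be checked by indexing into a toy tensor. This is a small sketch with arbitrary sizes and placeholder values (a plain `arange`, not real embeddings).

```python
import numpy as np

batch_size, seq_len, embed_dim = 2, 3, 4  # arbitrary toy sizes (assumption)
x = np.arange(batch_size * seq_len * embed_dim, dtype=float)
x = x.reshape(batch_size, seq_len, embed_dim)

one_sequence = x[0]        # one item of the batch: (seq_len, embed_dim)
one_token = x[0, 1]        # one token of that sequence: (embed_dim,)
first_tokens = x[:, 0, :]  # first token of every sequence: (batch_size, embed_dim)

print("batch:       ", x.shape)
print("one sequence:", one_sequence.shape)
print("one token:   ", one_token.shape)
print("first tokens:", first_tokens.shape)
```

Each index you supply removes one axis from the left, which is why the batch axis conventionally comes first.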

In [5]:
# Map a toy sequence to embeddings (2D)
vocab = {'hello': 0, 'world': 1, 'ai': 2, 'cloud': 3}
embed_dim = 4
embedding_table = np.random.randn(len(vocab), embed_dim)

sequence = ['hello', 'ai', 'world']
token_ids = [vocab[t] for t in sequence]
seq_embeddings = embedding_table[token_ids]  # shape (seq_len, embed_dim)

print('Sequence tokens:', sequence)
describe('seq_embeddings', seq_embeddings)  # prints values, shape, and dtype

Sequence tokens: ['hello', 'ai', 'world']
seq_embeddings:
[[ 0.312 -0.721 -0.357  0.281]
 [-1.241  2.169 -1.149  0.253]
 [-1.299 -0.166 -1.589 -1.584]]
shape: (3, 4), dtype: float64


In [6]:
# Stack multiple sequences into a batch (3D)
batch_sequences = [
    ['hello', 'world'],
    ['ai', 'cloud', 'world'],
    ['hello', 'ai']
]

# Pad sequences to the same length for batching
max_len = max(len(seq) for seq in batch_sequences)
pad_token = 'hello'  # reuses a real token for simplicity; real models use a dedicated pad token plus an attention mask
padded = [seq + [pad_token] * (max_len - len(seq)) for seq in batch_sequences]
token_id_batch = [[vocab[t] for t in seq] for seq in padded]
tensor_batch = embedding_table[token_id_batch]  # shape (batch_size, max_len, embed_dim)

print('Padded sequences:', padded)
describe('tensor_batch', tensor_batch)
print('tensor_batch sample for item 0:', tensor_batch[0])

Padded sequences: [['hello', 'world', 'hello'], ['ai', 'cloud', 'world'], ['hello', 'ai', 'hello']]
tensor_batch:
[[[ 0.312 -0.721 -0.357  0.281]
  [-1.299 -0.166 -1.589 -1.584]
  [ 0.312 -0.721 -0.357  0.281]]

 [[-1.241  2.169 -1.149  0.253]
  [ 1.459  0.803 -1.147  1.253]
  [-1.299 -0.166 -1.589 -1.584]]

 [[ 0.312 -0.721 -0.357  0.281]
  [-1.241  2.169 -1.149  0.253]
  [ 0.312 -0.721 -0.357  0.281]]]
shape: (3, 3, 4), dtype: float64

tensor_batch sample for item 0: [[ 0.312 -0.721 -0.357  0.281]
 [-1.299 -0.166 -1.589 -1.584]
 [ 0.312 -0.721 -0.357  0.281]]


### Attention shape walkthrough
From a 3D tensor `(batch, seq, embed)` we compute Q, K, V via matrix multiplies:
- Q = X @ Wq, K = X @ Wk, V = X @ Wv where W* are `(embed, head_dim)`.
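- Attention scores use `Q @ K.T / sqrt(head_dim)` per sequence, giving a `(seq, seq)` matrix; the inner dimension `head_dim` must match.
- Softmax runs over the last axis (the keys), so each token's attention weights sum to 1 across the positions in its own sequence.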

In [None]:
# Tiny attention demo on one sequence from the batch
seq_example = tensor_batch[0]  # shape (seq_len, embed_dim)
seq_len, embed_dim = seq_example.shape
head_dim = 3

Wq = np.random.randn(embed_dim, head_dim)
Wk = np.random.randn(embed_dim, head_dim)
Wv = np.random.randn(embed_dim, head_dim)

Q = seq_example @ Wq  # (seq_len, head_dim)
K = seq_example @ Wk  # (seq_len, head_dim)
V = seq_example @ Wv  # (seq_len, head_dim)
scores = Q @ K.T / np.sqrt(head_dim)  # (seq_len, seq_len)
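exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))  # subtract row max for numerical stability
attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)  # softmax over keys; each row sums to 1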
context = attention_weights @ V  # (seq_len, head_dim)

describe('Q', Q)
describe('scores', scores)
describe('attention_weights', attention_weights)
describe('context', context)
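
The demo above handles one sequence. Below is a hedged sketch of the same computation over a whole `(batch, seq, embed)` tensor, with a padding mask so that padded key positions receive zero weight. The random inputs, the mask values, and the sequence lengths are assumptions for illustration; this covers one attention head only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim, head_dim = 2, 3, 4, 3  # toy sizes (assumption)

X = rng.standard_normal((batch_size, seq_len, embed_dim))
Wq = rng.standard_normal((embed_dim, head_dim))
Wk = rng.standard_normal((embed_dim, head_dim))
Wv = rng.standard_normal((embed_dim, head_dim))

# True = real token, False = padding. Hypothetical lengths: 3 and 2.
key_mask = np.array([[True, True, True],
                     [True, True, False]])

# Matmul broadcasts the 2D weights over the batch axis.
Q = X @ Wq  # (batch, seq, head_dim)
K = X @ Wk  # (batch, seq, head_dim)
V = X @ Wv  # (batch, seq, head_dim)

# Swap the last two axes of K so each batch item gets its own Q @ K.T.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)  # (batch, seq, seq)

# Broadcast the mask over the query axis; padded keys get -inf before softmax.
scores = np.where(key_mask[:, None, :], scores, -np.inf)

# Numerically stable softmax over the key axis.
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

context = weights @ V  # (batch, seq, head_dim)

print("scores :", scores.shape)
print("weights:", weights.shape)
print("context:", context.shape)
```

Because `exp(-inf)` is exactly 0, the padded position in the second sequence contributes nothing to any row of `context`.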

### Why shapes matter
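- Context window: the maximum `seq_len` caps how many tokens each position can attend to; truncation or padding changes tensor shapes.
- Mixed data: padding positions, or unrelated sequences packed into one row, can leak signals into attention if masking is wrong.
- Performance: activations grow linearly with batch size and `seq_len`, while the attention score matrix grows with `seq_len` squared, driving up memory and latency.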
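
A back-of-the-envelope sketch of the performance point: the snippet estimates raw tensor sizes for a few sequence lengths. The sizes, the 4-byte (float32) element width, and the helper `tensor_bytes` are assumptions; real models multiply these figures by the number of layers and attention heads and often use other dtypes.

```python
import math

def tensor_bytes(shape, bytes_per_element=4):
    """Raw storage for a dense tensor of the given shape (float32 by default)."""
    return math.prod(shape) * bytes_per_element

batch_size, embed_dim = 8, 768  # illustrative values (assumption)

for seq_len in (1024, 2048, 4096):
    activation_bytes = tensor_bytes((batch_size, seq_len, embed_dim))
    score_bytes = tensor_bytes((batch_size, seq_len, seq_len))  # one head, one layer
    print(f"seq_len={seq_len:>5}  "
          f"activations={activation_bytes / 2**20:8.1f} MiB  "
          f"scores={score_bytes / 2**20:8.1f} MiB")
```

Doubling `seq_len` doubles the activation tensor but quadruples the score matrix, which is why long contexts become expensive quickly.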

## Student TODO: Build your own tensor
Create a batch of at least 2 sequences with 4-6 tokens each. Use a custom vocab and embedding dimension.

In [None]:
# TODO: define your vocab and embeddings
# vocab = {...}
# embed_dim = ...
# embedding_table = ...

# TODO: define your sequences and batch them
# sequences = [...]
# pad and convert to ids
# stacked = ...

# TODO: describe shapes and maybe compute a simple aggregate (mean across tokens)
# describe('my_tensor', stacked)
# token_mean = stacked.mean(axis=1)  # average per sequence
# print(token_mean)

## Student TODO: Debug a shape mismatch
Simulate an error by multiplying matrices with incompatible shapes, then fix it by reshaping or adjusting dims.

In [None]:
# TODO: create two arrays with mismatched shapes, observe the error
# a = np.random.randn(2, 3)
# b = np.random.randn(4, 2)
# try:
#     a @ b  # (2, 3) @ (4, 2): inner dims 3 and 4 do not match
# except ValueError as e:
#     print('error:', e)

# TODO: fix the shapes so the inner dimensions match (e.g., swap the operand order or change dimensions)
# b.T alone is not enough here: (2, 3) @ (2, 4) still mismatches
# result = b @ a  # (4, 2) @ (2, 3) -> (4, 3)
# describe('result', result)

---
**Takeaways**
- Tensors generalize vectors and matrices to handle batches, time, and features together.
- LLMs rely on 3D tensors `(batch, seq, embed)` for embeddings and attention; shape fluency prevents bugs.
- Context windows, padding, and masking are shape questions first, algorithm questions second.