# Day 7: Tensors, Broadcasting & torch.Tensor Deep Dive

**Building LLMs from Scratch** — Following Andrej Karpathy's makemore lectures.

---

## 1. Introduction

This notebook covers PyTorch tensors: their shapes, broadcasting, and the operations that neural nets are built on. Tensors are the fundamental data structure, covering scalars, vectors, matrices, and higher-dimensional arrays. Broadcasting lets us write concise, efficient code without explicit loops. Both come up constantly when building and debugging neural networks.

## 2. Tensor Basics

Create scalars, vectors, matrices, and 3D tensors. Inspect `.shape`, `.dtype`, and `.device`.

In [None]:
import torch

# Scalar (0-dimensional)
s = torch.tensor(3.14)
print("Scalar:", s, "| shape:", s.shape, "| dtype:", s.dtype)

# Vector (1D)
v = torch.tensor([1.0, 2.0, 3.0])
print("Vector:", v, "| shape:", v.shape)

# Matrix (2D)
M = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print("Matrix shape:", M.shape, "→", M.shape[0], "rows ×", M.shape[1], "cols")

# 3D tensor (e.g., batch of matrices)
T = torch.randn(2, 3, 4)
print("3D tensor shape:", T.shape)

# Device (CPU by default)
print("Device:", M.device)
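As a small extension of the cell above, the sketch below shows how `torch.tensor` infers `dtype` from Python literals and how `.ndim` relates to `.shape`. The variable names are illustrative, the `float32` result assumes the default float dtype has not been changed, and the device move falls back to CPU when CUDA is unavailable.

```python
import torch

# dtype is inferred from the Python literals
ints = torch.tensor([1, 2, 3])        # Python ints   -> torch.int64
floats = torch.tensor([1.0, 2.0])     # Python floats -> torch.float32 (default float dtype)
print(ints.dtype, floats.dtype)

# .ndim is the number of dimensions, i.e. len(.shape)
T = torch.zeros(2, 3, 4)
print(T.ndim, len(T.shape), T.numel())  # 3 dims, 2 * 3 * 4 = 24 elements

# Move to GPU only if one is available; otherwise stay on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
T_dev = T.to(device)
print("Device:", T_dev.device)
```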

## 3. Broadcasting Rules

Broadcasting automatically expands smaller tensors to match larger ones for element-wise ops. Rules: (1) align shapes from the right, (2) missing leading dims are treated as 1, (3) two dimensions are compatible if they are equal or one of them is 1, in which case the size-1 dimension is stretched to match the other.

In [None]:
import torch

# Vector + scalar: scalar broadcasts to every element
v = torch.tensor([1.0, 2.0, 3.0])
print("Vector + scalar:", v + 10)

# Matrix + row vector: row broadcasts to every row
M = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
row = torch.tensor([100.0, 200.0, 300.0])  # shape (3,)
print("Matrix + row vector:")
print(M + row)

# Matrix + column vector: column broadcasts to every column
col = torch.tensor([[10.0], [20.0]])  # shape (2, 1)
print("Matrix + column vector:")
print(M + col)
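One way to picture what broadcasting does: the smaller operand behaves as if it had been expanded to the full shape first. The sketch below makes that explicit with `Tensor.expand`, which returns a view rather than copying data. It reuses the values from the cell above; the `*_full` names are just for illustration.

```python
import torch

M = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # (2, 3)
row = torch.tensor([100.0, 200.0, 300.0])               # (3,)
col = torch.tensor([[10.0], [20.0]])                    # (2, 1)

# Explicitly expand each operand to the full (2, 3) shape
row_full = row.expand(2, 3)   # (3,)   -> (2, 3): the row is repeated down the rows
col_full = col.expand(2, 3)   # (2, 1) -> (2, 3): the column is repeated across the columns

# Broadcasting gives the same result as adding the expanded tensors
print(torch.equal(M + row, M + row_full))
print(torch.equal(M + col, M + col_full))
```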

In [None]:
# Broadcasting FAILS: incompatible shapes
try:
    A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # (2, 2)
    B = torch.tensor([1.0, 2.0, 3.0])          # (3,) — incompatible!
    C = A + B
except RuntimeError as e:
    print("Expected error:", e)
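To check the rules without allocating any tensors, `torch.broadcast_shapes` computes the broadcast result shape directly. This is a minimal sketch that assumes a PyTorch version recent enough to include that function (1.8 or later); the `compatible` flag is only there for illustration.

```python
import torch

# Compatible: (2, 3) with (3,) -> the missing leading dim is treated as 1
print(torch.broadcast_shapes((2, 3), (3,)))

# Compatible: (2, 1) with (1, 3) -> each size-1 dim stretches to match
print(torch.broadcast_shapes((2, 1), (1, 3)))

# Incompatible: trailing dims are 2 and 3, and neither is 1
try:
    torch.broadcast_shapes((2, 2), (3,))
    compatible = True
except RuntimeError as e:
    compatible = False
    print("Not broadcastable:", e)
```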

## 4. Row-wise Normalization

Normalize each row to sum to 1. Use `keepdim=True` so the sum has shape `(n, 1)` and broadcasts correctly when dividing.

In [None]:
import torch

torch.manual_seed(42)
M = torch.rand(3, 4)
print("Original matrix:")
print(M)
print("Row sums (before norm):", M.sum(1))

# With keepdim=True: sum has shape (3, 1) → broadcasts for division
row_sums = M.sum(1, keepdim=True)
M_norm = M / row_sums
print("\nNormalized (each row sums to 1):")
print(M_norm)
print("Row sums (after norm):", M_norm.sum(1))
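With a non-square matrix such as the `(3, 4)` one above, dropping `keepdim=True` fails loudly because `(3, 4)` and `(3,)` do not broadcast. With a square matrix the shapes happen to line up, so the division runs but divides along the wrong axis. The sketch below illustrates this with a hand-picked `(3, 3)` matrix; the names `right` and `wrong` are illustrative.

```python
import torch

M = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 10.0]])

right = M / M.sum(1, keepdim=True)   # (3, 3) / (3, 1): row i is divided by row sum i
wrong = M / M.sum(1)                 # (3, 3) / (3,):   column j is divided by row sum j

print("keepdim=True row sums:", right.sum(1))   # every row sums to 1
print("no keepdim row sums:  ", wrong.sum(1))   # rows no longer sum to 1
```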

## 5. One-Hot Encoding

Use `torch.nn.functional.one_hot` to encode integer class indices as a binary matrix. Each index becomes a row of length `num_classes` with a 1 at that position and 0 elsewhere.

In [None]:
import torch
import torch.nn.functional as F

# Encode indices 0, 2, 1 with vocabulary size 4
indices = torch.tensor([0, 2, 1])
one_hot = F.one_hot(indices, num_classes=4)
print("Indices:", indices)
print("One-hot encoding (each row = one sample):")
print(one_hot)
print("Shape:", one_hot.shape)
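Two details are worth checking here: `F.one_hot` returns an integer (`int64`) tensor, and `argmax` along each row recovers the original indices. The sketch below also casts to float before a matrix multiply, since multiplying an `int64` tensor with `float32` weights raises a dtype mismatch error. The weight matrix `W` is hypothetical and only there to show the cast.

```python
import torch
import torch.nn.functional as F

indices = torch.tensor([0, 2, 1])
one_hot = F.one_hot(indices, num_classes=4)

# one_hot returns integers (int64), not floats
print(one_hot.dtype)

# argmax along each row recovers the original indices
recovered = one_hot.argmax(dim=1)
print(torch.equal(recovered, indices))

# Cast to float before multiplying with float weights
W = torch.randn(4, 2)            # hypothetical weight matrix, for illustration only
out = one_hot.float() @ W        # (3, 4) @ (4, 2) -> (3, 2)
print(out.shape)
```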

## 6. Matrix Multiply for Neural Nets

The `xenc @ W` pattern: one-hot encoded input times a weight matrix gives the logits. This is the forward pass of a linear layer without a bias term.

In [None]:
import torch
import torch.nn.functional as F

torch.manual_seed(42)

# 5 samples, each is an index in vocab of size 27 (e.g., characters)
ix = torch.tensor([0, 5, 13, 0, 1])
xenc = F.one_hot(ix, num_classes=27).float()  # (5, 27)

# Weight matrix: 27 inputs → 27 outputs (logits per class)
W = torch.randn(27, 27)

# Forward: xenc @ W = logits
logits = xenc @ W  # (5, 27) @ (27, 27) → (5, 27)
print("xenc shape:", xenc.shape)
print("W shape:", W.shape)
print("logits shape:", logits.shape)
print("\nLogits (first 2 rows):")
print(logits[:2])
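Because each row of `xenc` has a single 1, multiplying by `W` simply picks out one row of `W` per sample. The sketch below checks that against direct indexing with `W[ix]`; the names `logits_matmul` and `logits_lookup` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
ix = torch.tensor([0, 5, 13, 0, 1])
W = torch.randn(27, 27)

logits_matmul = F.one_hot(ix, num_classes=27).float() @ W   # (5, 27)
logits_lookup = W[ix]                                        # index rows directly, also (5, 27)

print(torch.allclose(logits_matmul, logits_lookup))
# Samples 0 and 3 share index 0, so they get identical logits
print(torch.equal(logits_lookup[0], logits_lookup[3]))
```

This row-selection view is one way to see why an embedding lookup and a one-hot matrix multiply compute the same thing.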

## 7. Softmax

Implement softmax manually: `counts = logits.exp(); probs = counts / counts.sum(1, keepdim=True)`. Converts logits to probabilities that sum to 1 per row.

In [None]:
import torch

logits = torch.tensor([[1.0, 2.0, 3.0], [0.5, 1.0, 0.1]])

# Softmax: exp(logits) / sum(exp(logits)) per row
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)

print("Logits:")
print(logits)
print("\nProbabilities (each row sums to 1):")
print(probs)
print("Row sums:", probs.sum(1))

# Compare with the built-in torch.softmax
probs_ref = torch.softmax(logits, dim=1)
print("\nMatches torch.softmax:", torch.allclose(probs, probs_ref))
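The manual version works for small logits, but `exp()` overflows `float32` for large ones. A common remedy is to subtract each row's maximum before exponentiating, which leaves the probabilities unchanged. The sketch below is one way to write that; `stable_softmax` is a hypothetical helper name, not a PyTorch function.

```python
import torch

def stable_softmax(logits):
    # Subtracting the row max does not change the result but keeps exp() finite
    shifted = logits - logits.max(dim=1, keepdim=True).values
    counts = shifted.exp()
    return counts / counts.sum(dim=1, keepdim=True)

big = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive version: exp(1000) overflows float32 to inf, and inf / inf is nan
naive = big.exp() / big.exp().sum(1, keepdim=True)
print("Naive: ", naive)
print("Stable:", stable_softmax(big))
print("Matches torch.softmax:", torch.allclose(stable_softmax(big), torch.softmax(big, dim=1)))
```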

## 8. Common Pitfalls

Watch out for: missing `keepdim=True`, integer vs float tensors, and in-place operations.

In [None]:
import torch

# Pitfall 1: Missing keepdim=True — wrong shape for broadcasting
M = torch.rand(3, 4)
s_wrong = M.sum(1)       # shape (3,) — can cause subtle bugs
s_right = M.sum(1, keepdim=True)  # shape (3, 1) — broadcasts correctly
print("Without keepdim:", s_wrong.shape)
print("With keepdim:", s_right.shape)

# Pitfall 2: Integer tensors — many ops require float
x_int = torch.tensor([1, 2, 3])
# x_int.mean()  # would fail: mean needs a floating-point (or complex) dtype
x_float = x_int.float()
print("\nInteger .float():", x_float.dtype)

# Pitfall 3: In-place ops can break autograd, so use them sparingly
a = torch.tensor([1.0, 2.0], requires_grad=True)
# a.add_(1)  # in-place: raises RuntimeError on a leaf tensor that requires grad
b = a + 1   # out-of-place: safe
print("\nOut-of-place is safer for autograd.")
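To make the in-place pitfall concrete, the sketch below triggers the error on a leaf tensor and then shows the usual exception: a manual gradient step wrapped in `torch.no_grad()`. The step size `0.1` and the `raised` flag are illustrative choices.

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)

# In-place op on a leaf tensor that requires grad raises an error
try:
    a.add_(1)
    raised = False
except RuntimeError as e:
    raised = True
    print("In-place error:", e)

# A manual gradient step is the usual exception: do it under torch.no_grad()
loss = (a ** 2).sum()
loss.backward()              # a.grad = 2 * a = [2., 4.]
with torch.no_grad():
    a -= 0.1 * a.grad        # a becomes [0.8, 1.6]
print(a)
```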

---

**Blog:** [Day 7 — Tensors, Broadcasting & torch.Tensor Deep Dive](https://omkarray.com/llm-day7.html)

**Prev:** [Day 6 — Bigram Model](llm_day06_bigram.ipynb) · **Next:** [Day 8 — MLP Language Model](llm_day08_mlp_lm.ipynb)