
# PyTorch Broadcasting — Problem Set

**Tip:** Broadcasting compares dimensions from right to left. Two dimensions are compatible if they are equal or if either is 1. If one tensor has fewer dimensions than the other, PyTorch implicitly prepends dimensions of size 1 to its shape until both have the same number of dimensions.
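
To make the rule concrete, here is a minimal pure-Python sketch that applies it to two shapes. The helper name `broadcast_shape` is hypothetical (it is not part of PyTorch); it only mirrors the rule stated in the tip and does not touch any tensors.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the right-to-left rule to two shapes.

    Returns the broadcast shape, or raises ValueError if any pair of
    dimensions is incompatible.
    """
    result = []
    # Walk both shapes from the last dimension; a missing dimension counts as 1.
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or da == 1 or db == 1:
            # The size-1 side stretches to match the other side.
            result.append(db if da == 1 else da)
        else:
            raise ValueError(f"dimensions {da} and {db} are incompatible")
    return tuple(reversed(result))

print(broadcast_shape((32, 3), (1, 3)))   # (32, 3)
print(broadcast_shape((10, 1, 5), (5,)))  # (10, 1, 5)
```
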

In [3]:
# (Optional) If running locally, ensure PyTorch is installed.
# !pip install torch -q

import torch

def shape(t): 
    return tuple(t.shape)

torch.__version__ if 'torch' in globals() else 'PyTorch not installed in this environment.'


'2.2.2+cu121'


## Applying to Neural Nets
![](./dimensioned_nn.png)<br><br>


In [6]:
X = torch.ones(32, 2)
print(f'X.shape = {X.shape}')
# print(X)

w = torch.ones(2, 3) * 2
print(f'w.shape = {w.shape}')
# print(w)

b = torch.ones(1, 3) * 3
print(f'b.shape = {b.shape}')
# print(b)

# Matrix-multiply X and w: (32, 2) @ (2, 3) -> (32, 3)
Y = X @ w
print(f'Y.shape = {Y.shape}')
# print(Y)

# Add b to Y
Z = Y + b
print(f'Z.shape = {Z.shape}')
# print(Z)
# Note: b (1, 3) is automatically broadcast to match the shape of Y (32, 3)


X.shape = torch.Size([32, 2])
w.shape = torch.Size([2, 3])
b.shape = torch.Size([1, 3])
Y.shape = torch.Size([32, 3])
Z.shape = torch.Size([32, 3])



## Part A — Can these two tensors broadcast? If **yes**, what is the result shape?



**A1.** `A: (5, 1, 7)` and `B: (1, 3, 7)`  

**A2.** `A: (3, 1, 4)` and `B: (2, 5)`


In [None]:
# Your work (optional): Try constructing dummy tensors to see if operations succeed.
# Example for A1:
# A = torch.empty(5, 1, 7)
# B = torch.empty(1, 3, 7)
# print((A + B).shape)  # or any element-wise op
#
# Example for A2:
# A = torch.empty(3, 1, 4)
# B = torch.empty(2, 5)
# # The following should raise an error if not broadcastable:
# # print((A + B).shape)



## Part B — If broadcasting is possible, **show how** (via `unsqueeze` or indexing with `None`); otherwise explain exactly why not.


**B1.** `A: (10, 1, 5)` and `b: (5,)`. Broadcastable? If yes, give the result shape and identify which dimension(s) of which tensor expand.  

**B2.** Add a per-example bias: `x: (64, 1, 28, 28)`, `b: (64,)`.  
Add `b` to every element of the corresponding sample in `x`. Is this broadcastable? If so, how would you write the addition?

**B3.** Multiply an image batch by per-channel scales: `img: (16, 3, 224, 224)`, `w: (1, 3, 1, 1)`.  
Is this directly broadcastable? If yes, what expands where?


In [None]:
# Your work (optional):
# A = torch.empty(10, 1, 5)
# b = torch.empty(5)
# # print((A + b).shape)

# x = torch.empty(64, 1, 28, 28)
# b = torch.empty(64)
# # Try adding with a reshaped b:
# # y = x + b.view(64, 1, 1, 1)
#
# img = torch.empty(16, 3, 224, 224)
# w = torch.empty(1, 3, 1, 1)
# # Try multiplication:
# # out = img * w



## Part C — `torch.matmul` broadcasting (batch dims broadcast; matrix dims must align)



**C1.** `X: (10, 3, 4)` and `y: (4,)` with `X @ y`.  
Valid? If yes, what’s the result shape and why?

**C2.** `A: (5, 3, 4)` and `B: (2, 4, 6)` with `A @ B`.  
Valid? If not, explain exactly which dimensions fail to broadcast; then propose a small change that would make it valid.


In [None]:
# Your work (optional):
# X = torch.empty(10, 3, 4)
# y = torch.empty(4)
# # print((X @ y).shape)
#
# A = torch.empty(5, 3, 4)
# B = torch.empty(2, 4, 6)
# # This will error because batch dims (5) and (2) do not broadcast:
# # print((A @ B).shape)



## Part D — Practical refactoring (“can it broadcast, and if so how?”)



**D1.** You have `scores: (batch=128, heads=4, query_len=50, key_len=50)` and `pad: (batch=128, key_len=50)` with 0s for keep and −inf for masked positions.  
Without allocating a large temporary, show how to apply `pad` to `scores` using broadcasting. What view/unsqueeze shape should `pad` take?

**D2.** Subtract a per-channel mean from NCHW activations: `x: (64, 3, 32, 32)`, `mean: (3,)`.  
Show two equivalent approaches: one explicit via indexing/unsqueeze, one idiomatic relying on broadcasting.


In [None]:
# Your work (optional):
# scores = torch.empty(128, 4, 50, 50)
# pad = torch.empty(128, 50)
# # Apply broadcasting mask over last dimension (key_len):
# # scores = scores + pad.view(128, 1, 1, 50)
#
# x = torch.empty(64, 3, 32, 32)
# mean = torch.empty(3)
# # Explicit (every dim spelled out):
# # y1 = x - mean.view(1, 3, 1, 1)   # same as mean[None, :, None, None]
# # Idiomatic (broadcasting prepends the batch dim):
# # y2 = x - mean[:, None, None]



---

# Answer Key

### Part A
**A1.** Yes → result shape **(5, 3, 7)**.  
Right-to-left: `7 vs 7` ✔, `1 vs 3` → `1` expands to `3`, `5 vs 1` → `1` expands to `5`.

**A2.** **No.** Rightmost dims `4` (from `(3,1,4)`) and `5` (from `(2,5)`) conflict and neither is `1`.
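
Both answers can be checked without allocating tensors. The sketch below uses `torch.broadcast_shapes`, which works on shapes alone; the variable names `a1_shape` and `a2_ok` are only for illustration.

```python
import torch

# A1: (5, 1, 7) and (1, 3, 7) are compatible in every position.
a1_shape = torch.broadcast_shapes((5, 1, 7), (1, 3, 7))
print(a1_shape)  # torch.Size([5, 3, 7])

# A2: the trailing dims 4 and 5 conflict, so PyTorch raises a RuntimeError.
try:
    torch.broadcast_shapes((3, 1, 4), (2, 5))
    a2_ok = True
except RuntimeError as err:
    a2_ok = False
    print("A2 does not broadcast:", err)
```
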

### Part B

**B1.** Yes → result **(10, 1, 5)**.  
`b: (5,)` is left-padded to `(1, 1, 5)`. Its leading dimension expands from `1` to `10`; its middle dimension stays `1` because it already matches `A`. No dimension of `A` expands.

**B2.** Yes. Add with a reshaped bias:  
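`y = x + b.view(64, 1, 1, 1)` (or `b[:, None, None, None]`). The leading `64` matches the batch dimension, the next `1` matches the channel dimension (already `1`), and the last two `1`s expand across the `28 × 28` spatial dims. A plain `x + b` fails, because `(64,)` would be aligned with the last dimension (`28`).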

**B3.** Yes. `(16, 3, 224, 224) * (1, 3, 1, 1)` → result `(16, 3, 224, 224)`.  
All expansion happens in `w`: its batch dimension expands from `1` to `16` and its two spatial dimensions expand from `1` to `224`. The channel dimensions already match at `3`.
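
A short sketch that runs the three Part B cases with recognizable values, so the direction of each expansion is visible in the result. The names `bias`, `b1`, `b2`, and `b3` and the fill values are chosen here purely for illustration.

```python
import torch

# B1: b is treated as (1, 1, 5); only its leading dim expands (1 -> 10).
A = torch.zeros(10, 1, 5)
b = torch.arange(5.0)
b1 = A + b
print(b1.shape)  # torch.Size([10, 1, 5])

# B2: trailing singleton dims line the bias up with the batch dimension.
x = torch.zeros(64, 1, 28, 28)
bias = torch.arange(64.0)
b2 = x + bias[:, None, None, None]
print(b2.shape)  # torch.Size([64, 1, 28, 28])

# Without the reshape, (64,) is aligned with the last dim (28) and fails.
try:
    x + bias
    unreshaped_ok = True
except RuntimeError:
    unreshaped_ok = False

# B3: w expands over batch and spatial dims; channels already match.
img = torch.ones(16, 3, 224, 224)
w = torch.tensor([1.0, 2.0, 3.0]).view(1, 3, 1, 1)
b3 = img * w
print(b3.shape)  # torch.Size([16, 3, 224, 224])
```

After running it, every element of `b2[i]` equals `bias[i]`, and every element of channel `c` in `b3` equals `w[0, c, 0, 0]`.
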

### Part C
**C1.** Valid → result **(10, 3)**.  
Each `(3×4)` matrix in `X` multiplies the `(4,)` vector (treated as `(4,1)` then squeezed), yielding `(3,)` per batch; batch size `10` is preserved.

**C2.** **Invalid** as given.  
Batch dims are `(5)` vs `(2)`, which **do not broadcast** (neither is `1`). Matrix inner dims `4 @ 4` are fine.  
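**Fix:** Share one `B` across the batch by giving it shape `(4, 6)` or `(1, 4, 6)` (for example `B[0]` or `B[:1]`), which yields `(5, 3, 6)`. Tiling the two matrices up to `(5, 4, 6)` does not work cleanly because `5` is not a multiple of `2`. If you want every `A` paired with every `B`, use `A[:, None] @ B`: the batch dims `(5, 1)` and `(2,)` broadcast, giving `(5, 2, 3, 6)`.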
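
The following sketch runs C1 and two possible repairs for C2. Which repair is appropriate depends on what the two `B` matrices mean in your problem; the names `shared` and `pairs` are illustrative only.

```python
import torch

# C1: batched matrix-vector product, (10, 3, 4) @ (4,) -> (10, 3).
X = torch.randn(10, 3, 4)
y = torch.randn(4)
c1 = X @ y
print(c1.shape)

A = torch.randn(5, 3, 4)
B = torch.randn(2, 4, 6)

# C2 as given: batch dims 5 and 2 do not broadcast.
try:
    A @ B
    c2_ok = True
except RuntimeError:
    c2_ok = False

# Repair 1: share a single matrix across the batch, (5, 3, 4) @ (4, 6).
shared = A @ B[0]
print(shared.shape)

# Repair 2: pair every A with every B, (5, 1, 3, 4) @ (2, 4, 6).
pairs = A[:, None] @ B
print(pairs.shape)
```

In the second repair, `pairs[i, j]` holds `A[i] @ B[j]`.
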

### Part D
**D1.** View `pad` as `(128, 1, 1, 50)` and add:  
`scores = scores + pad.view(128, 1, 1, 50)`  
This broadcasts `pad` across `heads` and `query_len` while aligning to the last (key) dimension.
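
As a sketch of why no large temporary is needed: indexing with `None` returns a view that shares storage with `pad`, and the expansion to `(128, 4, 50, 50)` happens inside the addition. The choice to mask the last 10 key positions is an assumption made only for this example.

```python
import torch

scores = torch.zeros(128, 4, 50, 50)
pad = torch.zeros(128, 50)
pad[:, 40:] = float("-inf")  # assumed: the last 10 key positions are padding

# (128, 50) -> (128, 1, 1, 50); this is a view, not a copy.
mask = pad[:, None, None, :]
same_storage = mask.data_ptr() == pad.data_ptr()

masked = scores + mask
attn = torch.softmax(masked, dim=-1)
print(mask.shape, masked.shape, same_storage)
```

After the softmax, the masked key positions receive zero weight for every head and every query position.
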

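**D2.** Two equivalent approaches:  
- **Explicit:** `y1 = x - mean.view(1, 3, 1, 1)` (equivalently `mean[None, :, None, None]`), which spells out all four dims.  
- **Idiomatic:** `y2 = x - mean[:, None, None]`, which gives `mean` the shape `(3, 1, 1)` and lets broadcasting prepend the batch dimension.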
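
A quick check that different spellings of the per-channel subtraction agree. The sketch compares the fully explicit `(1, 3, 1, 1)` reshape with the shorter `(3, 1, 1)` form, which relies on broadcasting to prepend the batch dimension. Computing `mean` from `x` itself is an assumption made for the example.

```python
import torch

x = torch.randn(64, 3, 32, 32)
mean = x.mean(dim=(0, 2, 3))  # per-channel mean, shape (3,)

y_view = x - mean.view(1, 3, 1, 1)          # all four dims explicit
y_none = x - mean[None, :, None, None]      # same shape via None indexing
y_short = x - mean[:, None, None]           # (3, 1, 1); batch dim is prepended

print(torch.equal(y_view, y_none), torch.equal(y_view, y_short))
```
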

---

**Optional sanity-check cell (run locally):** Try constructing dummy tensors with the shapes above and performing the ops to observe resulting shapes or errors.
