Name: Shubhajeet Das <br />
Roll No.: 24AI10013 <br />
Date: 6 Jan 2026

## Deep Learning (PyTorch): Assignment 1

## Create a 3-dimensional tensor X representing a batch of padded sequences:

[Marks 1]

* Batch size B = 6

* Maximum sequence length T = 10

* Feature dimension F = 8

* Values must be sampled from a standard normal distribution

* The tensor must participate in backpropagation



In [6]:
import torch
X = torch.randn(6, 10, 8, requires_grad=True)
print("Tensor X created with shape:", X.shape)
print("Requires gradient:", X.requires_grad)

Tensor X created with shape: torch.Size([6, 10, 8])
Requires gradient: True


### Create a mask of shape (6, 10) containing 0, 1 only.

[Marks 1]


In [9]:
mask = torch.randint(0, 2, (6, 10))
print("Mask created with shape:", mask.shape)
print("Mask content sample:\n", mask)

Mask created with shape: torch.Size([6, 10])
Mask content sample:
 tensor([[1, 0, 1, 1, 1, 1, 1, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 1, 0, 0, 0],
        [1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
        [0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])


## Masked Reduction
[Marks 7]

Compute a tensor H ∈ ℝ^{B×F}:
- Must rely on broadcasting + reduction
- Operation must be numerically stable

In [10]:
mask_unsqueezed = mask.unsqueeze(-1)
masked_X = X * mask_unsqueezed
sum_masked_X = masked_X.sum(dim=1)
lengths = mask.sum(dim=1, keepdim=True)
H = sum_masked_X / (lengths + 1e-9)

print("Tensor H created with shape:", H.shape)
print("H sample:\n", H)

Tensor H created with shape: torch.Size([6, 8])
H sample:
 tensor([[-0.2134, -0.7200,  0.8103,  0.1301,  1.0648, -0.9419,  0.1278,  0.7643],
        [-0.3521,  1.2659,  0.5668,  0.1090, -0.7020,  0.4657,  0.0454,  0.5968],
        [ 1.0081, -0.1255,  0.3033, -0.9279, -0.3497,  0.7482, -0.4652,  0.4079],
        [ 0.5772, -0.2298,  0.4986,  0.1489, -0.2169, -0.2485, -0.2643,  0.4779],
        [ 0.3829, -0.7855,  0.3554,  0.3611, -0.3511,  0.2494,  0.0743,  0.2733],
        [-0.7410, -0.3905,  0.4918, -0.4858, -0.5172,  0.2507, -0.2818,  0.8785]],
       grad_fn=<DivBackward0>)
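The `1e-9` term in the cell above only matters when a row of the mask is entirely zero: in float32, adding `1e-9` to a count of 1 or more rounds back to the count itself. A minimal alternative sketch, assuming the intended reduction is a masked mean over the time dimension, clamps the count instead so the zero-length case is handled explicitly. The helper name `masked_mean` is hypothetical and not part of the assignment.

```python
import torch

def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean over time steps where mask == 1. x: (B, T, F), mask: (B, T) -> (B, F)."""
    m = mask.unsqueeze(-1).to(x.dtype)     # (B, T, 1), broadcasts over F
    total = (x * m).sum(dim=1)             # (B, F)
    count = m.sum(dim=1).clamp(min=1.0)    # (B, 1); an empty row divides by 1
    return total / count

torch.manual_seed(0)
X = torch.randn(6, 10, 8, requires_grad=True)
mask = torch.randint(0, 2, (6, 10))
mask[0] = 0                                # force one fully masked row for the demo

H = masked_mean(X, mask)
print(H.shape)                             # torch.Size([6, 8])
print(H[0].abs().max().item())             # 0.0 for the fully masked row
```

For rows with at least one valid step the result should match the epsilon version; the difference is only in how the empty-row case is expressed.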


## Matrix Multiplication

In [11]:
# Given code (provided with the assignment)
import torch
X = torch.randn(32, 64, requires_grad=True)
W = torch.randn(64, 16, requires_grad=True)
b = torch.randn(16, requires_grad=True)


### Compute:
[Marks 2]

Y=XW+b

such that broadcasting of b happens implicitly.

In [12]:
Y = X @ W + b
print("Tensor Y created with shape:", Y.shape)
print("Y sample (first row):\n", Y[0])

Tensor Y created with shape: torch.Size([32, 16])
Y sample (first row):
 tensor([  1.4631,  -9.6281,   7.0019,   6.3223,   2.9396,   4.1752,   5.8190,
         -5.9924,  -2.6496, -21.5498,   7.0579,   1.7749,  13.6921,   3.0240,
         -7.9628,   4.7901], grad_fn=<SelectBackward0>)


### Compute a scalar loss:
[Marks 1]

L=∑Y^2

In [13]:
L = (Y ** 2).sum()
print("Scalar loss L computed:", L)

Scalar loss L computed: tensor(33341.1914, grad_fn=<SumBackward0>)


### Objective
[Marks 3]

* What is the shape of Y?

* Which dimension(s) does PyTorch reduce over during backpropagation into b?

* Why would X @ W.T silently fail or produce wrong shapes?


*   **What is the shape of Y?**
    The shape of `Y` is **(32, 16)**.

*   **Which dimension(s) does PyTorch reduce over during backpropagation into b?**
    When computing `Y = X @ W + b`, `b` is effectively added to each row of `X @ W`. If `Y` has shape (B, F) and `b` has shape (F,), then during backpropagation, `dL/db` is obtained by summing `dL/dY` along the batch dimension (dimension 0). This is because `b` affects each row of `Y` identically, so its gradient accumulates contributions from all rows.
    Therefore, PyTorch reduces over **dimension 0 (the batch dimension)** during backpropagation into `b`.

*   **Why would X @ W.T silently fail or produce wrong shapes?**
    `X` has shape (32, 64).
    `W` has shape (64, 16).
    `W.T` (transpose of W) would have shape (16, 64).

    For matrix multiplication `A @ B`, the inner dimensions must match.
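    In `X @ W.T`:
    - `X` has shape (32, **64**)
    - `W.T` has shape (**16**, 64)

    The inner dimensions are 64 and 16, which **do not match**. Therefore, `X @ W.T` does not fail silently here: PyTorch raises a `RuntimeError` (e.g., `RuntimeError: mat1 and mat2 shapes cannot be multiplied`). A silent failure is only possible when the inner dimensions happen to match, for example if `W` were square (64, 64); then `X @ W.T` would run and return a tensor of the expected shape but with the wrong values.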
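The last two answers can be checked directly. The sketch below is a sanity check rather than part of the required solution: it compares `b.grad` with `dL/dY` summed over dimension 0 and confirms that `X @ W.T` raises for these shapes. The seed and the `retain_grad()` call are additions for the demonstration.

```python
import torch

torch.manual_seed(0)
X = torch.randn(32, 64, requires_grad=True)
W = torch.randn(64, 16, requires_grad=True)
b = torch.randn(16, requires_grad=True)

Y = X @ W + b
Y.retain_grad()              # keep dL/dY on the non-leaf tensor Y
L = (Y ** 2).sum()
L.backward()

# dL/db equals dL/dY summed over the batch dimension (dim 0)
print(b.grad.shape)                                            # torch.Size([16])
print(torch.allclose(b.grad, Y.grad.sum(dim=0), atol=1e-4))    # True

# X @ W.T raises immediately because the inner dimensions are 64 and 16
try:
    X @ W.T
except RuntimeError as err:
    print("RuntimeError:", err)
```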

## Backpropagation

### Code

```python
import torch

B = 4  # batch_size

X = torch.randn(B, 6, requires_grad=True)

W1 = torch.randn(6, 5, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)

W2 = torch.randn(5, 3, requires_grad=True)
b2 = torch.randn(3, requires_grad=True)

H = X @ W1 + b1            #1
A = torch.relu(H.detach()) #2    
Y = A @ W2 + b2            #3
L = (Y ** 2).mean()        #4
```

### Code Reading
[Marks 1]

One of the given 4 lines (\#1, \#2, \#3, \#4) is wrong. Identify that line and justify your choice.

**The wrong line is: `#2 A = torch.relu(H.detach())`**

**Justification:**
`H.detach()` returns a tensor that shares data with `H` but is cut off from the computational graph. Everything computed from it (`A`, `Y`, `L`) has no path back to `H`, so no gradient reaches `H`, `W1`, `b1`, or `X`. `L.backward()` still runs without error because `W2` and `b2` require gradients, but `W1.grad`, `b1.grad`, and `X.grad` stay `None`, so the first layer would never be trained. The corrected line is `A = torch.relu(H)`.
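A short check of this behaviour, run on the uncorrected code (the seed is an addition for reproducibility):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 6, requires_grad=True)
W1 = torch.randn(6, 5, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)
W2 = torch.randn(5, 3, requires_grad=True)
b2 = torch.randn(3, requires_grad=True)

H = X @ W1 + b1
A = torch.relu(H.detach())   # the faulty line: the graph is cut here
Y = A @ W2 + b2
L = (Y ** 2).mean()
L.backward()                 # no error, since W2 and b2 still require grad

print(W1.grad, b1.grad, X.grad)        # None None None
print(W2.grad.shape, b2.grad.shape)    # torch.Size([5, 3]) torch.Size([3])
```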

### Without running the code (after correcting the wrong line), determine the exact shape of the following tensors:
[Marks 2]

- H -> (4, 5)

- A -> (4, 5)

- Y -> (4, 3)

- L -> () scalar

### After Backpropagation
[Marks 2]
```python
L.backward()
```
After backpropagation, state the shape of the gradient for each of the following:

∂L / ∂H

∂L / ∂W1

∂L / ∂b1

∂L / ∂X

In [17]:
B = 4  # batch_size

X = torch.randn(B, 6, requires_grad=True)

W1 = torch.randn(6, 5, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)

W2 = torch.randn(5, 3, requires_grad=True)
b2 = torch.randn(3, requires_grad=True)

H = X @ W1 + b1
A = torch.relu(H)
Y = A @ W2 + b2
L = (Y ** 2).mean()
L.backward()

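*   **∂L / ∂H**: The gradient `dL/dH` will have the same shape as `H`. From the previous step, `H` has shape `(4, 5)`. Therefore, `∂L / ∂H` will have shape **(4, 5)**. Note that `H` is a non-leaf tensor, so `H.grad` is only populated if `H.retain_grad()` is called before `L.backward()`.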

*   **∂L / ∂W1**: The gradient `dL/dW1` will have the same shape as `W1`. `W1` has shape `(6, 5)`. Therefore, `∂L / ∂W1` will have shape **(6, 5)**.

*   **∂L / ∂b1**: The gradient `dL/db1` will have the same shape as `b1`. `b1` has shape `(5,)`. Therefore, `∂L / ∂b1` will have shape **(5,)**.

*   **∂L / ∂X**: The gradient `dL/dX` will have the same shape as `X`. `X` has shape `(4, 6)`. Therefore, `∂L / ∂X` will have shape **(4, 6)**.
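These shapes can be confirmed with a quick check. The sketch repeats the corrected code and adds `H.retain_grad()` (not in the original cell) so that the gradient of the intermediate tensor `H` is kept; the seed is also an addition.

```python
import torch

torch.manual_seed(0)
B = 4  # batch_size
X = torch.randn(B, 6, requires_grad=True)
W1 = torch.randn(6, 5, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)
W2 = torch.randn(5, 3, requires_grad=True)
b2 = torch.randn(3, requires_grad=True)

H = X @ W1 + b1
H.retain_grad()              # H is non-leaf; without this, H.grad stays None
A = torch.relu(H)
Y = A @ W2 + b2
L = (Y ** 2).mean()
L.backward()

for name, t in [("H", H), ("W1", W1), ("b1", b1), ("X", X)]:
    print(name, tuple(t.grad.shape))
# H (4, 5)
# W1 (6, 5)
# b1 (5,)
# X (4, 6)
```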