UCS761 – Deep Learning Lab 6
Deep Network Structural Stress Test
Roll Number: 102303346

PART A: Deep Network Structural Stress Test
Step 1: Extract a, b, c

Last three digits of roll number: 346

Therefore:

* a = 6

* b = 4

* c = 3

In [63]:
ROLL_NUMBER = 102303346

roll = str(ROLL_NUMBER)

a = int(roll[-1])   # 6
b = int(roll[-2])   # 4
c = int(roll[-3])   # 3

print("a =", a)
print("b =", b)
print("c =", c)

a = 6
b = 4
c = 3


Using the rules:

Hidden layer width = 6 + a = 6 + 6 = 12


Number of hidden layers = 4 + (b mod 3)
= 4 + (4 mod 3)
= 4 + 1
= 5 hidden layers


Learning rate = 0.002 × (c + 1)
= 0.002 × 4
= 0.008



Activation:

a = 6 (even) → ReLU


Weight initialization range:

[-1/(a+1), +1/(a+1)]
= [-1/7, +1/7]

Bias = 0

In [64]:
import numpy as np
import pandas as pd

hidden_width = 6 + a
num_hidden_layers = 4 + (b % 3)
learning_rate = 0.002 * (c + 1)
activation_name = "relu"
init_range = 1/(a+1)

print(hidden_width, num_hidden_layers, learning_rate)

12 5 0.008


Dataset Construction (Nonlinear)

We construct nonlinear regression data to force representation bending.

In [65]:
np.random.seed(np.random.randint(0,1000))

N = 400
X = np.random.uniform(-2,2,(N,3))

y = (
    np.sin(X[:,0]) +
    0.5*(X[:,1]**2) -
    0.8*X[:,2]
).reshape(-1,1)
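
Seeding with a value that is itself drawn from the unseeded generator does not make the data reproducible, so the losses reported in this notebook will differ from run to run. A minimal sketch of a reproducible variant, assuming a fixed seed of 0 (an arbitrary choice, not the seed of the recorded run) and a hypothetical helper name `make_dataset`:

```python
import numpy as np

def make_dataset(n_samples=400, seed=0):
    """Hypothetical helper: same target function as above, with a fixed seed."""
    rng = np.random.default_rng(seed)            # seed=0 is an assumption
    X = rng.uniform(-2, 2, size=(n_samples, 3))
    y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 - 0.8 * X[:, 2]).reshape(-1, 1)
    return X, y

X_demo, y_demo = make_dataset()
X_again, y_again = make_dataset()
print(X_demo.shape, y_demo.shape)
print(np.array_equal(X_demo, X_again))           # same seed, same data
```

Using this helper would change the numbers recorded in the later cells, so it is shown only as an option for repeatable experiments.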

Activation Functions

In [66]:
def relu(z):
    return np.maximum(0,z)

def relu_grad(z):
    return (z>0).astype(float)

Initialize Deep Network

Architecture:

3 → 12 → 12 → 12 → 12 → 12 → 1
(5 hidden layers)

In [67]:
layers = [3] + [hidden_width]*num_hidden_layers + [1]

weights = []
biases = []

for i in range(len(layers)-1):
    W = np.random.uniform(-init_range, init_range,
                          (layers[i], layers[i+1]))
    b_vec = np.zeros((1,layers[i+1]))
    weights.append(W)
    biases.append(b_vec)

Forward and Backward Propagation

In [68]:
def forward(X):
    A = X
    activations = [A]
    Zs = []

    for i in range(len(weights)-1):
        Z = A @ weights[i] + biases[i]
        Zs.append(Z)
        A = relu(Z)
        activations.append(A)

    Z_final = A @ weights[-1] + biases[-1]
    Zs.append(Z_final)
    activations.append(Z_final)

    return activations, Zs


def backward(activations, Zs, y_true):

    grads_W = []
    grads_b = []

    y_hat = activations[-1]
    error = y_hat - y_true
    dA = 2*error/len(y_true)

    for i in reversed(range(len(weights))):
        A_prev = activations[i]
        Z = Zs[i]

        if i == len(weights)-1:
            dZ = dA
        else:
            dZ = dA * relu_grad(Z)

        dW = A_prev.T @ dZ
        db = np.sum(dZ,axis=0,keepdims=True)

        grads_W.insert(0,dW)
        grads_b.insert(0,db)

        dA = dZ @ weights[i].T

    return grads_W, grads_b
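
A finite-difference check is one way to gain confidence that the hand-written backward pass matches the loss. The sketch below re-implements the same forward/backward logic on a tiny network so that it is self-contained; the layer sizes, seed, weight range, and all `demo_*` names are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np

def forward_backward(weights, biases, X, y):
    """Return the MSE loss and weight gradients (ReLU hidden layers, linear output)."""
    A, acts, Zs = X, [X], []
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        Z = A @ W + b
        Zs.append(Z)
        A = Z if i == last else np.maximum(0, Z)
        acts.append(A)
    loss = np.mean((A - y) ** 2)
    dA = 2 * (A - y) / len(y)
    grads_W = [None] * len(weights)
    for i in reversed(range(len(weights))):
        dZ = dA if i == last else dA * (Zs[i] > 0)
        grads_W[i] = acts[i].T @ dZ
        dA = dZ @ weights[i].T
    return loss, grads_W

rng = np.random.default_rng(0)                     # arbitrary seed
sizes = [3, 4, 4, 1]                               # small demo network
demo_weights = [rng.uniform(-0.5, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
demo_biases = [rng.uniform(-0.5, 0.5, (1, n)) for n in sizes[1:]]
demo_X = rng.uniform(-2, 2, (20, 3))
demo_y = rng.normal(size=(20, 1))

_, analytic = forward_backward(demo_weights, demo_biases, demo_X, demo_y)

eps = 1e-6
max_err = 0.0
for layer in range(len(demo_weights)):
    for idx in [(0, 0), (1, 0)]:                   # probe two entries per layer
        original = demo_weights[layer][idx]
        demo_weights[layer][idx] = original + eps
        loss_plus, _ = forward_backward(demo_weights, demo_biases, demo_X, demo_y)
        demo_weights[layer][idx] = original - eps
        loss_minus, _ = forward_backward(demo_weights, demo_biases, demo_X, demo_y)
        demo_weights[layer][idx] = original        # restore the weight
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[layer][idx]))

print("max abs difference:", max_err)
```

A very small difference suggests the analytic gradients agree with the loss; a large one would point to an indexing or scaling mistake in the backward pass.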

Baseline Run – 400 Epochs

In [69]:
loss_record = {}

for epoch in range(1,401):

    activations, Zs = forward(X)
    y_hat = activations[-1]

    loss = np.mean((y_hat-y)**2)

    grads_W, grads_b = backward(activations, Zs, y)

    for i in range(len(weights)):
        weights[i] -= learning_rate * grads_W[i]
        biases[i] -= learning_rate * grads_b[i]

    if epoch in [1,100,400]:
        loss_record[epoch] = loss

grad_first = np.linalg.norm(grads_W[0])
grad_last = np.linalg.norm(grads_W[-2])
GRI = grad_first/grad_last

print("Loss Epoch1:", loss_record[1])
print("Loss Epoch100:", loss_record[100])
print("Loss Epoch400:", loss_record[400])
print("Grad First:", grad_first)
print("Grad Last:", grad_last)
print("GRI:", GRI)

Loss Epoch1: 2.108921628085988
Loss Epoch100: 1.6939143100570506
Loss Epoch400: 1.6792283228593363
Grad First: 0.0015464389527515316
Grad Last: 0.001269362727289785
GRI: 1.2182797867819324


Structural Diagnosis (Run 1)


Loss decreased from 2.109 (epoch 1) to 1.679 (epoch 400), with most of the drop occurring before epoch 100 (1.694).
Training stabilized without oscillation or divergence.

Gradient norm (first hidden layer) = 0.00155
Gradient norm (last hidden layer) = 0.00127

GRI = 1.218

Since GRI > 1, the first-layer gradient is slightly larger than the last-layer gradient.

This indicates:

No vanishing gradient

No exploding gradient

Gradient flow is relatively stable across depth

Early layers are receiving sufficient training signal

Therefore, this is not a representation failure and not an optimization instability.

The slow loss reduction suggests moderate optimization difficulty but not structural breakdown.
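
The GRI compresses gradient flow into a single ratio. Looking at the norm of every layer's weight gradient can show where along the depth any shrinkage happens. A minimal sketch, assuming a hypothetical helper `gradient_profile` and toy gradient arrays (the values below are illustrative, not taken from the experiment):

```python
import numpy as np

def gradient_profile(grads_W):
    """Frobenius norm of each weight gradient, plus the first/last-hidden ratio.

    Assumes the list is ordered input -> output, so grads_W[-2] belongs to the
    weights feeding the last hidden layer (the same convention as the GRI above).
    """
    norms = [float(np.linalg.norm(g)) for g in grads_W]
    gri = norms[0] / norms[-2]
    return norms, gri

# Toy gradients with easy-to-check norms.
toy_grads = [np.full((2, 2), 1.0), np.full((2, 2), 0.5), np.full((2, 1), 3.0)]
toy_norms, toy_gri = gradient_profile(toy_grads)
print(toy_norms)
print(toy_gri)
```

In the notebook, calling `gradient_profile(grads_W)` right after a training loop would give the per-layer picture behind the reported GRI.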

Forced Structural Break

Since:

b = 4
b mod 3 = 1

Rule says:

Add +3 hidden layers.

New hidden layers = 8

In [70]:
for _ in range(3):
    W = np.random.uniform(-init_range,init_range,
                          (hidden_width,hidden_width))
    b_vec = np.zeros((1,hidden_width))
    weights.insert(-1,W)
    biases.insert(-1,b_vec)
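
Note that `list.insert(-1, ...)` places the new item just before the last element, so the three new matrices sit between the existing hidden layers and the output layer, and the existing weights keep whatever values they had. A small shape-only sketch (zero matrices stand in for the real weights; `demo_weights` is a hypothetical name):

```python
import numpy as np

hidden = 12                                       # hidden width from Step 1
shapes = [(3, hidden)] + [(hidden, hidden)] * 4 + [(hidden, 1)]
demo_weights = [np.zeros(s) for s in shapes]      # only the shapes matter here

for _ in range(3):
    # insert(-1, ...) puts the new matrix right before the output layer
    demo_weights.insert(-1, np.zeros((hidden, hidden)))

print(len(demo_weights))                          # 8 hidden layers + output layer
print([w.shape for w in demo_weights][-2:])
```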

Run 2 – 400 Epochs

In [71]:
loss_record2 = {}

for epoch in range(1,401):

    activations, Zs = forward(X)
    y_hat = activations[-1]

    loss = np.mean((y_hat-y)**2)

    grads_W, grads_b = backward(activations, Zs, y)

    for i in range(len(weights)):
        weights[i] -= learning_rate * grads_W[i]
        biases[i] -= learning_rate * grads_b[i]

    if epoch in [1,100,400]:
        loss_record2[epoch] = loss

grad_first2 = np.linalg.norm(grads_W[0])
grad_last2 = np.linalg.norm(grads_W[-2])
GRI2 = grad_first2/grad_last2

print("Run2 Loss Epoch1:", loss_record2[1])
print("Run2 Loss Epoch400:", loss_record2[400])
print("Run2 GRI:", GRI2)

Run2 Loss Epoch1: 1.6807958092530657
Run2 Loss Epoch400: 1.679399079753295
Run2 GRI: 0.38839788917114976


Structural Diagnosis (Run 2)

In Run 2, depth increased from 5 to 8 hidden layers.

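Loss changed from 1.68080 (epoch 1) to 1.67940 (epoch 400), indicating almost no improvement. Training stagnated but did not diverge. Note that Run 2 reuses the weights already trained in Run 1, with only the three inserted layers freshly initialized, so it starts close to the plateau Run 1 had already reached (1.67923).

GRI dropped from 1.218 (Run 1) to 0.3884 (Run 2).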

Since GRI < 1, the first-layer gradient is now smaller than the last-layer gradient. This confirms that early-layer gradients shrank relative to later layers.

The depth increase caused additional gradient attenuation due to repeated Jacobian multiplications during backpropagation.

There was no oscillation or explosion, so learning rate overshoot was not the issue.

This is not a representation failure, because the deeper model has higher representational capacity.

Instead, this is an optimization instability caused by depth-induced gradient attenuation.

Did GRI increase, decrease, or collapse?

It decreased significantly:

1.218 → 0.3884

That is a structural shift toward gradient attenuation.

Did early-layer gradients shrink relative to later layers?

Yes.

In Run 1:

First layer > Last layer

In Run 2:

First layer < Last layer

This confirms gradient decay across depth.

Did loss stabilize, oscillate, or diverge?

It stabilized almost immediately and barely improved.

So:

Training stagnated.

Was failure due to:

Depth multiplication? YES

Activation slope behavior? Minor factor

Learning rate overshoot? NO

Learning rate was unchanged (0.008).
No explosion occurred.

The key change was depth.

Representation failure or optimization instability?

Important distinction:

The model is more expressive in Run 2.

So it is not representation failure.

It is:

Optimization difficulty caused by gradient attenuation due to increased depth.
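
The attenuation argument can be illustrated in isolation. The sketch below tracks the norm of the backpropagated signal through a freshly initialized ReLU stack with the same width (12) and initialization range (±1/7). It is a simplified stand-in, not the notebook's experiment: it uses new random weights, a placeholder upstream signal, and measures the backward signal itself rather than the weight-gradient norms used for the GRI. The function name and seed are assumptions.

```python
import numpy as np

def backprop_signal_norms(depth, width=12, bound=1/7, n=400, seed=0):
    """Norm of the backward signal at each hidden layer of a fresh ReLU network."""
    rng = np.random.default_rng(seed)             # arbitrary seed
    A = rng.uniform(-2, 2, (n, 3))
    Ws, masks = [], []
    fan_in = 3
    for _ in range(depth):
        W = rng.uniform(-bound, bound, (fan_in, width))
        Z = A @ W                                 # biases are zero at init
        masks.append(Z > 0)                       # ReLU derivative
        A = np.maximum(0, Z)
        Ws.append(W)
        fan_in = width
    dA = np.ones((n, width)) / n                  # placeholder upstream signal
    norms = []
    for W, mask in zip(reversed(Ws), reversed(masks)):
        dZ = dA * mask                            # pass through ReLU derivative
        norms.append(float(np.linalg.norm(dZ)))
        dA = dZ @ W.T                             # pass through the weight matrix
    return norms[::-1]                            # first hidden layer -> last hidden layer

norms5 = backprop_signal_norms(5)
norms8 = backprop_signal_norms(8)
print("depth 5:", norms5[0], norms5[-1])
print("depth 8:", norms8[0], norms8[-1])
```

Each extra layer multiplies the signal by another small weight matrix and another ReLU mask, which is the mechanism the diagnosis above refers to.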

PART B – Structural Reading Component

Dense vs Convolution Parameter Comparison

Given:

a = 6
b = 4
c = 3

Input size = (24 + a) = 30
Dense hidden neurons = 32 + b = 36

Dense parameters:

(30×30) × 36 + 36
= 900 × 36 + 36
= 32400 + 36
= 32436

In [72]:
dense_params = (30*30)*36 + 36
dense_params

32436

Convolution

Filter size = 3 + (c mod 2)
= 3 + 1 = 4

Filters = 8 + a = 14

Conv parameters:

(4×4) × 14 + 14
= 16 × 14 + 14
= 224 +14
= 238

In [73]:
conv_params = (4*4)*14 + 14
conv_params

238

Why scaling differs?

Dense connects every pixel to every neuron.
For an N×N input, the parameter count grows with N², i.e. quadratically in the side length (linearly in the number of pixels).

Convolution uses local receptive fields and weight sharing.
Parameter count depends only on filter size, not full image size.
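
The scaling difference can be made concrete by recomputing both counts for larger inputs. A short sketch, assuming a single input channel as in the calculation above; the helper names and the larger side lengths (60, 120) are illustrative:

```python
def dense_params(side, hidden):
    """Weights + biases for one dense layer on a flattened side x side input."""
    return side * side * hidden + hidden

def conv_params(filter_size, n_filters, in_channels=1):
    """Weights + biases for one conv layer (in_channels=1 is an assumption)."""
    return filter_size * filter_size * in_channels * n_filters + n_filters

for side in (30, 60, 120):
    # The conv count does not depend on the input side length.
    print(side, dense_params(side, 36), conv_params(4, 14))
```

Doubling the side length roughly quadruples the dense count, while the convolution count stays at 238.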

Output Size Calculation

N = 30
F = 3
S = 1 + (b mod 2) = 1 + 0 = 1
P = (c mod 2) = 1

Output size formula:

((N − F + 2P) / S) + 1

= (30 − 3 + 2)/1 + 1
= 29 + 1
= 30

In [74]:
output_size = (30-3+2)/1 + 1
output_size

30.0
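
The `/` operator returns a float (30.0). A small helper with floor division keeps the result an integer and also covers strides that do not divide evenly. This is a sketch; the function name and the stride-2 example are illustrative and not part of the assignment:

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution, using floor division."""
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(30, 3, stride=1, padding=1))   # the case computed above
print(conv_output_size(30, 3, stride=2, padding=1))   # hypothetical stride of 2
```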

Manual Convolution (Center Only)

Construct matrix:

value(i,j) = (a+i) + (b+j)

a=6, b=4

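Center 3×3 region computed manually (with i, j starting at 0, as in the code below).

Filter:

Center = +3
All others = −1

After elementwise multiplication and summation (center value 14, eight neighbors summing to 112):

Final center convolution output = 3×14 − 112 = -70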

In [75]:
matrix = np.zeros((5,5))
for i in range(5):
    for j in range(5):
        matrix[i,j] = (6+i) + (4+j)

filter_mat = -1*np.ones((3,3))
filter_mat[1,1] = 3

sub = matrix[1:4,1:4]
result = np.sum(sub*filter_mat)

result

np.float64(-70.0)
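
For comparison, the same filter can be slid over every 3×3 window of the 5×5 matrix, not just the center. The sketch below assumes stride 1, no padding, and no kernel flip (cross-correlation, as in the cell above); the names `grid` and `kernel` are illustrative.

```python
import numpy as np

# Same 5x5 matrix and 3x3 filter as in the cell above.
i, j = np.indices((5, 5))
grid = (6 + i) + (4 + j)
kernel = -np.ones((3, 3))
kernel[1, 1] = 3

# Slide the filter over every 3x3 window.
out = np.zeros((3, 3))
for r in range(3):
    for c in range(3):
        out[r, c] = np.sum(grid[r:r + 3, c:c + 3] * kernel)

print(out)
# Shortcut for this particular input: the matrix is a linear ramp and the
# filter is symmetric, so each output equals sum(kernel) * window center.
print(kernel.sum() * grid[2, 2])
```

The center entry `out[1, 1]` reproduces the value computed above.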

Q1. Based on your recorded GRI, if your first-layer gradient was X and last-layer gradient was Y, what does that mathematically say about signal survival across your network depth?
Run 1

First-layer gradient (X) = 0.00155
Last-layer gradient (Y) = 0.00127

GRI= 1.218

Since GRI > 1, the gradient magnitude in the first hidden layer is actually larger than in the last hidden layer.

This mathematically indicates:

Gradient signal did not vanish.

Backpropagated signal was preserved across depth.

There was no early-layer starvation of learning signal.

Signal survival across 5 hidden layers was stable and even slightly amplified.

Run 2

GRI = 0.3884

Now:

$\|\nabla W_{\text{first}}\| < \|\nabla W_{\text{last}}\|$

This means the gradient magnitude at the first layer is only 38.84% of the last hidden layer's gradient.

Mathematically, this confirms:

Gradient attenuation occurred due to increased depth.

Signal decayed while propagating backward.

Early layers received significantly weaker updates.

Thus, depth directly reduced gradient survival.

Q2. If I remove all activation functions from your architecture but keep the same depth and parameters, how many effective linear layers remain? What does that imply about representation power?

Without activation functions, each layer becomes a linear transformation:

$Z_{L+1} = W_L Z_L$



Stacking the six linear transformations (five hidden layers plus the output layer) results in:

$W_6 W_5 W_4 W_3 W_2 W_1 X$

This collapses into a single matrix multiplication:

$\tilde{W} X$

So effective linear layers = 1

Implication:

Depth provides zero additional representational power.

The network reduces to a single linear regression model.

No nonlinear bending occurs.

It cannot model sine or quadratic curvature in the dataset.

Therefore, activation functions are essential for representation power.
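
The collapse can be checked numerically. The sketch below builds random weight matrices with the Part A layer sizes, runs a layer-by-layer pass with no activation, and compares it with one multiplication by the product matrix. The seed and `demo` names are assumptions, biases are left out (they are zero at initialization), and the code uses the notebook's row-vector convention `X @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)                   # arbitrary seed
sizes = [3, 12, 12, 12, 12, 12, 1]               # same layer sizes as Part A
Ws = [rng.uniform(-1/7, 1/7, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
X_demo = rng.uniform(-2, 2, (5, 3))

# Layer-by-layer pass with the activations removed.
A = X_demo
for W in Ws:
    A = A @ W

# The same map expressed as a single 3x1 matrix.
W_collapsed = np.linalg.multi_dot(Ws)
print(W_collapsed.shape)
print(np.allclose(A, X_demo @ W_collapsed))
```

With trained, non-zero biases the collapsed model would be a single affine map (one matrix plus one bias vector), which is still linear regression.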

Q3. What was one assumption you had about deep networks that your own experiment proved wrong?

I initially assumed:

Increasing depth always improves learning performance.

However, in my experiment:

Run 1 (5 layers) had stable gradient flow (GRI = 1.218).

Run 2 (8 layers) had attenuated gradient flow (GRI = 0.3884).

Loss improvement nearly stopped in Run 2.

This showed that deeper networks can introduce optimization difficulty even when representational capacity increases.

Thus, depth alone does not guarantee better training performance.

Q4. If GPT had generated this architecture for you without your roll-number constraint, what structural mismatch would immediately expose that it was not following the assignment rules?

For roll number 102303346:

Hidden width must be 12.

Hidden layers must be 5.

Learning rate must be 0.008.

Activation must be ReLU.

Initialization range must be [-1/7, +1/7].

If GPT produced:

10 hidden units

4 layers instead of 5

Learning rate 0.01

Sigmoid activation

Initialization [-0.5,0.5]

That would immediately expose non-compliance with the roll-number constraints.

Since all architectural components depend mathematically on (a,b,c), any mismatch would reveal structural inconsistency.
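
The compliance check described above can be sketched as a small function that recomputes the expected values from the roll number and lists every key that differs. The function names and dictionary keys are hypothetical. Only the even-`a` activation rule (ReLU) is stated in this notebook, so the odd-`a` case is left unspecified rather than guessed.

```python
import math

def expected_config(roll_number):
    """Hyperparameters implied by the Step 1 rules for a given roll number."""
    digits = str(roll_number)
    a, b, c = int(digits[-1]), int(digits[-2]), int(digits[-3])
    return {
        "hidden_width": 6 + a,
        "num_hidden_layers": 4 + (b % 3),
        "learning_rate": 0.002 * (c + 1),
        "init_bound": 1 / (a + 1),
        # Odd-a activation is not given in this notebook, hence None.
        "activation": "relu" if a % 2 == 0 else None,
    }

def mismatches(proposed, roll_number):
    """Return the keys where a proposed configuration breaks the rules."""
    bad = []
    for key, want in expected_config(roll_number).items():
        got = proposed.get(key)
        if isinstance(want, float) and isinstance(got, (int, float)):
            same = math.isclose(got, want)       # tolerate float rounding
        else:
            same = got == want
        if not same:
            bad.append(key)
    return bad

# The generic configuration listed above.
generic = {"hidden_width": 10, "num_hidden_layers": 4, "learning_rate": 0.01,
           "init_bound": 0.5, "activation": "sigmoid"}
print(mismatches(generic, 102303346))
```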

Q5. If your gradients could "talk" during Run 2, what would the first layer complain about, and why?

During Run 2, depth increased to 8 hidden layers.

GRI dropped from 1.218 → 0.3884.

The first layer would complain:

"Every time the error signal travels backward through another weight matrix and ReLU derivative, my magnitude shrinks. By the time it reaches me, I am only receiving about 39% of the last hidden layer's strength."

Thus, the first layer receives weakened learning signals, slowing parameter updates.

In [76]:
loss_history = []

for epoch in range(1,401):

    activations, Zs = forward(X)
    y_hat = activations[-1]

    loss = np.mean((y_hat - y)**2)
    loss_history.append(loss)

    grads_W, grads_b = backward(activations, Zs, y)

    for i in range(len(weights)):
        weights[i] -= learning_rate * grads_W[i]
        biases[i] -= learning_rate * grads_b[i]

# Create DataFrame
df_loss = pd.DataFrame({
    "Epoch": range(1,401),
    "Loss": loss_history
})

# Show specific epoch
df_loss[df_loss["Epoch"] == 257]

     Epoch      Loss
256    257  1.679399
