# So, the thing is: what if we are given constant weights!?

Let's create a simple dataset and do a single forward pass, then see which gradients backprop assigns to the weights.

In [1]:
%matplotlib inline
import torch
from d2l import torch as d2l
import matplotlib.pyplot as plt

In [65]:
import numpy as np
from sklearn.datasets import make_regression

# Parameters
n_samples = 1000  # Number of samples
n_features = 5     # Number of features
noise = 0.1        # Standard deviation of the gaussian noise

# Generate synthetic data
X, y, coefficients = make_regression(n_samples=n_samples, 
                                     n_features=n_features, 
                                     noise=noise, 
                                     coef=True, 
                                     random_state=42)

In [67]:
X = torch.from_numpy(X).type(torch.float32)
y = torch.from_numpy(y).type(torch.float32)

In [68]:
# STEP - 1: *CONSTANT WEIGHTS*
weights = torch.full((n_features, 1), 0.42, requires_grad=True) # fill with constant
bias = torch.tensor(.123, requires_grad=True)

In [69]:
weights, bias

(tensor([[0.4200],
         [0.4200],
         [0.4200],
         [0.4200],
         [0.4200]], requires_grad=True),
 tensor(0.1230, requires_grad=True))

In [70]:
# STEP - 2: forward pass
prediction = torch.matmul(X, weights) + bias

In [73]:
# initial prediction
prediction[:5]

tensor([[ 1.1458],
        [-0.1861],
        [-0.2003],
        [-1.0051],
        [-1.6343]], grad_fn=<SliceBackward0>)

In [74]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()
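
One thing to watch in the loss cell above: `make_regression` returns `y` with shape `(1000,)`, while `prediction` has shape `(1000, 1)`, so `y - prediction` broadcasts to a `(1000, 1000)` matrix of all pairwise differences instead of 1000 per-sample errors. The outputs recorded in this notebook come from the cells exactly as written. Below is a small sketch on toy tensors (not the notebook's data) that just shows the two shapes; reshaping `y` is one possible way to get a per-sample loss, assuming that is what you want.

```python
import torch

torch.manual_seed(0)

n = 4
y = torch.randn(n)              # shape (4,), same layout as the notebook's y
prediction = torch.randn(n, 1)  # shape (4, 1), same layout as X @ weights + bias

# (4,) minus (4, 1) broadcasts to (4, 4): every y against every prediction
pairwise = (y - prediction) ** 2 / 2
print(pairwise.shape)

# Reshaping y to (4, 1) keeps the subtraction per-sample
elementwise = (y.reshape(-1, 1) - prediction) ** 2 / 2
print(elementwise.shape)
```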

In [75]:
# currently we don't have any grad
type(weights.grad)

NoneType

In [76]:
# STEP - 4: Calculate the grad
loss.backward()

In [77]:
# now we have the grad!
type(weights.grad), weights.grad

(torch.Tensor,
 tensor([[0.3611],
         [0.4411],
         [0.4365],
         [0.3894],
         [0.4393]]))

Even though all the weights are **exactly the same**, each weight now gets a different grad. That's precisely because the features (the columns of `X`) are different.
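
Why do the features matter? For a linear model with the halved squared-error loss averaged over samples, the grad of weight `j` is the mean over samples of `(prediction - y) * x_j`. Here is a minimal sketch that checks this against autograd. It uses small toy tensors and a `y` already shaped `(n, 1)`, so it is an illustration of the formula rather than a reproduction of the numbers above.

```python
import torch

torch.manual_seed(0)

# Toy data: 8 samples, 3 features (illustrative sizes only)
X = torch.randn(8, 3)
y = torch.randn(8, 1)

# Constant weights, as in the notebook
w = torch.full((3, 1), 0.42, requires_grad=True)
b = torch.tensor(0.123, requires_grad=True)

pred = X @ w + b
loss = ((y - pred) ** 2 / 2).mean()
loss.backward()

# Closed form: grad_j = mean_i((pred_i - y_i) * x_ij)
manual = (X * (pred - y)).mean(dim=0, keepdim=True).T.detach()

print(w.grad.squeeze())
print(torch.allclose(w.grad, manual, atol=1e-6))
```

The weights are identical, but each one is multiplied by a different column of `X`, so the averages come out different.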

## Let's make the features the same as well!

In [84]:
X = torch.ones(1000, n_features)

In [85]:
# STEP - 1: *CONSTANT WEIGHTS*
weights = torch.full((n_features, 1), 0.42, requires_grad=True) # fill with constant
bias = torch.tensor(.123, requires_grad=True)

In [86]:
weights, bias

(tensor([[0.4200],
         [0.4200],
         [0.4200],
         [0.4200],
         [0.4200]], requires_grad=True),
 tensor(0.1230, requires_grad=True))

In [87]:
# STEP - 2: forward pass
prediction = torch.matmul(X, weights) + bias

In [88]:
# initial prediction
prediction[:5]

tensor([[2.2230],
        [2.2230],
        [2.2230],
        [2.2230],
        [2.2230]], grad_fn=<SliceBackward0>)

> See? The predictions are the same... that's expected as all values are `1`.

In [89]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()

In [90]:
# currently we don't have any grad
type(weights.grad)

NoneType

In [91]:
# STEP - 4: Calculate the grad
loss.backward()

In [92]:
# now we have the grad!
type(weights.grad), weights.grad

(torch.Tensor,
 tensor([[1.3253],
         [1.3253],
         [1.3253],
         [1.3253],
         [1.3253]]))

> Cool. Now we have the **same** grads across all weights.

## Let's now make the weights *different* but keep the same data...

In [93]:
# same data
X = torch.ones(1000, n_features)

In [107]:
# STEP - 1: Random weights
weights = torch.randn(n_features, 1, requires_grad=True)
bias = torch.randn(1, requires_grad=True)

In [95]:
weights, bias

(tensor([[-0.0658],
         [-0.4783],
         [-0.6786],
         [ 1.3558],
         [ 0.6845]], requires_grad=True),
 tensor([0.2683], requires_grad=True))

In [96]:
# STEP - 2: forward pass
prediction = torch.matmul(X, weights) + bias

In [97]:
# initial prediction
prediction[:5]

tensor([[1.0859],
        [1.0859],
        [1.0859],
        [1.0859],
        [1.0859]], grad_fn=<SliceBackward0>)

> Again same predictions.

In [98]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()

In [99]:
# currently we don't have any grad
type(weights.grad)

NoneType

In [100]:
# STEP - 4: Calculate the grad
loss.backward()

In [101]:
# now we have the grad!
type(weights.grad), weights.grad

(torch.Tensor,
 tensor([[0.1882],
         [0.1882],
         [0.1882],
         [0.1882],
         [0.1882]]))

> That means, no matter what the weight values are, every weight gets the same grad when every feature value is the same.
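
A quick sketch of why: when every entry of `X` is `1`, the factor `x_j` in each weight's grad is `1`, so every weight (and the bias too) ends up with the same grad, namely the mean of `prediction - y`. The tensors below are toy stand-ins with `y` shaped `(n, 1)`, not the notebook's data.

```python
import torch

torch.manual_seed(0)

X = torch.ones(6, 4)   # every feature value is 1
y = torch.randn(6, 1)  # toy targets

# Random (different) weights, as in the cell above
w = torch.randn(4, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

pred = X @ w + b
loss = ((y - pred) ** 2 / 2).mean()
loss.backward()

mean_error = (pred - y).mean().detach()

print(w.grad.squeeze())  # one value repeated for every weight
print(b.grad, mean_error)
```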

# Now, let's try making all values within each column the same (all rows are duplicates)

In [111]:
# Define the fixed values for the features
fixed_values = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32)

# Create a 2D array with 10000 rows and 5 features, all set to the fixed values
X = torch.tile(fixed_values, (10000, 1))

In [112]:
X

tensor([[1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        ...,
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.]])

In [113]:
# STEP - 1: *CONSTANT WEIGHTS*
weights = torch.full((n_features, 1), 0.42, requires_grad=True) # fill with constant
bias = torch.tensor(.123, requires_grad=True)

In [114]:
weights, bias

(tensor([[0.4200],
         [0.4200],
         [0.4200],
         [0.4200],
         [0.4200]], requires_grad=True),
 tensor(0.1230, requires_grad=True))

In [115]:
# STEP - 2: forward pass
prediction = torch.matmul(X, weights) + bias

In [116]:
# initial prediction
prediction[:5]

tensor([[6.4230],
        [6.4230],
        [6.4230],
        [6.4230],
        [6.4230]], grad_fn=<SliceBackward0>)

> Again same predictions.

In [117]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()

In [118]:
# currently we don't have any grad
type(weights.grad)

NoneType

In [119]:
# STEP - 4: Calculate the grad
loss.backward()

In [120]:
# now we have the grad!
type(weights.grad), weights.grad

(torch.Tensor,
 tensor([[ 5.5253],
         [11.0507],
         [16.5759],
         [22.1013],
         [27.6265]]))

> Every row carries the same data, but each feature (column) has a different value. So each weight's grad scales with its feature value (1x, 2x, 3x, 4x, 5x of the first one). Nothing surprising.
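
To make the scaling explicit, here is a small sketch: with identical rows, the grad of weight `j` is `x_j` times the mean of `prediction - y`, so dividing each grad by its feature value should give one shared number. Toy sizes and a `(n, 1)` target are assumed here, so the numbers differ from the output above.

```python
import torch

torch.manual_seed(0)

fixed_values = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
X = torch.tile(fixed_values, (6, 1))  # 6 identical rows
y = torch.randn(6, 1)                 # toy targets

w = torch.full((5, 1), 0.42, requires_grad=True)
b = torch.tensor(0.123, requires_grad=True)

pred = X @ w + b
loss = ((y - pred) ** 2 / 2).mean()
loss.backward()

# Divide each grad by its feature value: what is left is mean(pred - y)
ratio = w.grad.squeeze() / fixed_values

print(w.grad.squeeze())
print(ratio)
```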


# Now making each row hold the same value in every feature...

In [125]:
n_rows = 10000

# Create the data where each row has the same value for all features
X = np.array([[i] * 5 for i in range(1, n_rows + 1)])
X

array([[    1,     1,     1,     1,     1],
       [    2,     2,     2,     2,     2],
       [    3,     3,     3,     3,     3],
       ...,
       [ 9998,  9998,  9998,  9998,  9998],
       [ 9999,  9999,  9999,  9999,  9999],
       [10000, 10000, 10000, 10000, 10000]])

In [126]:
X = torch.from_numpy(X).type(torch.float32)

In [128]:
# STEP - 1: *CONSTANT WEIGHTS*
weights = torch.full((n_features, 1), 0.42, requires_grad=True) # fill with constant
bias = torch.tensor(.123, requires_grad=True)

In [129]:
weights, bias

(tensor([[0.4200],
         [0.4200],
         [0.4200],
         [0.4200],
         [0.4200]], requires_grad=True),
 tensor(0.1230, requires_grad=True))

In [130]:
# STEP - 2: forward pass
prediction = torch.matmul(X, weights) + bias

In [131]:
# initial prediction
prediction[:5]

tensor([[ 2.2230],
        [ 4.3230],
        [ 6.4230],
        [ 8.5230],
        [10.6230]], grad_fn=<SliceBackward0>)

> This time the predictions differ from row to row, because each row holds a different value.

In [132]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()

In [133]:
# currently we don't have any grad
type(weights.grad)

NoneType

In [134]:
# STEP - 4: Calculate the grad
loss.backward()

In [135]:
# now we have the grad!
type(weights.grad), weights.grad

(torch.Tensor,
 tensor([[70006624.],
         [70006624.],
         [70006624.],
         [70006624.],
         [70006624.]]))

> As expected? Exactly! Every column now carries **the same** information, so every weight gets the same grad.

# Let's visualize it...

Assume this is the net:

<img src="../images/nn-constant.png">

So... let's have it.

In [18]:
n_features = 3
X = torch.tensor([[10, 5, 1]], dtype=torch.float32)
X

tensor([[10.,  5.,  1.]])

In [23]:
# STEP - 1: *CONSTANT WEIGHTS*
weights_1 = torch.full((n_features, 2), 1.0, requires_grad=True) # fill with constant

In [24]:
weights_1

tensor([[1., 1.],
        [1., 1.],
        [1., 1.]], requires_grad=True)

In [25]:
# STEP - 2.1: forward pass
hidden_output = torch.matmul(X, weights_1)
hidden_output

tensor([[16., 16.]], grad_fn=<MmBackward0>)

> The above are the intermediate outputs... the values in the hidden layer.

In [26]:
weights_2 = torch.full((2, 1), 1.0, requires_grad=True)
weights_2

tensor([[1.],
        [1.]], requires_grad=True)

In [27]:
# STEP - 2.2: forward pass
prediction = torch.matmul(hidden_output, weights_2)
prediction

tensor([[32.]], grad_fn=<MmBackward0>)

In [28]:
y = torch.tensor(.42)

In [29]:
# STEP - 3: Calculate the loss
loss = (y - prediction) ** 2 / 2
loss = loss.mean()

In [31]:
# currently we don't have any grad
type(weights_1.grad), type(weights_2.grad) 

(NoneType, NoneType)

In [35]:
# STEP - 4: Calculate the grad
loss.backward()

In [37]:
# now we have the grad!
weights_1.grad, weights_2.grad

(tensor([[315.8000, 315.8000],
         [157.9000, 157.9000],
         [ 31.5800,  31.5800]]),
 tensor([[505.2800],
         [505.2800]]))

> Great!
> This is as expected. Weights attached to the same input get the same grad in both hidden units, and that grad scales with the input value (10, 5, 1).
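
The numbers above can be checked by hand with the chain rule. The sketch below redoes the backward pass manually for this tiny net (no activation, no bias, as in the cells above): `delta` is the derivative of the loss with respect to the prediction, and the variable names are just illustrative.

```python
import torch

X = torch.tensor([[10.0, 5.0, 1.0]])
w1 = torch.ones(3, 2)
w2 = torch.ones(2, 1)
y = torch.tensor(0.42)

hidden = X @ w1   # [[16., 16.]]
pred = hidden @ w2  # [[32.]]

# d/dpred of (y - pred)^2 / 2 is (pred - y)
delta = pred - y  # [[31.58]]

grad_w2 = hidden.T @ delta      # each hidden value (16) times delta
grad_w1 = X.T @ (delta @ w2.T)  # each input times delta times the outgoing weight

print(grad_w1)
print(grad_w2)
```

Both columns of `grad_w1` are equal because both hidden units see the same inputs and have the same outgoing weight.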

# <u>KEY SUMMARY</u>:

Imagine you have a neural network with just one hidden layer and two hidden units (neurons). These two neurons take in the same inputs, and they’re both connected to the output. The network is simple, but here’s where symmetry can mess things up:

## 📍 Symmetry Problem
**Identical Weights**: Let’s say you initialize the weights (parameters) of both hidden units to the same value, like some constant 𝑐. **Because they’re the same**, each hidden neuron will compute exactly the same output since they’re given the same inputs.

**Forward Propagation Outcome**: When you pass inputs through the network, both neurons in the hidden layer will produce the same activation. ***This means they’re not contributing unique information to the output layer***. It’s as if you only had one neuron instead of two.

**Backpropagation Issue**: During training, when you calculate gradients in backpropagation, these gradients will also be identical for each neuron’s weights, because the neurons' outputs and their outgoing weights are identical. If you update the weights with these identical gradients, both neurons' weights change by the same amount and so stay equal to each other, keeping the neurons "stuck" in sync.

**End Result**: The two neurons continue to behave like one—they can’t learn different features or representations from the input. So, the network isn’t using its full potential; it's as if the hidden layer has only one effective neuron.
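
Here is a minimal sketch of that "stuck in sync" behaviour, using the same tiny net as in the visualization section and a few plain gradient-descent steps. The learning rate `lr` and the number of steps are hypothetical values picked only for illustration.

```python
import torch

X = torch.tensor([[10.0, 5.0, 1.0]])
y = torch.tensor([[0.42]])

# Constant initialization: both hidden units start identical
w1 = torch.full((3, 2), 1.0, requires_grad=True)
w2 = torch.full((2, 1), 1.0, requires_grad=True)

lr = 1e-4  # hypothetical learning rate

for step in range(5):
    pred = (X @ w1) @ w2
    loss = ((y - pred) ** 2 / 2).mean()
    loss.backward()
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
    w1.grad.zero_()
    w2.grad.zero_()

# The weights have moved, but the two hidden units are still copies of each other
print(w1)
print(torch.allclose(w1[:, 0], w1[:, 1]))
print(torch.allclose(w2[0], w2[1]))
```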

## Why It’s Bad
When all neurons in a layer behave the same way, **you’re wasting the network’s capacity**. Ideally, each neuron should learn to detect different patterns or features in the data, helping the network perform better.

## How to Fix It
One way to avoid this is to initialize the weights randomly rather than with the same constant. Random initialization "breaks" the symmetry so that neurons have different starting points, allowing them to learn different features over time. Techniques like dropout regularization (which temporarily disables some neurons during training) also help break symmetry, letting the network learn in a more diverse way.
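
As a rough illustration of random initialization breaking the symmetry, the sketch below reuses the tiny net from the visualization section but draws the weights with `torch.randn`. The seed is arbitrary; the point is only that the two hidden units now receive different grads.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

X = torch.tensor([[10.0, 5.0, 1.0]])
y = torch.tensor([[0.42]])

# Random initialization: the two hidden units start out different
w1 = torch.randn(3, 2, requires_grad=True)
w2 = torch.randn(2, 1, requires_grad=True)

pred = (X @ w1) @ w2
loss = ((y - pred) ** 2 / 2).mean()
loss.backward()

# Column k of w1.grad is X^T * (pred - y) * w2[k], so different outgoing
# weights give the two hidden units different grads
print(w1.grad)
print(torch.allclose(w1.grad[:, 0], w1.grad[:, 1]))
```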

## Key Takeaway
The issue is that identical starting weights cause neurons to behave identically, effectively reducing the network’s complexity. Randomizing weights at the start helps ensure that each neuron can learn independently and contribute to a more expressive model.