Is it OK to initialize all the weights to the same value as long as that value is
selected randomly using He initialization?

No, it is not OK to initialize all the weights to the same value, even if that value is selected randomly using He initialization (or any other initialization scheme).


When training a neural network, the neurons in a layer need to learn different features from the data. If all weights are initialized to the same value, then:

- Every neuron in a layer computes the same output and receives the same gradients during backpropagation.
- They all get updated identically during training.
- They therefore keep learning the same features, so the layer effectively behaves as if it had only one neuron, not many.

This problem is known as the symmetry problem. Random, distinct initial weights are what break the symmetry.
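The symmetry problem can be checked numerically. The following is a minimal sketch, assuming a tiny hypothetical network (2 inputs, 3 tanh hidden neurons, 1 linear output, squared error on a single sample); the network, the shared value `c`, and the helper `forward_backward` are illustrative choices, not part of the original answer.

```python
import math

def forward_backward(W1, w2, x, y):
    """Gradients of 0.5 * (out - y)**2 for hidden = tanh(W1 @ x), out = w2 . hidden."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [math.tanh(zi) for zi in z]
    out = sum(w * hi for w, hi in zip(w2, h))
    err = out - y
    grad_w2 = [err * hi for hi in h]
    grad_W1 = [[err * w2[j] * (1.0 - h[j] ** 2) * xi for xi in x]
               for j in range(len(W1))]
    return grad_W1, grad_w2

c = 0.3                               # the single shared initial value (hypothetical)
W1 = [[c, c] for _ in range(3)]       # 3 hidden neurons, 2 inputs each
w2 = [c, c, c]
x, y = [1.0, -2.0], 0.5
lr = 0.1

for _ in range(100):                  # plain gradient descent
    gW1, gw2 = forward_backward(W1, w2, x, y)
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    w2 = [w - lr * g for w, g in zip(w2, gw2)]

# The weights have moved, but every hidden neuron still has the same weights.
for row in W1:
    print(row)
```

After 100 updates the three rows of `W1` are still identical to each other, which is the "one neuron, not many" behavior described above.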

Is it OK to initialize the bias terms to 0?

Yes, it is fine to initialize the bias terms to 0. Biases are not involved in symmetry breaking in the same way weights are.

Weights determine how inputs are combined. If they are the same, neurons become indistinguishable — this is bad.
Biases just shift the output of the activation function. Initializing them to zero does not prevent neurons from learning different features, as long as weights are initialized properly (e.g., with He or Xavier initialization).



Imagine this neuron:

$z = w \cdot x + b$

If all biases are $b = 0$, the variation in outputs comes from $w \cdot x$, which is already randomized via proper weight initialization. As learning progresses, the bias values will adapt as needed.



In some deep or very sparse networks, initializing biases to small positive values (e.g., 0.01) can help keep neurons from starting out inactive (e.g., with ReLU activations). But even then, zero is a reasonable default.
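As a rough illustration of why zero biases are harmless, the sketch below draws He-normal weights (standard deviation sqrt(2 / fan_in)) for a few neurons and leaves every bias at 0. The layer sizes, the seed, and the input vector are arbitrary assumptions made for the example.

```python
import math
import random

rng = random.Random(42)               # fixed seed, chosen arbitrarily
fan_in, n_neurons = 4, 3              # hypothetical layer shape

# He normal initialization: zero mean, std = sqrt(2 / fan_in)
std = math.sqrt(2.0 / fan_in)
W = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(n_neurons)]
b = [0.0] * n_neurons                 # all biases start at zero

x = [0.5, -1.0, 2.0, 0.1]             # one made-up input vector
z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

# Each neuron gets a different pre-activation even though b = 0 everywhere.
for j, zj in enumerate(z):
    print(f"neuron {j}: z = {zj:.4f}")
```

The pre-activations differ because the weights differ, so the symmetry is already broken without any help from the biases.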



Name three advantages of the SELU activation function over ReLU.

1.  Self-Normalizing Behavior
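Advantage: Under the right conditions (a stack of plain dense layers, LeCun initialization, standardized inputs), SELU keeps the mean and standard deviation of each layer's activations close to 0 and 1, respectively, throughout the network.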
Why it matters: This stabilizes and accelerates training by reducing the risk of exploding/vanishing gradients.
ReLU drawback: ReLU doesn’t normalize activations and often requires Batch Normalization to stabilize training.


2.  Negative Outputs & Near-Zero Mean
Advantage: SELU outputs both positive and negative values, which helps keep the mean activation close to zero.
Why it matters: This makes optimization easier and improves convergence.
ReLU drawback: ReLU outputs are always ≥ 0, which can shift the mean of activations and slow down training.


3.  Avoids Dead Neurons
Advantage: SELU has a nonzero gradient for negative inputs (due to its exponential part), so neurons do not "die" the way ReLU neurons can.
Why it matters: Every neuron can still contribute to learning.
ReLU drawback: A neuron can "die" (i.e., get stuck outputting 0 for all inputs) when its weighted sum is negative for every training instance, because ReLU's gradient is 0 there and the weights stop updating.
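The difference on negative inputs can be seen directly from the two functions and their derivatives. This is a small sketch; the constants `ALPHA` and `SCALE` are the standard SELU values, and the function names are just illustrative.

```python
import math

ALPHA = 1.6732632423543772   # standard SELU alpha
SCALE = 1.0507009873554805   # standard SELU scale (lambda)

def selu(z):
    return SCALE * (z if z > 0 else ALPHA * (math.exp(z) - 1.0))

def selu_grad(z):
    return SCALE * (1.0 if z > 0 else ALPHA * math.exp(z))

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

for z in (-5.0, -2.0, -0.5, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f} (grad {relu_grad(z):.4f})  "
          f"selu={selu(z):+.4f} (grad {selu_grad(z):.4f})")
```

For every negative `z`, ReLU's output and gradient are both 0, while SELU returns a negative value with a small but nonzero gradient, so the weights can still be updated.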

In which cases would you want to use each of the following activation functions:
SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

| Activation                   | Best Used In                         | Typical Use Case                   | Output Range  | Notes                              |
| ---------------------------- | ------------------------------------ | ---------------------------------- | ------------- | ---------------------------------- |
| **SELU**                     | Deep dense nets                      | Self-normalizing nets              | \~(-1.76, ∞)  | Use with LeCun init, no BN/dropout |
| **Leaky ReLU / PReLU / ELU** | Deep nets                            | Avoiding dead neurons              | (-∞, ∞); ELU: (-α, ∞) | Keeps gradients flowing    |
| **ReLU**                     | Most hidden layers (default)         | CNNs, MLPs                         | \[0, ∞)       | Simple and efficient               |
| **tanh**                     | RNNs, shallow nets                   | Centered activations               | (-1, 1)       | Saturates for large inputs, so gradients can vanish |
| **Logistic (sigmoid)**       | Binary classification (output layer) | Probabilistic output               | (0, 1)        | Rarely used in hidden layers today |
| **Softmax**                  | Multi-class classification           | Output probabilities (multi-class) | (0, 1), sum=1 | Normally used only in the output layer |
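To make a few of the output ranges above concrete, here is a short sketch of leaky ReLU, tanh, the logistic function, and softmax in plain Python. The function names and the leak slope of 0.01 are assumptions made for the example.

```python
import math

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for negative inputs (0.01 is a common choice)
    return z if z > 0 else alpha * z

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)                         # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(leaky_relu(-10.0))                    # negative, not clipped to 0
print(math.tanh(-3.0), math.tanh(3.0))      # close to -1 and +1
print(logistic(-3.0), logistic(3.0))        # strictly between 0 and 1
probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))                    # class probabilities summing to 1
```

The logistic output suits a single binary probability, while softmax turns a vector of scores into one probability per class.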


What may happen if you set the momentum hyperparameter too close to 1 (e.g.,
0.99999) when using an SGD optimizer?

If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using SGD with momentum, the optimizer can become unstable or very slow to converge.

⚠️ Potential Issues:

1. Overshooting and Oscillations

What happens: The velocity keeps accumulating past gradients for a very long time, so the optimizer picks up speed, shoots past the minimum, comes back, and oscillates around it.
Why: With momentum this high, the update is driven almost entirely by past gradients, even if the current gradient is small or changes direction.

2. Very Slow Convergence

What happens: Because the accumulated velocity decays so slowly, the optimizer takes a long time to correct its course, and the oscillations need many steps to die out.
Why: It becomes “inertia-heavy,” resisting necessary changes in direction.

3. Divergence

What happens: In some cases, especially with high learning rates or poor initialization, the model’s weights may diverge completely.
Why: Too much momentum acts like pushing a ball too fast downhill: it can skip over the minimum entirely.

🧠 Intuition: What is Momentum?
In SGD with momentum, the update is:

$$v_t = \gamma v_{t-1} + \eta \nabla L(\theta_t)$$

$$\theta_{t+1} = \theta_t - v_t$$

$\gamma$: momentum (typically ~0.9)
$\eta$: learning rate

With $\gamma = 0.99999$, the old velocity term $v_{t-1}$ dominates nearly all updates, drowning out new gradients.

| Momentum Value | Effect                                       |
| -------------- | -------------------------------------------- |
| 0.0            | Pure SGD (no momentum) → slow                |
| 0.9 (typical)  | Balanced, accelerates in right directions    |
| > 0.95         | Faster convergence, but risk of instability  |
| \~0.99999      | 🚨 Too aggressive — may diverge or oscillate |
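The update rule above can be simulated on a toy problem to see the effect. This is a minimal sketch, assuming the one-dimensional loss L(θ) = 0.5·θ² (so the gradient is just θ), a learning rate of 0.01, and a hypothetical helper `run_momentum`; none of these choices come from the original answer.

```python
def run_momentum(gamma, lr=0.01, theta0=1.0, tol=1e-3, max_steps=20_000):
    """Run SGD with momentum on L(theta) = 0.5 * theta**2.

    Returns (steps needed to converge or None, number of times theta changed sign).
    """
    theta, v = theta0, 0.0
    sign_changes = 0
    for step in range(1, max_steps + 1):
        grad = theta                      # dL/dtheta for the toy loss
        v = gamma * v + lr * grad         # velocity update
        new_theta = theta - v             # parameter update
        if new_theta * theta < 0:
            sign_changes += 1             # theta jumped across the minimum at 0
        theta = new_theta
        if abs(theta) < tol and abs(v) < tol:
            return step, sign_changes
    return None, sign_changes

fast_steps, fast_crossings = run_momentum(gamma=0.9)
slow_steps, slow_crossings = run_momentum(gamma=0.99999)
print("gamma=0.9     ->", fast_steps, "steps,", fast_crossings, "crossings")
print("gamma=0.99999 ->", slow_steps, "steps,", slow_crossings, "crossings")
```

With γ = 0.9 the run settles within a few hundred steps. With γ = 0.99999 it is still swinging back and forth across the minimum when the step budget runs out, which matches the "overshoot, oscillate, converge very slowly" description.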


Name three ways you can produce a sparse model.

1. 🧹 L1 Regularization (Lasso)
How it works: Adds a penalty to the loss function based on the sum of the absolute values of the weights:

$$\text{Loss} + \lambda \sum_i |w_i|$$


Effect: Drives many weights to exactly zero, producing a sparse model.
Used in: Linear models, logistic regression, neural networks.
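One way to see how an L1 penalty yields exact zeros is the soft-thresholding (proximal) step used by many L1 solvers. The sketch below is an illustration under that assumption, not the only way L1 is optimized; the weights and the threshold `lam` are made-up values.

```python
def soft_threshold(w, lam):
    """Proximal step for the penalty lam * |w|: shrink toward 0, clip small values to 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [0.8, -0.05, 0.02, -1.3, 0.09]   # hypothetical weights
lam = 0.1                                  # hypothetical threshold

sparse_weights = [soft_threshold(w, lam) for w in weights]
print(sparse_weights)
print("zeros:", sparse_weights.count(0.0), "of", len(sparse_weights))
```

Weights whose magnitude is below `lam` become exactly 0, while larger weights are only shrunk, which is the sparsity effect described above.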


2. ✂️ Pruning (Weight or Neuron Pruning)
How it works:
After training, remove weights (or neurons) that are close to zero or have little impact on output.
Can be done manually or with frameworks like TensorFlow Model Optimization Toolkit.
Types:
Unstructured pruning: Individual weights removed.
Structured pruning: Entire filters, channels, or neurons removed.
Used in: Neural networks, especially to reduce size for deployment.
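Unstructured magnitude pruning can be sketched in a few lines. This is a simplified illustration on a flat list of made-up weights, with a hypothetical helper `magnitude_prune`; real toolkits apply the same idea per layer and usually fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)       # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned

weights = [0.8, -0.05, 0.02, -1.3, 0.09, 0.4, -0.01, 0.6]   # hypothetical trained weights
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Half of the weights (the four smallest in magnitude) are set to zero and the larger ones are left untouched.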


3. 🧬 Sparse Initialization or Architectures
How it works:
Design models with a sparse architecture from the beginning (e.g., use fewer connections).
Or initialize weights with many zeros and allow only important ones to be trained (dynamic sparse training).
Examples:


Lottery Ticket Hypothesis: Find sparse subnetworks that train just as well.
Sparse Transformers: Used in NLP to reduce attention complexity.

| Method                   | Description                         | Common Use Cases                |
| ------------------------ | ----------------------------------- | ------------------------------- |
| **L1 Regularization**    | Penalizes non-zero weights          | Regression, neural networks     |
| **Pruning**              | Removes unimportant weights/neurons | Model compression, deployment   |
| **Sparse Architectures** | Use sparsity by design              | Transformers, large neural nets |


Does dropout slow down training? Does it slow down inference (i.e., making
predictions on new instances)? What about MC Dropout?

✅ 1. Dropout during Training

⏳ Does Dropout slow down training?
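Yes, somewhat.
Why:
It randomly masks neurons at each training step, so every step trains a thinned version of the network and convergence typically takes more epochs.
The masking itself adds only a bit of overhead, and modern hardware handles this well.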
Tradeoff:
Small computational cost vs. significant regularization benefit (helps prevent overfitting).
✅ 2. Dropout during Inference (Standard Dropout)

🚀 Does Dropout slow down inference?
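No. Dropout is disabled during inference in standard usage.
Instead, the model uses the full network. With the inverted dropout used by most modern frameworks, the kept activations are already scaled up by 1 / (1 - rate) during training, so no extra scaling is needed at inference.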
So:
Inference is fast, just like a regular feedforward pass.
No randomness — deterministic output.
🔁 3. MC Dropout (Monte Carlo Dropout)

⏳ Does MC Dropout slow down inference?
Yes, significantly.
Why:
MC Dropout keeps dropout active at inference time to estimate uncertainty.
Multiple forward passes (e.g., 10–100) are made for the same input, each with different dropout masks.
The predictions are averaged (for mean prediction) or used to calculate uncertainty (variance).


📊 Use Case:
When you need uncertainty estimates in predictions.
Common in Bayesian deep learning, medical AI, safety-critical applications.
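The difference between standard inference and MC Dropout can be sketched without any framework. The example below assumes inverted dropout on a single linear output, with made-up activations `h`, weights `w`, a dropout rate of 0.5, and a hypothetical helper `dropout_forward`; it illustrates the idea and is not a framework implementation.

```python
import random
import statistics

def dropout_forward(h, w, rate, rng, training):
    """Linear output over activations h; inverted dropout is applied only when training=True."""
    keep = 1.0 - rate
    total = 0.0
    for hi, wi in zip(h, w):
        if training:
            if rng.random() < keep:        # keep this unit
                total += wi * hi / keep    # scale up to preserve the expected value
        else:
            total += wi * hi               # dropout off: full network, no scaling
    return total

rng = random.Random(0)
h = [0.1 * (i % 5 + 1) for i in range(50)]   # hypothetical hidden activations
w = [0.2] * 50                               # hypothetical output weights

# Standard inference: one deterministic pass.
deterministic = dropout_forward(h, w, 0.5, rng, training=False)

# MC Dropout: many stochastic passes over the same input.
n_passes = 1000
mc_preds = [dropout_forward(h, w, 0.5, rng, training=True) for _ in range(n_passes)]
mc_mean = statistics.mean(mc_preds)
mc_std = statistics.stdev(mc_preds)

print("deterministic:", deterministic)
print("MC mean:", mc_mean, "MC std (uncertainty):", mc_std)
```

The MC estimate costs `n_passes` forward passes instead of one, which is where the inference slowdown comes from. In exchange, the spread of the predictions gives an uncertainty estimate.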

| Method         | Training Speed  | Inference Speed                          | Stochastic Output at Inference? | Use Case                          |
| -------------- | --------------- | ---------------------------------------- | ------------------------------- | --------------------------------- |
| **Dropout**    | Slightly slower | Fast                                     | ❌ No                            | Regularization during training    |
| **MC Dropout** | Same as Dropout | **Much slower** (due to multiple passes) | ✅ Yes                           | Predictive uncertainty estimation |


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras


In [2]:
from tensorflow.keras.initializers import lecun_normal

# 20 hidden layers of 100 SELU units each, all with LeCun normal initialization
hidden_layers = [
    keras.layers.Dense(100, activation="selu", kernel_initializer=lecun_normal())
    for _ in range(20)
]

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[32, 32, 3]),   # CIFAR-10 images: 32x32 RGB
    *hidden_layers,
    keras.layers.Dense(10, activation="softmax"),    # one output per class
])



In [3]:
# Integer class labels, so use the sparse variants of the loss and metric
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)


In [4]:
model.summary()

In [5]:
from tensorflow.keras.datasets import cifar10
(x_train,y_train),(x_test,y_test) = cifar10.load_data()

In [6]:
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0


In [7]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(
    monitor='val_loss',
    patience = 5,
    restore_best_weights = True
)

In [8]:
history = model.fit(
    x_train,y_train,
    epochs=50,
    validation_split=0.2,
    callbacks=[early_stop],
    batch_size =64
)

Epoch 1/50
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 5ms/step - loss: 2.3338 - sparse_categorical_accuracy: 0.1477 - val_loss: 2.0017 - val_sparse_categorical_accuracy: 0.2161
Epoch 2/50
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 1.9834 - sparse_categorical_accuracy: 0.2494 - val_loss: 1.8951 - val_sparse_categorical_accuracy: 0.2951
Epoch 3/50
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 1.8913 - sparse_categorical_accuracy: 0.2964 - val_loss: 1.9069 - val_sparse_categorical_accuracy: 0.2917
Epoch 4/50
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 1.8323 - sparse_categorical_accuracy: 0.3277 - val_loss: 1.8342 - val_sparse_categorical_accuracy: 0.3252
Epoch 5/50
625/625 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - loss: 1.7894 - sparse_categorical_accuracy: 0.3482 - val_loss: 1.8205 - val_sparse_categorical_accuracy: 0.3565
