Why Use Activation Functions in Hidden Layers?
In a neural network, each node in a hidden layer takes inputs from the previous layer, processes them, and passes the result forward. This processing starts with a linear transformation: a weighted sum of the inputs, usually plus a bias term. Without an activation function, the network would perform only linear (strictly, affine) operations. Because a composition of linear layers is itself a single linear function, adding depth would add no expressive power, and the network could not model the non-linear relationships found in real-world tasks such as image or speech recognition.
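
The collapse of stacked linear layers can be checked by hand. The sketch below uses arbitrary illustrative weights (`W1`, `b1`, `W2`, `b2` are made-up values, not taken from any real model) and small helper functions written only for this example.

```python
# Two stacked linear (affine) layers with no activation in between.
# All weights are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]   # layer 1: 2 inputs -> 2 units
W2, b2 = [[3.0, 1.0]], [-2.0]                    # layer 2: 2 units -> 1 output

x = [2.0, 3.0]

# Layer by layer: y = W2 (W1 x + b1) + b2
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed into a single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # [21.5] [21.5]
```

Both paths give the same output for any input, so the second layer adds nothing a single layer could not already express.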

Activation functions add non-linearity to this process. By applying a non-linear operation to the linear transformation’s output, they enable the network to learn and approximate intricate patterns, making them essential for hidden layers.
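
As a small illustration of what non-linearity buys, here is a sketch of a one-hidden-layer network with two ReLU units that computes the absolute value, something no purely linear model can do. The function name `abs_net` and its hand-picked weights (+1 and -1 into the hidden units, 1 and 1 out) are assumptions chosen for this example.

```python
def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

def abs_net(x):
    """Hidden units relu(+1 * x) and relu(-1 * x), summed with output weights 1 and 1."""
    return relu(x) + relu(-x)

print([abs_net(v) for v in (-2.0, 0.0, 3.0)])  # [2.0, 0.0, 3.0]
```

Any linear function f(x) = wx + b satisfies f(a) + f(-a) = 2 f(0), but here abs_net(2) + abs_net(-2) = 4 while 2 * abs_net(0) = 0, so the function is not linear.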

Common Activation Functions and Their Use Cases
Here are the most widely used activation functions, with notes on how well each suits hidden layers:

Sigmoid

Formula: $ \sigma(x) = \frac{1}{1 + e^{-x}} $
Range: (0, 1)
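Use Case: Rarely used in hidden layers because of the vanishing gradient problem: its derivative never exceeds 0.25 and approaches zero for large positive or negative inputs, so gradients shrink as they propagate back through many layers, slowing learning. It’s more common in output layers for binary classification.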
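
The vanishing gradient effect can be seen numerically. This is a minimal sketch using only Python's `math` module. The ten-layer product is a simplified best case that ignores the weights, which also multiply into the real gradient.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
print(sigmoid_grad(5.0))  # far smaller: the unit is saturated

# Best-case factor contributed by ten sigmoid layers (weights ignored).
print(0.25 ** 10)
```

Even in the best case, each sigmoid layer scales the backpropagated signal by at most 0.25, so the factor shrinks geometrically with depth.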


Hyperbolic Tangent (Tanh)

Formula: $ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $
Range: (-1, 1)
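Use Case: Used in hidden layers because its outputs are zero-centered (ranging from -1 to 1), which often helps optimization, but it saturates for large positive or negative inputs and so also suffers from vanishing gradients.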
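
A short sketch of tanh and its derivative, 1 - tanh(x)^2, using the standard library's `math.tanh`. The helper name `tanh_grad` is introduced here for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

print(math.tanh(0.0), tanh_grad(0.0))  # 0.0 1.0
print(tanh_grad(3.0))                  # close to zero: the unit is saturated
```

The gradient peaks at 1 when x = 0 (compared with 0.25 for sigmoid) but still decays toward zero as the input grows in magnitude.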


Rectified Linear Unit (ReLU)

Formula: $ \text{ReLU}(x) = \max(0, x) $
Range: [0, ∞)
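Use Case: The go-to choice for hidden layers. It is cheap to compute and, because its gradient is exactly 1 for positive inputs, it mitigates vanishing gradients. It can, however, lead to “dead neurons”: if a unit’s input is negative for every training example, its output and gradient are both zero and its weights stop updating.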
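
The dead-neuron case can be sketched with a single hypothetical unit. The weight `w`, bias `b`, and `inputs` below are made-up values chosen so that the pre-activation is always negative.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0 (0 chosen at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

# Hypothetical unit whose large negative bias keeps it switched off.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # [0.0, 0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0, 0.0]
```

With a zero gradient on every input, gradient descent has no signal with which to move `w` or `b`, so the unit stays at zero.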


Leaky ReLU

Formula: $ \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases} $ (where $ \alpha $ is small, e.g., 0.01)
Range: (-∞, ∞)
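Use Case: A fix for ReLU’s dead neuron problem: the small slope $ \alpha $ keeps the gradient non-zero for negative inputs, so a unit can still update when its input is negative. A common alternative to ReLU in hidden layers.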
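
A minimal sketch of Leaky ReLU and its gradient, using the example slope of 0.01 from the formula. The function names are chosen here for illustration.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(3.0), leaky_relu_grad(3.0))      # 3.0 1.0
print(leaky_relu(-10.0), leaky_relu_grad(-10.0))  # small negative output, gradient 0.01
```

Unlike plain ReLU, the gradient for a negative input is `alpha` rather than zero, so the weights still receive a small update.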


Softmax

Formula: $ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $
Range: (0, 1), sums to 1
Use Case: Not typically for hidden layers; it’s used in output layers for multi-class classification.
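
A sketch of softmax on a small list of scores. Subtracting the maximum before exponentiating is a common numerical-stability trick added here. It is not part of the formula above, but it gives the same result because the shift cancels in the ratio.

```python
import math

def softmax(z):
    """Softmax over a list of scores, shifted by max(z) to avoid overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # three probabilities, largest for the largest score
print(sum(probs))   # approximately 1.0

# Large scores would overflow a naive exp(1000.0); the shifted version is fine.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

Because the outputs depend on every score in the layer and are forced to sum to 1, softmax behaves as a normalizer over classes rather than an element-wise activation, which is why it sits at the output.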