# <b>Activation Function</b>


### What is it
***
An activation function defines how the input to a neuron is transformed into its output. It introduces <b>non-linearity</b> into the model, enabling neural networks to learn complex patterns and relationships within the data.



### About non-linearity
***
Without non-linearity, a neural network would behave like a purely linear system, no matter how many layers it has, meaning it could only model linear relationships. This would greatly limit its ability to learn and represent complex patterns or solve real-world problems where relationships are often <b>non-linear</b>. The quadratic equation below is one example of a non-linear relationship:
$$\text{Price} = a \cdot (\text{Size})^2 + b \cdot \text{Size} + c$$

Decision Tree Example:

A decision tree might predict price with rules like:

If Size > 100 sqm and Neighborhood = Premium, then Price = High.

If Size <= 100 sqm and Neighborhood = Standard, then Price = Medium.

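Rules like these split the feature space into regions, so the prediction is a non-linear function of the inputs. This creates non-linear decision boundaries.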

In a neural network, activation functions like <i>ReLU</i>, <i>sigmoid</i>, or <i>tanh</i> remove the linear-only limitation. They apply a non-linear transformation to each neuron's weighted input, which lets the network form <b>non-linear decision boundaries.</b>
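
To make the "purely linear system" point concrete, here is a minimal NumPy sketch. The weights `W1`, `W2`, the biases, and the input `x` are hand-picked, hypothetical values chosen only for illustration. Without an activation, the two layers collapse into a single linear map; with ReLU between them, the same weights compute $|x_1 - x_2|$, which no single linear layer can represent.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)

x = np.array([1.0, 2.0])

# No activation: two stacked linear layers ...
linear_out = W2 @ (W1 @ x + b1) + b2
# ... are equivalent to one linear layer with merged weights and bias.
collapsed_out = (W2 @ W1) @ x + (W2 @ b1 + b2)

# ReLU between the layers: the network now computes |x1 - x2|.
relu_out = W2 @ relu(W1 @ x + b1) + b2

print(linear_out, collapsed_out, relu_out)
```

With these particular weights the merged matrix `W2 @ W1` is all zeros, so the activation-free network outputs 0 for every input, while the ReLU version outputs the absolute difference of the two inputs.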


>Note: A linear relationship is a relationship between two variables where one variable changes at a <b>constant rate</b> with respect to the other. Plotted, it is a straight line. An example is the slope-intercept form:
$$\text{Price} = a \cdot \text{Size} + b$$




### Non-linear decision boundaries
***
Boundaries that separate data points of different classes in a non-linear way. Unlike linear decision boundaries (straight lines or planes), non-linear boundaries can take on curved, wavy, or more intricate shapes to better fit complex data distributions (see the image below).

![image.png](attachment:image.png)
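
As a small illustration of a boundary that a straight line cannot draw, the sketch below uses the classic XOR dataset. All weights are hand-set, hypothetical values, not learned ones. A least-squares linear fit predicts 0.5 for every point and so cannot separate the two classes, while a tiny network with one ReLU hidden layer classifies all four points correctly.

```python
import numpy as np

# XOR dataset: the label is 1 when exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best linear model (least squares, with a bias column appended).
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_scores = A @ coef  # the same score for every point

# Hand-set network: 2 inputs -> 2 ReLU hidden units -> 1 output.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # shape (hidden, inputs)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

hidden = np.maximum(0.0, X @ W1.T + b1)   # ReLU hidden layer
scores = hidden @ w2
predictions = (scores > 0.5).astype(float)

print(linear_scores)
print(predictions)
```

The hidden layer computes `relu(x1 + x2)` and `relu(x1 + x2 - 1)`; subtracting twice the second from the first bends the output back down at (1, 1), which is what produces the non-linear boundary.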

### About the activation functions
***
1. <b>Sigmoid</b>
    - Formula: $$f(x) = \frac{1}{1 + e^{-x}}$$
    - Output Range: 0 to 1
    - Use Case: Binary classification problems, often in the output layer to represent probabilities.
    - <b>Pros: </b>
        - Smooth, differentiable function
        - Outputs can be interpreted as probabilities
    - <b>Cons: </b>
        - Saturates for large positive or negative inputs, leading to <i>vanishing gradients</i>.
        - Outputs are not zero-centered, which can slow down learning

2. <b>Tanh (Hyperbolic Tangent) Function</b>
    - Formula: $$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
    - Output Range: -1 to 1
    - Use Case: Hidden layers in neural networks, especially when data is centered around zero
    - <b>Pros: </b>
        - Zero-centered outputs help with optimization
        - Better than sigmoid for many tasks.
    - <b>Cons: </b>
        - Suffers from the vanishing gradient problem for extreme inputs.
        
3. <b>ReLU (Rectified Linear Unit)</b>
    - Formula: $$f(x) = \max(0, x)$$
    - Output Range: 0 to ∞
    - Use Case: Very common in hidden layers of deep networks due to its simplicity and efficiency
    - <b>Pros: </b>
        - Computationally efficient
        - Reduces likelihood of vanishing gradients.
    - <b>Cons: </b>
        - Can cause the <i>dying ReLU</i> problem: a neuron whose input is always negative outputs 0 and receives zero gradient, so it stops learning.

5. <b>Softmax</b>
    - Formula: $$f_i(x) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$ (where 𝑖 represents the 𝑖-th class out of 𝑁)
    - Output Range: 0 to 1 (values sum to 1)
    - Use Case: Multi-class classification problems, often in the output layer.
    - <b>Pros: </b>
        - Converts logits into probabilities across multiple classes.
    - <b>Cons: </b>
        - May amplify differences in logits, making predictions too confident.

6. <b>Swish</b>
    - Formula: $$f(x) = x \cdot \text{sigmoid}(x)$$
    - Output Range: approximately −0.28 to ∞ (bounded below, unbounded above)
    - Use Case: Emerging as an alternative to ReLU in deep learning models, especially in some of Google’s architectures.
    - <b>Pros: </b>
        - Smooth function
        - Avoids sharp boundaries, leading to better performance in some cases.
    - <b>Cons: </b>
        - Computationally more intensive than ReLU.

7. <b>ELU (Exponential Linear Unit)</b>
    - Formula: $$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases}$$
    - Output Range: −𝛼 to ∞ (where 𝛼 is a constant)
    - Use Case: Used in deeper networks to reduce vanishing gradients
    - <b>Pros: </b>
        - Outputs closer to zero help optimization
    - <b>Cons: </b>
        - Higher computational cost compared to ReLU.
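
The formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the default `alpha=1.0` for ELU, and the max-subtraction step in `softmax` (a common numerical-stability trick that leaves the result unchanged) are assumptions not stated in the text. The helper `sigmoid_grad` is added only to show the saturation that causes vanishing gradients.

```python
import numpy as np

def sigmoid(x):
    """1 / (1 + e^-x); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """(e^x - e^-x) / (e^x + e^-x); outputs lie in (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """max(0, x); outputs lie in [0, inf)."""
    return np.maximum(0.0, x)

def softmax(x):
    """e^{x_i} / sum_j e^{x_j} for a 1-D vector of logits."""
    shifted = x - np.max(x)  # assumption: stability shift, result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def swish(x):
    """x * sigmoid(x)."""
    return x * sigmoid(x)

def elu(x, alpha=1.0):
    """x if x > 0, else alpha * (e^x - 1); alpha=1.0 is an assumed default."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s). It peaks at x = 0 and shrinks toward 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("swish", swish), ("elu", elu), ("softmax", softmax)]:
    print(f"{name:8s}", np.round(fn(x), 4))

# Saturation: the sigmoid gradient is largest at 0 and nearly 0 at the extremes.
print("sigmoid'", sigmoid_grad(x))
```

Printing `sigmoid_grad(x)` shows the gradient at `x = ±10` is orders of magnitude smaller than at `x = 0`, which is the saturation behavior listed under the sigmoid and tanh cons.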

### Zero-centered
***
An activation function is zero-centered when its output values are distributed around zero, meaning the average (or mean) of the outputs is close to zero. This property is particularly useful for faster and more stable training of neural networks.
- Why it is important
    1. Improves Gradient Descent
        - Zero-centered activation functions ensure that the gradients during backpropagation don't become biased in a particular direction (e.g., all positive or all negative).
        - This leads to better weight updates and more efficient optimization.
    
    2. Balances the Inputs to Next Layers
        - Zero-centered activation functions provide a balanced mix of positive and negative outputs, which stabilizes learning and prevents one part of the network from dominating.

- Examples
    - Tanh Function: Produces outputs in the range −1 to 1, making it zero-centered.
    - Functions like ReLU or sigmoid are not zero-centered, as their outputs are never negative: [0, ∞) for ReLU and (0, 1) for sigmoid.

    - Note: The sigmoid function is not zero-centered because its output values always lie between 0 and 1. The outputs are strictly positive, so there are no negative values.
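
A quick numerical check of the zero-centered property, as a sketch under stated assumptions: the symmetric input grid `x` and the upstream gradient value `upstream` are hypothetical, chosen only for illustration. On inputs that are symmetric around zero, tanh averages to about 0, while sigmoid and ReLU average to a positive value. The last lines show why that matters: when every activation feeding a neuron is positive, all of that neuron's weight gradients share the same sign.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations, symmetric around zero.
x = np.linspace(-3.0, 3.0, 601)

tanh_mean = np.tanh(x).mean()            # close to 0: zero-centered
sigmoid_mean = sigmoid(x).mean()         # close to 0.5: not zero-centered
relu_mean = np.maximum(0.0, x).mean()    # positive: not zero-centered

print(f"tanh mean:    {tanh_mean:.4f}")
print(f"sigmoid mean: {sigmoid_mean:.4f}")
print(f"relu mean:    {relu_mean:.4f}")

# Gradient of a neuron's weights = upstream gradient * incoming activations.
upstream = -0.7                          # hypothetical dL/dz for one neuron
sample = np.array([-2.0, -0.5, 0.5, 2.0])
grad_w_sigmoid = upstream * sigmoid(sample)   # all entries share one sign
grad_w_tanh = upstream * np.tanh(sample)      # mixed signs

print(grad_w_sigmoid)
print(grad_w_tanh)
```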