## Activation functions



### **Activation Functions: What & Why?**  
1. **Activation functions** introduce **non-linearity** into a neural network, enabling it to learn **complex patterns**. 
2. They determine how strongly a neuron is **activated** by mapping its weighted input to the neuron's output.  

---

## **🔹 Why Are Activation Functions Important?**  
✅ **Introduce non-linearity** → Allow neural networks to model complex data.  
✅ **Enable deep learning** → Without activation functions, any stack of layers collapses into a **single linear layer** (one linear transformation, i.e. $Z = WX + b$), no matter how deep it is.  
✅ **Influence gradient flow** → The choice of activation affects how easily gradients vanish or explode during backpropagation.  
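
The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked weights (`W1`, `b1`, `W2`, `b2` are illustrative values, not from any trained model) and compares two stacked linear layers, the equivalent single layer, and the same stack with a ReLU in between.

```python
import numpy as np

# Hand-picked weights for illustration only
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 2.0])

# Two stacked linear layers, no activation in between
stacked = W2 @ (W1 @ x + b1) + b2

# The same map collapsed into a single layer Z = W x + b
W = W2 @ W1
b = W2 @ b1 + b2
collapsed = W @ x + b

# A ReLU between the layers breaks the equivalence
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print("stacked:  ", stacked)
print("collapsed:", collapsed)
print("with ReLU:", with_relu)
```

For these values the stacked and collapsed versions both give 0.5, while the ReLU version gives 1.5, so only the non-linear network computes something a single linear layer cannot.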

---

## **🔹 Types of Activation Functions**
### **Comparison of Activation Functions**  

| **Activation Function** | **Formula** | **Pros ✅** | **Cons ❌** | **Common Uses** |
|------------------|------------------|------------|------------|---------------|
| **Linear (Identity Function)** | $ f(x) = x $ | ✅ Used in **regression tasks** (output layer). | ❌ **No non-linearity** → Cannot learn complex patterns. | Regression models |
| **Step Function (Threshold Activation)** | $ f(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} $ | ✅ Used in **Perceptrons** for binary classification. | ❌ **Gradient is zero everywhere except at x = 0, where it is undefined**, so it **cannot be trained with gradient-based learning (e.g., backpropagation).** | Perceptrons (historical use) |
| **Sigmoid (Logistic Activation)** | $ f(x) = \frac{1}{1 + e^{-x}} $ | ✅ **Smooth**, outputs between **0 and 1** (useful for probability). <br> ✅ Used in **binary classification**. | ❌ **Vanishing gradient problem** → Inputs of large magnitude (positive or negative) saturate the function, so gradients become **very small** and learning slows. <br> ❌ Output is **not zero-centered**. | Logistic Regression, Binary Classification |
| **Tanh (Hyperbolic Tangent)** | $ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $ | ✅ **Zero-centered** output (-1 to 1), better than Sigmoid. <br> ✅ Used in **RNNs**, where balanced gradients are needed. | ❌ Still suffers from the **vanishing gradient problem**. | Recurrent Neural Networks (RNNs) |
| **ReLU (Rectified Linear Unit)** | $ f(x) = \begin{cases} x, & x > 0 \\ 0, & x \leq 0 \end{cases} $ | ✅ **Efficient** → Only requires `max(0, x)`. <br> ✅ **Mitigates vanishing gradients** (the gradient is 1 for positive inputs). <br> ✅ Used in **CNNs, Deep Networks**. | ❌ **Dying ReLU Problem** → A neuron whose input stays negative outputs **0** with zero gradient, so it stops updating (e.g., after poor weight initialization). | Deep Neural Networks (CNNs, MLPs) |
| **Leaky ReLU (Improved ReLU)** | $ f(x) = \begin{cases} x, & x > 0 \\ 0.01x, & x \leq 0 \end{cases} $ | ✅ Mitigates the **Dying ReLU** issue by keeping a small, non-zero gradient for negative inputs. <br> ✅ Used in **deep learning architectures**. | ❌ The negative slope (0.01 here) is a hyperparameter that may need tuning. | Deep Learning Architectures |
| **Softmax (Multi-Class Activation)** | $ f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} $ | ✅ Used for **multi-class classification (last layer in classifiers like CNNs).** <br> ✅ Outputs **probabilities that sum to 1**. | ❌ Requires an exponential per class, and a naive implementation can overflow for large inputs. | Multi-Class Classification (CNNs, NLP Models) |
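
The element-wise functions in the table can be written in a few lines of NumPy. This is a minimal sketch for experimenting with the formulas; the function names and the `negative_slope` parameter name are choices made here for illustration.

```python
import numpy as np

def step(x):
    # 1 where x >= 0, else 0 (the threshold definition from the table)
    return np.where(x >= 0, 1.0, 0.0)

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Same as ReLU for x > 0; scales negative inputs by a small slope
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Step:      ", step(x))
print("Tanh:      ", np.tanh(x))  # NumPy ships tanh directly
print("ReLU:      ", relu(x))
print("Leaky ReLU:", leaky_relu(x))
```

Comparing the ReLU and Leaky ReLU rows of the output shows the only difference between them: negative inputs map to 0 in one case and to a small negative value in the other.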

---

### **🚀 Final Takeaway**
- **ReLU** → A common default for **hidden layers** in deep networks.  
- **Softmax** → Standard for **multi-class classification** (last layer).  
- **Sigmoid/Tanh** → Used in **binary classification & RNNs**, but may cause **vanishing gradients**.  
- **Leaky ReLU** → Mitigates ReLU's **dying neuron problem**.  


## **🔹 Choosing the Right Activation Function**
| **Use Case** | **Best Activation** |
|-------------|------------------|
| **Binary Classification** | Sigmoid (last layer) |
| **Multi-Class Classification** | Softmax (last layer) |
| **Hidden Layers in Deep Networks** | ReLU / Leaky ReLU |
| **RNNs (Sequence Models)** | Tanh / Leaky ReLU |
| **Regression Output** | Linear (No activation) |


### Sigmoid implementation:

In [3]:
# In numpy
import numpy as np

def sigmoid(x):
    # Element-wise logistic function: maps any real input into (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2, -1, 0, 1, 2])
print("Sigmoid Output:", sigmoid(x))

Sigmoid Output: [0.11920292 0.26894142 0.5        0.73105858 0.88079708]


In [None]:
# In PyTorch
import torch

def sigmoid_torch(x):
    # Same formula as the NumPy version, applied to a tensor
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Custom Sigmoid Output:", sigmoid_torch(x))

# Using built-in PyTorch function
print("Torch Sigmoid Output:", torch.sigmoid(x)) # torch.sigmoid(x) is optimized for NNs
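
To see why sigmoid is associated with vanishing gradients, it helps to look at its derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, which peaks at 0.25 when $x = 0$. The sketch below evaluates it at a few points; `sigmoid_grad` is a helper name introduced here, and the depth loop is a simplified picture that ignores the weight terms in backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 (at x = 0)
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("Sigmoid gradient:", sigmoid_grad(x))

# Backpropagation multiplies one such factor per layer. Even in the best
# case (0.25 each), the product shrinks quickly as depth grows.
for depth in (5, 10, 20):
    print(depth, "layers ->", 0.25 ** depth)
```

The gradient is already tiny at $x = \pm 10$, which is the saturation effect described in the comparison table.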

### Softmax implementation:

In [4]:
# In numpy
import numpy as np

def softmax(x):
    # Exponentiate once, then normalize so the outputs sum to 1
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x)

x = np.array([2.0, 1.0, 0.1])
print("Softmax Output:", softmax(x))

Softmax Output: [0.65900114 0.24243297 0.09856589]


In [1]:
import torch

def softmax_torch(x):
    # Exponentiate once, then normalize so the outputs sum to 1
    exp_x = torch.exp(x)
    return exp_x / torch.sum(exp_x)

x = torch.tensor([2.0, 1.0, 0.1])
print("Softmax Output:", softmax_torch(x))

print("Inbuilt softmax Output:", torch.softmax(x, dim=0))

Softmax Output: tensor([0.6590, 0.2424, 0.0986])
Inbuilt softmax Output: tensor([0.6590, 0.2424, 0.0986])
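
The hand-written softmax versions above exponentiate the raw inputs, which overflows once the values get large. A common workaround is to subtract the maximum before exponentiating, which leaves the result unchanged. The sketch below is one way to write that in NumPy; `softmax_stable` and its `axis` parameter are names chosen here for illustration.

```python
import numpy as np

def softmax_stable(x, axis=-1):
    # Shifting by the max does not change the result,
    # but it keeps exp() from overflowing
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

small = np.array([2.0, 1.0, 0.1])
large = np.array([1000.0, 1001.0, 1002.0])

print(softmax_stable(small))  # same values as the plain formula on moderate inputs
print(softmax_stable(large))  # finite, although exp(1000.0) on its own overflows

# Works row-wise on a batch of inputs as well
batch = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
print(softmax_stable(batch, axis=1).sum(axis=1))
```

In practice the built-in `torch.softmax` shown earlier is usually the simpler choice when working with tensors.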
