# 🔹 Tanh (Hyperbolic Tangent) Activation Function

## 1. Theory

The **Tanh Activation Function** (Hyperbolic Tangent) is a non-linear activation that maps input values into the range **(-1, 1)**.  
It is a rescaled and shifted version of the sigmoid function, \( \tanh(x) = 2\sigma(2x) - 1 \), so its outputs are centered around zero.
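
The link to the sigmoid, written here as \( \sigma(x) = \frac{1}{1 + e^{-x}} \) (notation assumed for this note), can be checked in a few steps by multiplying the numerator and denominator of the tanh definition by \( e^{-x} \):

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\sigma(2x) - 1
$$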

---

### **Mathematical Formula**

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

- For large positive \( x \), \( \tanh(x) \to 1 \)
- For large negative \( x \), \( \tanh(x) \to -1 \)
- For \( x = 0 \), \( \tanh(0) = 0 \)
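
As a quick check of the formula and its limits, here is a minimal Python sketch using only the standard library. The helper names `tanh_from_definition` and `tanh_stable` are illustrative, not part of any library; `math.tanh` is the built-in reference.

```python
import math

def tanh_from_definition(x: float) -> float:
    """Evaluate tanh directly as (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_stable(x: float) -> float:
    """Algebraically equivalent form that only exponentiates a non-positive number."""
    t = math.exp(-2.0 * abs(x))   # lies in [0, 1], so it cannot overflow
    y = (1.0 - t) / (1.0 + t)
    return y if x >= 0 else -y

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  definition={tanh_from_definition(x):+.6f}  "
          f"stable={tanh_stable(x):+.6f}  math.tanh={math.tanh(x):+.6f}")
```

The direct definition raises `OverflowError` once `math.exp(x)` exceeds the float range (around \( x > 709 \)), which is why the rearranged form is included; in practice a library routine such as `math.tanh` is the usual choice.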

---

### **Derivative**

The derivative of \( \tanh(x) \) is:

$$
\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)
$$

- The derivative is highest at \( x = 0 \) (equals 1)
- For large \( |x| \), the derivative approaches 0, leading to **vanishing gradients**
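
The two points above can be seen numerically. The sketch below compares the analytic derivative with a central finite difference (used only as a sanity check); `tanh_grad` and `numeric_grad` are illustrative helper names.

```python
import math

def tanh_grad(x: float) -> float:
    """Analytic derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (0.0, 1.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  analytic={tanh_grad(x):.3e}  "
          f"numeric={numeric_grad(math.tanh, x):.3e}")
```

The printed values start at 1 for \( x = 0 \) and shrink rapidly as \( x \) grows, which is the saturation behind the vanishing-gradient problem.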

---

## 3. Advantages and Disadvantages

| **Aspect**          | **Advantages**                                                   | **Disadvantages**                                               |
|---------------------|------------------------------------------------------------------|----------------------------------------------------------------|
| **Range**           | Outputs lie in (-1, 1) and are zero-centered, which tends to speed up gradient-based learning. | Saturates for large \( \lvert x \rvert \), so gradients still vanish. |
| **Smoothness**      | Differentiable everywhere, supports gradient-based learning.     | Slightly more computationally expensive than ReLU.            |
| **Non-linearity**   | Allows networks to learn complex patterns.                       | Training may still be slower than with ReLU.                  |

---

## 4. Use Cases

- Used in **hidden layers** of neural networks (before ReLU became popular).
- Frequently applied in **RNNs (Recurrent Neural Networks)** for internal state representation.
- Used in **autoencoders** and **MLPs** when zero-centered activation is beneficial.

---

## 5. Sigmoid vs Tanh vs ReLU (Quick Comparison)

| **Feature**          | **Sigmoid**         | **Tanh**            | **ReLU**               |
|----------------------|----------------------|----------------------|------------------------|
| **Range**            | (0,1)               | (-1,1)              | [0, ∞)                 |
| **Zero-Centered?**   | ❌ No                | ✅ Yes               | ❌ No (outputs are non-negative) |
| **Vanishing Gradient?**| ✅ Yes             | ✅ Yes               | ❌ Less likely         |
| **Computation**      | Expensive (exp)     | Expensive (exp)     | Cheap                  |
| **Preferred Use**    | Output (binary prob)| Hidden layers (older)| Hidden layers (modern) |

---

## 6. Interview Questions and Answers

### **Q1: Why is Tanh preferred over Sigmoid in hidden layers?**
**Answer:**  
- Tanh outputs values in **(-1,1)**, which is **zero-centered**.  
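- Zero-mean activations keep the inputs to the next layer balanced around zero, which typically gives **faster convergence** than sigmoid.  
- Sigmoid outputs lie in (0,1) and are always positive, so the weight gradients of a downstream neuron all share the same sign, which causes zig-zagging updates and slower learning.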
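
To make the zero-centering argument concrete, the following sketch feeds the same zero-mean pre-activations through both functions and compares the mean output. The Gaussian inputs, the seed, and the sample size are assumptions chosen purely for illustration.

```python
import math
import random

def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)  # fixed seed so the run is repeatable
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

mean_tanh = sum(math.tanh(z) for z in pre_activations) / len(pre_activations)
mean_sigmoid = sum(sigmoid(z) for z in pre_activations) / len(pre_activations)

print(f"mean tanh output:    {mean_tanh:+.3f}")    # close to 0
print(f"mean sigmoid output: {mean_sigmoid:+.3f}")  # close to 0.5
```

With zero-mean inputs, the tanh outputs stay centered near 0 while the sigmoid outputs sit near 0.5, so the next layer always sees positive inputs in the sigmoid case.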

---

### **Q2: Does Tanh solve the Vanishing Gradient Problem?**
**Answer:**  
- No, it still suffers from vanishing gradients because its derivative approaches 0 for large \( |x| \).  
- However, it is **less prone** than sigmoid because its derivative peaks at 1 (at \( x = 0 \)), whereas the sigmoid's derivative peaks at only 0.25.
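
A rough way to see the difference is to multiply the local derivative across several layers. The sketch below is a deliberate simplification: it ignores weights and assumes every layer sees the same pre-activation, and `chained_grad` is a hypothetical helper name.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: s * (1 - s), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, at most 1."""
    return 1.0 - math.tanh(x) ** 2

def chained_grad(grad_fn, pre_activation: float, depth: int) -> float:
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth

for z in (0.0, 2.0):
    for depth in (1, 5, 10):
        print(f"z={z:.1f} depth={depth:2d}  "
              f"sigmoid={chained_grad(sigmoid_grad, z, depth):.3e}  "
              f"tanh={chained_grad(tanh_grad, z, depth):.3e}")
```

At \( z = 0 \) the sigmoid product shrinks by a factor of 4 per layer while the tanh product stays at 1; at \( z = 2 \) both shrink quickly, which is why tanh reduces the problem but does not remove it.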

---

### **Q3: In which modern networks is Tanh still used?**
**Answer:**  
- Tanh is still widely used in **RNNs (e.g., LSTM, GRU)**, where bounded activations help control exploding values.  
- It is less common in feedforward deep networks, where **ReLU** dominates.
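
The effect of a bounded activation in a recurrence can be sketched with a scalar vanilla RNN step. This is not an LSTM or GRU; the function `rnn_step`, the weights `w_h` and `w_x`, and the constant input sequence are all hypothetical values picked to make the contrast visible.

```python
import math

def rnn_step(h_prev: float, x: float, activation,
             w_h: float = 1.5, w_x: float = 1.0, b: float = 0.0) -> float:
    """One scalar recurrent update: h_t = activation(w_h * h_prev + w_x * x + b)."""
    return activation(w_h * h_prev + w_x * x + b)

def run(inputs, activation) -> float:
    """Apply rnn_step over a sequence, starting from h = 0."""
    h = 0.0
    for x in inputs:
        h = rnn_step(h, x, activation)
    return h

inputs = [1.0] * 20
h_tanh = run(inputs, math.tanh)        # state stays inside (-1, 1)
h_linear = run(inputs, lambda z: z)    # no squashing: state grows geometrically

print(f"final state with tanh:     {h_tanh:.4f}")
print(f"final state with identity: {h_linear:.1f}")
```

With a recurrent weight above 1, the unsquashed state grows geometrically, while the tanh state remains bounded regardless of sequence length.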

---

## ✅ Conclusion
The **Tanh** activation is an improvement over sigmoid for hidden layers because it is **zero-centered**, enabling better gradient flow.  
However, it is still susceptible to **vanishing gradients**, so modern architectures often use **ReLU** and its variants.

