**Q1. What is an activation function in the context of artificial neural networks?**

- In artificial neural networks, an activation function is a mathematical function applied to a neuron's pre-activation (the weighted sum of its inputs plus a bias) to produce the neuron's output. It controls how strongly the neuron responds to a given input. Activation functions introduce non-linearity into the network: without them, any stack of layers reduces to a single linear transformation, so the network could not learn complex patterns and relationships in the data.
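
The point about non-linearity can be checked numerically. The sketch below uses small, made-up weight matrices (`W1`, `W2`) and inputs chosen purely for illustration. It shows that two linear layers with no activation between them behave exactly like one linear layer, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

# Hypothetical inputs and weights, chosen only for illustration.
x = np.array([[1.0, -2.0],
              [0.5, 3.0]])
W1 = np.array([[1.0, -1.0],
               [2.0, 0.5]])
W2 = np.array([[1.0],
               [-1.0]])

# Two linear layers with no activation in between...
two_layers = (x @ W1) @ W2
# ...equal a single linear layer whose weights are W1 @ W2.
one_layer = x @ (W1 @ W2)
linear_collapses = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the result is no longer that linear map.
with_relu = np.maximum(0, x @ W1) @ W2
relu_differs = not np.allclose(with_relu, one_layer)
```

Here `linear_collapses` is `True` and `relu_differs` is `True`: the first input row produces negative hidden values, which the ReLU zeroes out.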

**Q2. What are some common types of activation functions used in neural networks?**

- Common activation functions include:
  1. Sigmoid: Maps the input to a value between 0 and 1, often used in binary classification problems.
  2. Tanh (Hyperbolic Tangent): Similar to sigmoid but maps the input to a value between -1 and 1.
  3. ReLU (Rectified Linear Unit): Sets negative input values to zero and leaves positive values unchanged, widely used in hidden layers.
  4. Leaky ReLU: Similar to ReLU but allows a small, non-zero gradient for negative inputs to prevent neurons from dying.
  5. Softmax: Converts raw scores into probabilities, commonly used in the output layer for multi-class classification tasks.

**Q3. How do activation functions affect the training process and performance of a neural network?**

- Activation functions affect both how a network trains and how well it performs. During backpropagation, the gradient reaching each layer is multiplied by the derivative of the activation at every layer above it, so an activation whose derivative is small (a saturating one) can cause vanishing gradients and slow convergence. The choice of activation also determines the output range of each layer, which influences how well the network handles different kinds of data and how easily it learns complex patterns and relationships.
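
As a rough illustration of the vanishing gradient effect, the sketch below multiplies one activation derivative per layer, as backpropagation does. It deliberately ignores the weight terms and assumes the same pre-activation `z` at every layer; `depth` and `z` are hypothetical values, not taken from any real model.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which never exceeds 0.25."""
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

depth = 10   # hypothetical number of stacked layers
z = 0.5      # hypothetical pre-activation, identical at every layer

# Product of activation derivatives across all layers
# (weights are left out to isolate the activation's contribution).
sigmoid_factor = sigmoid_grad(z) ** depth   # bounded above by 0.25 ** depth
relu_factor = 1.0 ** depth                  # ReLU derivative is 1 for z > 0
```

Under these assumptions the sigmoid factor falls below one millionth after ten layers, while the ReLU factor stays at 1.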

**Q4. How does the sigmoid activation function work? What are its advantages and disadvantages?**

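- The sigmoid function, σ(x) = 1 / (1 + e^(-x)), squashes any real input into the range (0, 1), so its output can be interpreted as a probability. Its advantages are that it is smooth and differentiable everywhere and that its output is easy to interpret. Its main disadvantage is saturation: for inputs of large magnitude the output flattens near 0 or 1 and the gradient becomes extremely small (it never exceeds 0.25, reached at x = 0), which leads to vanishing gradients and slow convergence. Its outputs are also not zero-centered, which can make optimization less efficient.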

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```
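
To make the saturation concrete, here is a minimal sketch that evaluates the sigmoid's derivative, σ(x)(1 − σ(x)), at a few sample inputs. The helper name `sigmoid_grad` and the sample points are chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_grad(x)   # largest at x = 0 (0.25), shrinking as x grows
```

The derivative peaks at 0.25 for `x = 0` and drops below `1e-4` by `x = 10`, which is why saturated sigmoid units pass almost no gradient backward.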

**Q5. What is the rectified linear unit (ReLU) activation function? How does it differ from the sigmoid function?**

- ReLU is defined as f(x) = max(0, x): it sets all negative inputs to zero and leaves positive values unchanged. Unlike the sigmoid function, which squashes inputs into (0, 1) and saturates at both ends, ReLU is unbounded above and does not saturate for large positive inputs. Its gradient is exactly 1 for positive inputs, which helps alleviate the vanishing gradient problem, and 0 for negative inputs.
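
A minimal NumPy sketch of ReLU and its gradient, in the same style as the other snippets. The function names are illustrative, and taking the derivative at exactly 0 to be 0 is a common convention assumed here, not something the definition fixes.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

def relu_grad(x):
    # Assumed convention: the derivative at exactly x = 0 is taken as 0.
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 100.0])
out = relu(x)         # negatives clipped to 0, positives unchanged
grad = relu_grad(x)   # 0 for x <= 0, 1 for x > 0 (no saturation at x = 100)
```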

**Q6. What are the benefits of using the ReLU activation function over the sigmoid function?**

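- ReLU typically converges faster because it does not saturate for positive inputs: its gradient there is a constant 1, which alleviates the vanishing gradient problem. It is also cheap to compute (a single comparison, with no exponential) and produces sparse activations, since every negative input maps to exactly zero.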

**Q7. Explain the concept of "leaky ReLU" and how it addresses the vanishing gradient problem.**

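- Leaky ReLU is similar to ReLU but uses a small slope (for example 0.01) for negative inputs instead of zero: f(x) = x for x > 0 and f(x) = αx otherwise. With standard ReLU, a neuron whose pre-activation is negative for all inputs receives zero gradient and stops learning (the "dying ReLU" problem). Because leaky ReLU keeps a small non-zero gradient for negative values, such neurons can still update, which addresses this particular form of vanishing gradient and helps when training deep networks.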

```python
def leaky_relu(x, alpha=0.01):
    # Valid for 0 <= alpha <= 1: the max picks x when x >= 0 and alpha * x otherwise.
    return np.maximum(alpha * x, x)
```
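
The difference from plain ReLU shows up in the gradient. The sketch below compares the two on a few sample inputs; `leaky_relu_grad` is a hypothetical helper written for this illustration.

```python
import numpy as np

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(np.asarray(x) > 0, 1.0, alpha)

x = np.array([-3.0, -0.1, 2.0])
relu_g = (x > 0).astype(float)   # plain ReLU: zero gradient for negative inputs
leaky_g = leaky_relu_grad(x)     # leaky ReLU: small but non-zero gradient
```

For the negative inputs, `relu_g` is 0 while `leaky_g` is `alpha`, so the weights feeding those neurons can still receive updates.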

**Q8. What is the purpose of the softmax activation function? When is it commonly used?**

- Softmax converts a vector of raw scores (logits) into probabilities by exponentiating each score and dividing by the sum of the exponentials, so every output is positive and the outputs sum to 1. It is commonly used in the output layer of neural networks for multi-class classification tasks, where it yields a probability distribution over the classes.

```python
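def softmax(x):
    # Subtracting the row-wise max leaves the result unchanged but
    # keeps np.exp from overflowing on large scores.
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)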
```
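
A short usage sketch, with the function repeated so the snippet runs on its own. The logits are made-up values; the second row is deliberately large, so a naive `np.exp` without the max subtraction would overflow.

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Two hypothetical examples, three classes each.
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])

probs = softmax(logits)
row_sums = probs.sum(axis=-1)       # each row sums to 1
predicted = probs.argmax(axis=-1)   # index of the most probable class per row
```

Each row of `probs` is a valid probability distribution, and the predicted classes are index 0 for the first row and index 2 for the second.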

**Q9. What is the hyperbolic tangent (tanh) activation function? How does it compare to the sigmoid function?**

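- Tanh has the same S shape as the sigmoid function but squashes inputs into the range (-1, 1). Because its outputs are zero-centered, the inputs passed to the next layer are not all positive, which tends to make gradient-based optimization better behaved than with the sigmoid. Its derivative also peaks at 1 rather than 0.25, so gradients shrink less per layer. Tanh still saturates for inputs of large magnitude, however, so it reduces the vanishing gradient problem in deep networks but does not eliminate it.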

```python
def tanh(x):
    return np.tanh(x)
```
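
The comparison with the sigmoid can be checked numerically. The sketch below verifies the identity tanh(x) = 2·sigmoid(2x) − 1 on a symmetric grid of sample points, then compares mean outputs and peak derivatives. The variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)   # symmetric sample points around 0

# tanh is a rescaled, shifted sigmoid.
identity_holds = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# On symmetric inputs, tanh outputs average to about 0 and sigmoid outputs to 0.5.
tanh_mean = np.tanh(x).mean()
sigmoid_mean = sigmoid(x).mean()

# Peak derivatives, both reached at x = 0.
tanh_grad_at_0 = 1 - np.tanh(0.0) ** 2                  # 1.0
sigmoid_grad_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))   # 0.25
```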