## 1

An activation function in artificial neural networks is a mathematical function applied to the output of each neuron (or unit) in a neural network. It determines whether a neuron should be activated (i.e., fire) based on the weighted sum of its inputs. Activation functions introduce non-linearity into the network, allowing it to learn complex patterns in the data.

Common activation functions include:

Sigmoid: Outputs values between 0 and 1, often used in the output layer for binary classification tasks.

Tanh: Outputs values between -1 and 1, similar to sigmoid but centered at zero.

ReLU (Rectified Linear Unit): Outputs the input directly if positive, otherwise outputs zero. It's widely used due to its simplicity and effectiveness in training deep neural networks.

## 2

Sigmoid Function (Logistic):

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Outputs values between 0 and 1.
Useful for binary classification tasks in the output layer.

Hyperbolic Tangent Function (Tanh):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Outputs values between -1 and 1.
Similar to sigmoid but zero-centered, often used in hidden layers.

Rectified Linear Unit (ReLU):

$$\mathrm{ReLU}(x) = \max(0, x)$$

Outputs the input if it's positive, otherwise outputs zero.
Simple and effective, commonly used in hidden layers.

Leaky ReLU:

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & \text{if } x > 0 \\ \alpha x, & \text{otherwise} \end{cases}$$

where $\alpha$ is a small constant (e.g., 0.01).

Addresses the "dying ReLU" problem where neurons could become inactive during training.
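
The four definitions above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the function names and the default `alpha=0.01` are choices made here, and this plain `sigmoid` may emit overflow warnings for very large negative inputs.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, delegated to NumPy's implementation."""
    return np.tanh(x)

def relu(x):
    """ReLU: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x where x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])  # sample inputs chosen for illustration
print("sigmoid:   ", sigmoid(x))
print("tanh:      ", tanh(x))
print("relu:      ", relu(x))
print("leaky_relu:", leaky_relu(x))
```

Each function works element-wise on arrays, so the same code applies to a whole layer of pre-activations at once.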

## 3

Non-linearity: Activation functions introduce non-linearity into the network, allowing it to learn and approximate complex mappings between inputs and outputs. Without non-linear activation functions, a neural network would behave like a linear model, unable to capture intricate patterns in data.
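
A small sketch of why the non-linearity matters, using a hypothetical two-layer network with hand-picked weights (`W1`, `W2` are illustrative, not taken from any real model). Without an activation the two layers collapse into the single matrix `W2 @ W1`; with ReLU in between, the same weights compute the absolute value, which no single linear layer can represent.

```python
import numpy as np

W1 = np.array([[1.0], [-1.0]])  # layer 1: maps x to (x, -x)
W2 = np.array([[1.0, 1.0]])     # layer 2: sums the two hidden units

def linear_net(x):
    # No activation: W2 @ (W1 @ x) equals (W2 @ W1) @ x, a single linear map.
    return W2 @ (W1 @ x)

def relu_net(x):
    # ReLU between the layers: relu(x) + relu(-x) equals |x|, which is not linear.
    return W2 @ np.maximum(0.0, W1 @ x)

for value in (-3.0, 0.5, 2.0):
    x = np.array([value])
    print(value, linear_net(x)[0], relu_net(x)[0])
```

Here `W2 @ W1` is the zero matrix, so the purely linear network outputs 0 for every input, while the ReLU version outputs the magnitude of the input.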

Gradient Propagation: During backpropagation, which is used to update weights in a neural network, the derivative of the activation function determines how error gradients are propagated through the network. Activation functions whose derivatives are well defined and cheap to compute (sigmoid and tanh are smooth everywhere; ReLU variants are differentiable everywhere except at zero) support stable and efficient gradient propagation, enabling effective learning.

Avoiding Vanishing and Exploding Gradients: Saturating activation functions (e.g., sigmoid and tanh) can suffer from the vanishing gradient problem: their derivatives become very small as inputs move away from the origin, and multiplying many such factors across layers leads to slow or stalled training. ReLU and its variants mitigate this issue because their derivative is 1 for positive inputs, which allows more effective gradient flow and typically helps training converge faster.
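
The effect of depth can be illustrated with simple arithmetic. The sketch below looks only at the activation-derivative factor in the chain rule and ignores the weight matrices, which is a simplifying assumption. It takes the best case for sigmoid (derivative 0.25 at `x = 0`) and the active case for ReLU (derivative 1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_sigmoid_grad = sigmoid_grad(0.0)  # largest value the sigmoid derivative takes
relu_active_grad = 1.0                # ReLU derivative for any positive input

# Product of activation derivatives along a path through `depth` layers.
for depth in (1, 5, 10, 20):
    print(depth, max_sigmoid_grad ** depth, relu_active_grad ** depth)
```

Even in this best case the sigmoid factor shrinks geometrically with depth, while the ReLU factor stays at 1 along any path of active units.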



## 4

Working of Sigmoid Activation Function:
Range: The sigmoid function outputs values between 0 and 1, which can be interpreted as probabilities. Specifically, $\sigma(x)$ tends towards 1 as $x$ tends to infinity, and towards 0 as $x$ tends to negative infinity.

Smoothness: It is a smooth function that is continuously differentiable, which allows for gradient-based optimization methods like gradient descent to be applied during training.

Output Interpretation: In neural networks, the sigmoid function is often used in the output layer for binary classification tasks. The output can be interpreted as the probability of a sample belonging to a particular class.

Advantages of Sigmoid Activation Function:
Output Interpretation: Outputs can be interpreted as probabilities, which is useful for binary classification tasks where the network needs to predict probabilities of class membership.

Smooth Gradient: The derivative of the sigmoid function is straightforward and is expressed in terms of the function itself, making gradient calculations easy during backpropagation.
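
The identity referred to here is σ'(x) = σ(x)(1 − σ(x)). As a quick sanity check, the sketch below compares that closed form against a central finite difference; the step size `h` and the sample grid are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative written in terms of the function itself: s * (1 - s).
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)  # central difference

# Largest disagreement between the numerical and closed-form derivatives.
print(np.max(np.abs(numeric - sigmoid_grad(x))))
```

Because the derivative reuses the forward-pass value `s`, backpropagation through a sigmoid needs no extra exponential.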

Disadvantages of Sigmoid Activation Function:
Vanishing Gradient: Sigmoid activations saturate and flatten when inputs are very large or very small, leading to vanishing gradients. This can slow down the learning process, especially in deep networks.

Not Zero-Centered: The outputs of the sigmoid function are not zero-centered (they range from 0 to 1), which can make optimization trickier compared to zero-centered functions like tanh.

Output Saturation: Sigmoid activations can saturate: when a neuron's output is stuck close to 0 or 1, its gradient is nearly zero, so its weights receive almost no update during backpropagation and the neuron learns very slowly or effectively stops learning.

Not Suitable for Hidden Layers: Due to the vanishing gradient problem and issues with saturation, sigmoid activations are less commonly used in hidden layers of deep neural networks compared to ReLU and its variants.

## 5

The Rectified Linear Unit (ReLU) activation function is a non-linear function widely used in deep learning and neural networks. It is defined as:

$$\mathrm{ReLU}(x) = \max(0, x)$$

Differences from Sigmoid Function:

Output Range: The sigmoid function outputs values strictly between 0 and 1, while ReLU outputs values in the range [0, ∞): zero for non-positive inputs and the input itself for positive inputs.

Linearity vs. Non-Linearity: The sigmoid function is non-linear throughout its range, whereas ReLU is piecewise linear (linear for $x > 0$ and zero for $x \le 0$).

Gradient Behavior: The sigmoid function has a smooth derivative and does not abruptly cut off gradients, but it suffers from vanishing gradients for extreme inputs. ReLU, on the other hand, has a derivative of 1 for $x > 0$ and zero for $x < 0$ (at $x = 0$ the derivative is undefined, and implementations conventionally use 0), which can lead to more efficient gradient propagation in deep networks.
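
The gradient behavior described above can be compared side by side. This is a minimal sketch; the helper names and sample inputs are illustrative, and `relu_grad` returns 0 at `x = 0` by convention.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # 1 for x > 0, 0 otherwise; the value at x == 0 is a convention, not a true derivative.
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("sigmoid gradient:", sigmoid_grad(x))
print("relu gradient:   ", relu_grad(x))
```

The sigmoid gradient peaks at 0.25 at the origin and is close to zero at ±10, whereas the ReLU gradient is exactly 1 for every positive input regardless of magnitude.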

## 6

Avoids Vanishing Gradient Problem:

Sigmoid activations can lead to vanishing gradients, especially for large positive or large negative input values, which slows down or prevents effective learning in deep networks. ReLU, on the other hand, does not suffer from this issue for positive inputs, as its gradient remains constant (1) for $x > 0$.
Faster Convergence:

ReLU typically leads to faster convergence during training compared to sigmoid. This is because ReLU's linear nature for positive inputs allows gradients to flow more freely through the network, facilitating quicker updates to network weights.
Sparse Activation:

ReLU promotes sparsity by zeroing out negative inputs. This sparsity can lead to more efficient computation and memory usage in neural networks, as fewer neurons are activated at any given time.
Computational Efficiency:

ReLU is computationally more efficient than sigmoid and other activation functions that involve complex mathematical operations (e.g., exponentials). ReLU involves simple operations like comparison and max function, making it faster to compute.
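
The sparsity claim can be illustrated with a small experiment. The sketch assumes zero-mean Gaussian pre-activations, which is a simplification; real layers will have different distributions depending on weights, biases, and data.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)  # hypothetical zero-mean pre-activations
activations = np.maximum(0.0, pre_activations)

# Fraction of units whose output is exactly zero after ReLU.
sparsity = np.mean(activations == 0.0)
print(f"fraction of zero activations: {sparsity:.2f}")
```

Under this assumption roughly half of the units output exactly zero, while a sigmoid applied to the same inputs would produce no exact zeros at all.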


## 7

"Leaky ReLU" is a variant of the Rectified Linear Unit (ReLU) activation function, which is commonly used in neural networks. Unlike the standard ReLU function, which sets all negative values to zero, Leaky ReLU outputs alpha * x for negative inputs, where alpha is a small constant (typically 0.01). Its gradient for negative inputs is therefore alpha, which is small but non-zero.

Here's how Leaky ReLU addresses the vanishing gradient problem:

Vanishing Gradient Problem: In deep neural networks during backpropagation, gradients can become very small (approaching zero) as they propagate backward through layers. This can hinder the training process, as small gradients lead to slow learning or no learning at all.

Leaky ReLU Solution: By allowing a small, non-zero gradient for negative inputs (unlike ReLU which sets them to zero), Leaky ReLU helps mitigate the vanishing gradient problem. This ensures that neurons in the network continue to receive updates even for negative inputs, which can help maintain and propagate gradients backward through the network during training.

Advantages:

Prevents Dead Neurons: Neurons with ReLU can become "dead" if they consistently output zero (for example after a large learning-rate update or because of a large negative bias), since their gradient is then zero and their weights stop updating. Leaky ReLU prevents this by always allowing some gradient flow.
Stable Learning: Because the gradient for negative inputs is small but never exactly zero, inactive units can still recover, which tends to make learning faster and more consistent.
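
The difference shows up directly in the derivatives. Below is a minimal sketch with illustrative helper names and the commonly used `alpha = 0.01`.

```python
import numpy as np

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise: a unit stuck at negative inputs gets no update.
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for positive inputs, alpha otherwise: negative inputs still pass a small gradient.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 2.0])
print("relu gradient:      ", relu_grad(x))
print("leaky relu gradient:", leaky_relu_grad(x))
```

For the two negative inputs the ReLU gradient is exactly zero, while the Leaky ReLU gradient is `alpha`, so the weights feeding those units continue to receive (small) updates.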

## 7

The softmax activation function is primarily used in multi-class classification problems where the goal is to output probabilities that sum up to 1 across all classes. Here's a breakdown of its purpose and common usage:

Purpose:

Probability Distribution: Softmax converts logits (raw predictions) into probabilities that represent the likelihood of each class being the correct one.
Output Interpretation: It ensures that the output values are non-negative and sum up to 1, making it suitable for interpreting the outputs as probabilities.
Usage:

Multi-class Classification: Softmax is commonly used as the final activation function in neural networks for multi-class classification tasks.
Output Layer: It is typically applied on the output layer of a neural network when the network is trained to predict multiple mutually exclusive classes.
Loss Calculation: Softmax is often paired with the categorical cross-entropy loss function to compute the loss based on the predicted probabilities compared to the true class labels.
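
A minimal sketch of softmax and its pairing with cross-entropy, assuming a single sample and an integer class label. The logits are made-up values, and subtracting the maximum logit is a standard numerical-stability trick added here, not something the text above prescribes.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def cross_entropy(probs, true_class):
    # Categorical cross-entropy for one sample with an integer class label.
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for three classes
probs = softmax(logits)
print("probabilities:", probs, "sum:", probs.sum())
print("loss if class 0 is correct:", cross_entropy(probs, true_class=0))
print("loss if class 2 is correct:", cross_entropy(probs, true_class=2))
```

The outputs are positive and sum to 1, and the loss is lower when the true class is the one the network assigned the highest probability.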

## 8

The hyperbolic tangent (tanh) activation function is another popular choice in neural networks, similar to the sigmoid function but with some distinct characteristics. Here's an overview of tanh and its comparison to the sigmoid function:

Definition:

The tanh function is defined as:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

It outputs values between -1 and 1, mapping any real-valued input to this range.
Properties:

Range: Outputs of tanh lie in the open interval (-1, 1), which means it is zero-centered (unlike sigmoid, which is centered around 0.5).
Symmetry: The function is symmetric around the origin (it is an odd function), i.e., $\tanh(-x) = -\tanh(x)$.
Smoothness: tanh is smooth and differentiable everywhere, similar to sigmoid.
Comparison with Sigmoid:

Range: Sigmoid outputs values between 0 and 1, making it suitable for binary classification tasks where the output represents a probability.
Zero-Centered: tanh is zero-centered, which can aid in faster convergence of the gradient descent algorithm during training, especially for deep neural networks.
Output Scaling: Outputs from tanh are scaled between -1 and 1, and its derivative peaks at 1 (at x = 0) compared with 0.25 for sigmoid. These stronger gradients can help mitigate the vanishing gradient problem to some extent, although tanh still saturates for inputs of large magnitude.
Practical Usage: tanh is often used in hidden layers of neural networks for tasks like image recognition and natural language processing, where inputs and outputs are not binary but range over a broader spectrum.
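
The relationship between the two functions can be made concrete: tanh is a shifted and rescaled sigmoid, tanh(x) = 2σ(2x) − 1. The sketch below checks this identity numerically on an arbitrary grid and compares the derivatives at the origin, where both are largest.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

# tanh as a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))

# Derivatives at x = 0: tanh'(x) = 1 - tanh(x)^2, sigmoid'(x) = s * (1 - s).
tanh_grad_at_0 = 1.0 - np.tanh(0.0) ** 2
sigmoid_grad_at_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))
print(tanh_grad_at_0, sigmoid_grad_at_0)
```

The maximum slope of tanh is four times that of sigmoid, which is one way to read the "stronger gradients" point in the comparison.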