## Question 1

An activation function is a crucial component in artificial neural networks, serving as the mathematical operation applied to the input of a node or neuron in order to determine its output. The activation function introduces non-linearities to the network, allowing it to model complex relationships in the data.
In a neural network, each node or neuron receives input signals, performs a weighted sum of these inputs, and then applies the activation function to produce an output. This output is then passed on to the next layer of neurons as input. The activation function plays a key role in determining whether a neuron should be activated (i.e., contribute to the output) or not, based on the weighted sum of its inputs.
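
The computation described above can be sketched for a single neuron as follows. This is a minimal illustration, not a definitive implementation: the function names, the sample values, the bias term, and the choice of sigmoid as the activation are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = np.dot(weights, inputs) + bias  # pre-activation (weighted sum)
    return activation(z)

# Hypothetical values, chosen so that the weighted sum is exactly zero.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

y = neuron_output(x, w, b)  # sigmoid(0.0)
print(y)
```

Swapping `activation` for another function changes how the same weighted sum is turned into the neuron's output.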

## Question 2

An activation function is a mathematical operation applied to the input of a node or neuron in order to determine its output. The activation function introduces non-linearities to the network, allowing it to model complex relationships in the data. Common activation functions include :

1. Sigmoid Activation Function :

Maps the input to a range between 0 and 1. It's often used in the output layer of binary classification models.

It is mathematically given as: sigmoid(x) = 1/(1 + e^(-x))


2. Hyperbolic Tangent (tanh) :


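Similar to the sigmoid function, but it maps input values to a range between -1 and 1. Because its output is zero-centered and its gradient near zero is larger than that of sigmoid, it is often preferred over sigmoid in the hidden layers of neural networks. It still saturates for large inputs, so it reduces the vanishing gradient problem rather than removing it.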

Mathematically given as :
tanh(x) = (e^(2x) - 1)/(e^(2x) + 1)

3. Rectified Linear Unit (ReLU) :

This activation function outputs the input directly for positive values and zero for negative values. It has become popular due to its simplicity and effectiveness in training deep neural networks.

ReLU(x) = max(0,x)

4. Leaky ReLU :

Leaky ReLU allows a small, non-zero gradient when the input is negative, addressing the "dying ReLU" problem where neurons could become inactive during training.

LeakyReLU(x) = x if x > 0, otherwise αx, where α is a small positive constant (for example 0.01)
             
5. Parametric ReLU (PReLU) :

PReLU is similar to Leaky ReLU but allows the slope of the negative part to be learned during training.

6. Softmax Function :

The softmax function is often used in the output layer for multi-class classification problems. It converts raw output scores into probabilities, ensuring that the sum of probabilities for all classes is 1.
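
The functions listed above can be sketched in NumPy as follows. This is a hedged illustration: the function names, the default `alpha=0.01`, and the sample input are assumptions, and subtracting the maximum score inside `softmax` is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def sigmoid(x):
    """Maps inputs to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Maps inputs to (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Passes positive values through and sets negative values to zero."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but scales negative inputs by alpha instead of zeroing them.
    PReLU has the same form, with alpha learned during training."""
    return np.where(x > 0, x, alpha * x)

def softmax(scores):
    """Converts a 1-D vector of raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # stability trick; does not change the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)

x = np.array([-2.0, 0.0, 3.0])  # hypothetical inputs
for fn in (sigmoid, tanh, relu, leaky_relu, softmax):
    print(fn.__name__, fn(x))
```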

## Question 3

Activation functions play a crucial role in the training process and performance of a neural network. Here are some ways in which activation functions can impact the network :

1. Non-Linearity and Model Complexity : Activation functions introduce non-linearities to the network. This non-linearity is essential for the network to learn and model complex relationships in the data. Without non-linear activation functions, a neural network would essentially reduce to a linear model, limiting its capacity to represent intricate patterns.

2. Gradient Descent and Backpropagation : During training, neural networks use optimization algorithms like gradient descent to adjust the weights and biases to minimize the error. The choice of activation function affects the gradients computed during backpropagation, so activation functions that are differentiable (at least almost everywhere, as ReLU is) are preferable for gradient-based optimization. Some activation functions can contribute to vanishing or exploding gradient problems. Vanishing gradients occur when the gradients become very small during backpropagation, causing the network to learn very slowly or not at all. Exploding gradients occur when gradients become too large, leading to instability during training. Saturating functions like sigmoid and hyperbolic tangent are more prone to vanishing gradients, especially in deep networks, while ReLU and its variants help mitigate the problem.

3. Saturation and Output Range : An activation function saturates when its output flattens out toward a fixed limit for large-magnitude inputs, so that its gradient there is close to zero. Sigmoid and hyperbolic tangent, for example, saturate for extreme input values. This can result in very small weight updates and slow convergence. ReLU and its variants, by contrast, do not saturate for positive input values, allowing for a broader range of outputs.


4. Sparsity : ReLU induces sparsity in the network by setting the activations of neurons with negative inputs to zero. This sparsity can lead to more efficient representations and reduce the computational load. However, it's important to consider "dead neurons" (neurons that output zero for every input). This can occur when a large gradient update pushes a unit's weights so far that its weighted sum is negative for all inputs. The unit then receives zero gradient and stops learning.
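
The vanishing gradient effect mentioned in point 2 can be illustrated with a rough sketch. It multiplies only the activation derivatives along a chain of layers and ignores the weight matrices, which is a simplifying assumption. The depth and the pre-activation values are hypothetical.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

depth = 10  # hypothetical number of layers

# Best case for sigmoid: every pre-activation is 0, where the derivative peaks.
sigmoid_factor = np.prod(sigmoid_grad(np.zeros(depth)))

# ReLU with positive pre-activations passes the gradient through unchanged.
relu_factor = np.prod(relu_grad(np.full(depth, 2.0)))

print(sigmoid_factor, relu_factor)
```

Even in its best case, the sigmoid chain scales the gradient by 0.25 per layer, while the ReLU chain leaves it at 1 for active units.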

## Question 4

The sigmoid activation function, also known as the logistic function, is a common non-linear activation function used in artificial neural networks. It maps any real-valued number to the open interval (0, 1).

The mathematical representation of sigmoid function is :
   sigmoid(x) = 1/(1 + e^(-x))
   
The output of the sigmoid function is always between 0 and 1. This makes it useful in the output layer of a neural network for binary classification problems, where the goal is to produce a probability that the input belongs to a certain class. The sigmoid function is smooth and differentiable everywhere, which is beneficial for gradient-based optimization algorithms like gradient descent. The smoothness allows for the computation of derivatives during backpropagation, enabling the network to learn efficiently.
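
The derivative used during backpropagation has a simple closed form, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)). The sketch below checks that identity against a central finite difference. The step size `h` and the sample points are arbitrary choices made for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Closed-form derivative: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-4.0, 0.0, 4.0])  # hypothetical sample points
h = 1e-6                        # finite-difference step (arbitrary)

numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_derivative(x)

print(numeric)
print(analytic)
```

The derivative is largest at x = 0 and shrinks toward zero as the input moves away from it in either direction.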

###### Advantages :

1. The output of the sigmoid function can be interpreted as a probability. This is particularly useful in binary classification problems where the network's output can be seen as the probability of belonging to a particular class.

2. The sigmoid function has a smooth gradient, making it well-suited for gradient-based optimization algorithms. This smoothness facilitates stable and efficient training.


###### Disadvantages :

1. The sigmoid function tends to saturate for extreme input values, causing the gradient to become very small. This can lead to the vanishing gradient problem, where the weights of the network are updated very slowly, hindering learning. This is especially problematic in deep networks.

2. The output of the sigmoid function is 0.5 when the input is 0 and is always positive, so its activations are centered around 0.5 rather than zero. This can create issues during training when the inputs to the next layer need to be centered around zero.

3. Because sigmoid is not zero-centered, it can be a drawback for some optimization algorithms. Zero-centered activation functions (like tanh) tend to perform better with certain weight initialization methods.

## Question 5

The Rectified Linear Unit (ReLU) activation function is a non-linear function commonly used in artificial neural networks, especially in the hidden layers. Unlike the sigmoid function, which squashes input values to a range between 0 and 1, ReLU allows positive values to pass through unchanged while setting all negative values to zero. The mathematical form of the ReLU function is:

ReLU(x) = max(0,x)

In other words, if the input x is positive, the output is equal to x; if x is negative, the output is zero. Visually, the ReLU activation function looks like a linear function for positive values and zero for negative values.
The output of ReLU lies in the range [0, ∞) because it allows positive values to pass through unchanged and sets negative values to zero.
The output of sigmoid is in the range (0,1), making it useful for binary classification where the output can be interpreted as a probability.
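
A minimal sketch of ReLU and its derivative is shown below. The derivative is undefined at exactly x = 0. Using 0 there is a common convention and is an assumption of this example, as are the function names and the sample inputs.

```python
import numpy as np

def relu(x):
    """max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """1 for x > 0, 0 for x < 0; 0 is used at x == 0 by convention (assumption)."""
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.0, 2.0])  # hypothetical inputs
outputs = relu(x)
grads = relu_derivative(x)

print(outputs)
print(grads)
```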

## Question 6

ReLU (Rectified Linear Unit) and sigmoid are both activation functions used in artificial neural networks, but they differ in their mathematical form, output range, and some key properties. ReLU activation can be beneficial over sigmoid in the following ways:


1. The output of ReLU is in the range [0, ∞) because it allows positive values to pass through unchanged and sets negative values to zero. The output of sigmoid is in the range (0,1), making it useful for binary classification where the output can be interpreted as a probability.

2. ReLU introduces non-linearity to the network, allowing it to model complex relationships in the data. Sigmoid is also non-linear, but its output can be interpreted as probabilities and it squashes input values to a limited range.

3. ReLU helps mitigate the vanishing gradient problem, as its gradient for positive inputs is always 1. Sigmoid can suffer from the vanishing gradient problem, especially in deep networks, as its gradient becomes very small for extreme input values.

4. ReLU can introduce sparsity in the network, as some neurons may output zero for certain inputs. Sigmoid does not introduce sparsity, as its output is always in the range (0, 1).

5. The output of ReLU is not directly interpretable as a probability, so ReLU is typically used in hidden layers for feature learning, while sigmoid remains useful in the output layer.
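
Points 3 and 4 can be checked numerically with a small sketch. The random pre-activations, the seed, and the "large" input value of 10 are all hypothetical choices made for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)               # fixed seed, arbitrary choice
pre_activations = rng.standard_normal(1000)  # hypothetical pre-activations

# Sparsity: fraction of outputs that are exactly zero.
relu_zero_fraction = np.mean(relu(pre_activations) == 0.0)
sigmoid_zero_fraction = np.mean(sigmoid(pre_activations) == 0.0)

# Gradient at a large positive input, where sigmoid is saturated.
x_large = 10.0
sigmoid_grad = sigmoid(x_large) * (1.0 - sigmoid(x_large))
relu_grad = 1.0 if x_large > 0 else 0.0

print(relu_zero_fraction, sigmoid_zero_fraction)
print(sigmoid_grad, relu_grad)
```

With zero-mean inputs, roughly half of the ReLU outputs are zero, while sigmoid never outputs an exact zero for inputs of this size.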

## Question 7

Leaky ReLU is a variant of the Rectified Linear Unit (ReLU) activation function, designed to address some of the limitations of traditional ReLU, particularly the issue of "dead neurons" and the vanishing gradient problem. In Leaky ReLU, a small slope (usually a small positive constant, denoted as α) is assigned to the negative part of the input, allowing it to leak through. The mathematical form of Leaky ReLU is given by:

Leaky ReLU(x) = x if x > 0, otherwise αx
                 
Here α is a small positive constant, typically in the range of 0.01 to 0.3. Unlike traditional ReLU, which sets all negative values to zero, Leaky ReLU has a small, non-zero gradient for negative inputs. This keeps some gradient flowing through the unit and prevents neurons from becoming completely inactive during training.

The vanishing gradient problem occurs when the gradient becomes very small during backpropagation, particularly for deep networks. This can slow down or hinder the learning process. By allowing a small gradient for negative inputs, Leaky ReLU helps alleviate the vanishing gradient problem, especially in situations where ReLU might fail to propagate gradients effectively.
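
The difference in gradients for negative inputs can be sketched as follows. The value `alpha=0.01` and the all-negative pre-activations (standing in for a "dead" ReLU unit) are assumptions made for the example.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)

# Hypothetical unit whose pre-activation is negative for every input.
pre_activations = np.array([-5.0, -1.0, -0.1])

relu_grads = (pre_activations > 0).astype(float)       # all zeros: no learning signal
leaky_grads = leaky_relu_derivative(pre_activations)   # all alpha: small but non-zero

print(leaky_relu(pre_activations))
print(relu_grads, leaky_grads)
```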

## Question 8

The softmax activation function is commonly used in the output layer of neural networks for multi-class classification problems. Its primary purpose is to convert a vector of raw output scores (also known as logits) into a probability distribution over multiple classes. The softmax function ensures that the sum of the probabilities for all classes is equal to 1.

Key characteristics and purposes of the softmax activation function: 

1. The output of the softmax function can be interpreted as a probability distribution over the classes. Each element in the softmax output represents the probability of the input belonging to the corresponding class.

2. The softmax function normalizes the raw scores, ensuring that the probabilities sum to 1. This is crucial for making the output meaningful as probabilities.

3. Softmax is often paired with the cross-entropy loss function for training neural networks for classification tasks. The cross-entropy loss measures the difference between the predicted probabilities and the true distribution of class labels.

4. Softmax is commonly used in scenarios where there are more than two classes, making it suitable for multi-class classification problems.
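
For a vector of logits z, softmax is commonly written as softmax(z)_i = e^(z_i) / Σ_j e^(z_j). The sketch below pairs it with the cross-entropy loss for a single example. The logits, the true class index, and the function names are hypothetical, and subtracting the maximum logit is a standard numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Turns a 1-D vector of logits into a probability distribution."""
    shifted = logits - np.max(logits)  # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)

print(probs, probs.sum())
print(loss)
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.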

## Question 9

The hyperbolic tangent, often abbreviated as tanh, is an activation function commonly used in artificial neural networks. It is similar to the sigmoid activation function but maps the input to a range between -1 and 1. The mathematical form of the tanh function is: 

tanh(x) = (e^(2x) - 1)/(e^(2x) + 1)

The tanh function shares some properties with the sigmoid function, such as smoothness and differentiability. However, tanh is steeper around its center: its slope at 0 is 1, compared with 0.25 for sigmoid. Tanh is also zero-centered, which can be beneficial for optimization algorithms. Sigmoid outputs are always positive, which may cause issues like weight updates being biased in one direction during training.

The main difference between tanh and sigmoid is the output range: the output of sigmoid lies in (0, 1), while the output of tanh lies in (-1, 1).
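
The two functions are closely related: tanh is a rescaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks this identity numerically and compares the slopes at zero and the mean outputs over a symmetric set of inputs. The sample points are an arbitrary choice for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)  # hypothetical inputs, symmetric around 0

# tanh expressed through sigmoid.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0

# Slopes at the center: 1 - tanh(0)^2 and sigmoid(0) * (1 - sigmoid(0)).
tanh_slope_at_zero = 1.0 - np.tanh(0.0) ** 2
sigmoid_slope_at_zero = sigmoid(0.0) * (1.0 - sigmoid(0.0))

# Mean output over symmetric inputs: near 0 for tanh, near 0.5 for sigmoid.
mean_tanh = np.tanh(x).mean()
mean_sigmoid = sigmoid(x).mean()

print(np.allclose(tanh_from_sigmoid, np.tanh(x)))
print(tanh_slope_at_zero, sigmoid_slope_at_zero)
print(mean_tanh, mean_sigmoid)
```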