<h1><center>Activation Function in Deep Learning</center></h1>



<h2>Q1. Explain the role of activation functions in neural networks. Compare and contrast linear and nonlinear
activation functions. Why are nonlinear activation functions preferred in hidden layers?
</h2>

<p><b>Answer: -</b></p>

<em>An activation function in Deep Learning (DL) is a mathematical function applied to the output of a neuron in a neural network to introduce non-linearity. This non-linearity is crucial because it allows the neural network to learn complex patterns and relationships in data, enabling it to solve more complicated tasks like image recognition, natural language processing, and others.

Activation functions determine if a neuron should be activated or not by transforming the weighted sum of inputs (linear combination) into an output that can be passed on to the next layer.</em>

Importance of Activation Functions
- Non-linearity: Activation functions allow neural networks to approximate complex, non-linear mappings between inputs and outputs.
- Learning Capacity: Without nonlinear activation functions, the network collapses to a single linear model, equivalent to linear regression, no matter how many layers it has. This limits its learning capacity.
- Gradient Flow: They impact how gradients flow through the network during backpropagation, affecting the network’s ability to learn effectively.

<h2>Comparison of Linear and Nonlinear Activation Functions</h2>

<table border="1" cellpadding="10" cellspacing="0">
    <tr>
        <th>Criteria</th>
        <th>Linear Activation Function</th>
        <th>Nonlinear Activation Function</th>
    </tr>
    <tr>
        <td><strong>Definition</strong></td>
        <td>Output is directly proportional to the input, represented as f(x) = x.</td>
        <td>Applies a nonlinear transformation, allowing complex relationships to be learned.</td>
    </tr>
    <tr>
        <td><strong>Characteristics</strong></td>
        <td>Output remains a simple linear function of input.</td>
        <td>Enables the model to capture intricate features with different types of transformations.</td>
    </tr>
    <tr>
        <td><strong>Limitations</strong></td>
        <td>Lacks the ability to capture complex data patterns. Multiple linear layers act as a single linear layer.</td>
        <td>Provides flexibility for the model to approximate complex functions, but saturating functions such as Sigmoid and Tanh may introduce vanishing gradient issues.</td>
    </tr>
    <tr>
        <td><strong>Use Case</strong></td>
        <td>Commonly used in output layers for linear regression or continuous outputs.</td>
        <td>Preferred in hidden layers to enable the network to learn non-linear patterns.</td>
    </tr>
    <tr>
        <td><strong>Stacking Effect</strong></td>
        <td>Stacking multiple linear layers results in an equivalent single linear layer.</td>
        <td>Stacking layers with nonlinear functions allows the model to build hierarchical features.</td>
    </tr>
</table>

<h3>Why Nonlinear Activation Functions Are Preferred in Hidden Layers</h3>
Nonlinear functions allow hidden layers to capture complex relationships in data. With enough hidden units, a network with nonlinear activations can approximate any continuous function on a bounded input range, not just linear mappings. With only linear activations, every stack of hidden layers collapses into a single linear transformation, so added depth contributes nothing.
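
The stacking behaviour described in the comparison table can be checked numerically. The following is a minimal NumPy sketch, not part of the original answer: the weight names (`W1`, `b1`, `W2`, `b2`) and the random values are assumptions chosen only for illustration. It shows that two layers with a linear (identity) activation equal one combined layer, while placing ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for a 3 -> 4 -> 2 network (values are arbitrary).
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 3))  # batch of 5 inputs

# Two layers with a linear (identity) activation.
two_linear_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_linear_layer = x @ W_combined + b_combined

# With ReLU between the layers, the network is no longer a single linear layer.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2

collapses = np.allclose(two_linear_layers, one_linear_layer)
relu_differs = not np.allclose(with_relu, one_linear_layer)
print("linear stack equals one layer:", collapses)
print("ReLU stack differs from that layer:", relu_differs)
```

The combined weights follow from expanding `(x @ W1 + b1) @ W2 + b2`, which is why extra linear layers add parameters but no extra expressive power.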



<h3>Q2. Describe the Sigmoid activation function. What are its characteristics, and in what type of layers is it
commonly used? Explain the Rectified Linear Unit (ReLU) activation function. Discuss its advantages
and potential challenges. What is the purpose of the Tanh activation function? How does it differ from
the Sigmoid activation function?
</h3>

<p><b>Answer</b></p>
<h4 style="color:green;">Sigmoid Activation Function</h4>
The Sigmoid activation function is defined by the formula:

- f(x) = 1 / (1 + e<sup>-x</sup>)

It produces an S-shaped curve, which maps any real-valued number into a range between 0 and 1.

- Characteristics:

    - Range: 0 to 1
    - Output Interpretation: The output can be interpreted as a probability, making it ideal for binary classification.
    - Non-linearity: It introduces non-linearity, allowing the network to learn complex relationships.
    - Smoothness: The function is smooth and differentiable, which makes it useful for backpropagation.
- Common Usage:

    - Sigmoid is commonly used in the output layer of binary classification models since it outputs values between 0 and 1.
    - In hidden layers, Sigmoid is less commonly used because it saturates for large positive or negative inputs, which leads to the vanishing gradient problem.
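
To make the formula and the saturation behaviour concrete, here is a small sketch using only the Python standard library. The helper names `sigmoid` and `sigmoid_grad` are illustrative choices, not taken from any library. The derivative s(x)(1 - s(x)) peaks at 0.25 when x = 0 and shrinks toward zero as |x| grows, which is the source of the vanishing gradient issue.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, computed in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow in exp(-x) for very negative x
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```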

<h4 style="color:green;">Rectified Linear Unit (ReLU) Activation Function</h4>
The ReLU activation function is defined as:

- f(x)=max(0,x)

This means that it outputs zero for any negative input and outputs the input itself for any positive input.

- Characteristics:

    - Range: 0 to infinity (zero for negative inputs, unbounded for positive inputs)
    - Non-linearity: Allows the network to capture complex features in data by adding non-linearity.
    - Sparse Activation: ReLU outputs exactly zero for every negative input, so activations are sparse. When the inputs to a layer are roughly symmetric around zero, about half of the units are inactive.
- Advantages:

    - Efficient Computation: ReLU is computationally efficient because it only involves a simple thresholding.
    - Reduced Vanishing Gradient: Unlike Sigmoid and Tanh, ReLU does not saturate for positive inputs (its gradient there is exactly 1), which helps alleviate the vanishing gradient problem and allows gradients to propagate through deep networks more effectively.
    - Promotes Sparse Representations: Many neurons in ReLU networks output zero for a given input, so only a subset of units is active at a time.
- Potential Challenges:

    - Dying ReLU Problem: Neurons can "die" during training, especially with a high learning rate. If a neuron's weighted input becomes negative for every training example, it always outputs zero, its gradient is zero, and its weights stop updating.
    - Not Suitable for All Models: ReLU's output is unbounded and never negative, so bounded functions such as Tanh or Sigmoid may perform better where activations must stay in a fixed range, for example in some generative or recurrent models.
- Common Usage:

    - ReLU is the default activation function in hidden layers for most deep learning models, especially in convolutional neural networks (CNNs) and other architectures requiring efficient computation.
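
The sparsity and dying-ReLU points can be illustrated with a short standard-library sketch. The sample data is an assumption made for demonstration: pre-activations are drawn from a zero-mean Gaussian, and the "dead" unit is simulated with a hypothetical large negative bias. The gradient at exactly zero is set to 0, which is a common convention rather than a mathematical requirement.

```python
import random

def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

random.seed(0)
# Hypothetical pre-activations drawn from a zero-mean Gaussian.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(z) for z in pre_activations]

zero_fraction = sum(a == 0.0 for a in activations) / len(activations)
print(f"fraction of zero activations: {zero_fraction:.3f}")

# A "dead" unit: a large negative bias keeps every pre-activation negative,
# so both the output and the gradient are zero for all inputs.
dead_bias = -100.0
dead_grads = [relu_grad(z + dead_bias) for z in pre_activations]
print("dead unit receives any gradient:", any(g > 0 for g in dead_grads))
```

Because the dead unit's gradient is zero for every sample, gradient descent has no signal to move its weights back into the active region.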

<h4 style="color:green;">Tanh Activation Function</h4>    
The Tanh activation function, short for hyperbolic tangent, is defined as:

- f(x) = (e<sup>x</sup> - e<sup>-x</sup>) / (e<sup>x</sup> + e<sup>-x</sup>)

This function produces an S-shaped curve that maps input values into a range between -1 and 1.

- Characteristics:

    - Range: -1 to 1
    - Output Interpretation: Centered around zero, meaning that the output can be both positive and negative, which is advantageous for certain learning tasks.
    - Non-linearity: Like Sigmoid, Tanh is non-linear and differentiable, helping in capturing complex relationships.
- Purpose and Advantages:

    - Centered Output: The zero-centered output is beneficial for many algorithms as it enables faster convergence during training compared to Sigmoid.
    - Strong Gradient: Tanh has a steeper gradient than Sigmoid (a maximum slope of 1 at x = 0, versus 0.25 for Sigmoid), allowing for stronger updates. This is especially useful in recurrent neural networks (RNNs), where Tanh is commonly used.
- Differences from Sigmoid:

    - Range: Sigmoid outputs between 0 and 1, while Tanh outputs between -1 and 1; the zero-centered range helps with faster convergence.
    - Symmetry: Tanh is symmetric around zero, making it suitable where activations should take both negative and positive values, whereas Sigmoid is a better fit for probabilities or strictly positive outputs.
- Common Usage:

    - Tanh is frequently used in hidden layers of RNNs and sometimes in other types of neural networks when zero-centered outputs are beneficial for faster training.
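
The relationship between the two functions can be made explicit: Tanh is a rescaled and shifted Sigmoid, tanh(x) = 2·sigmoid(2x) - 1. The sketch below, which uses only the standard library and illustrative helper names, checks that identity against `math.tanh` and compares the slopes of the two functions at x = 0.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_definition(x: float) -> float:
    """tanh written directly from the exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0  # identity: tanh(x) = 2*sigmoid(2x) - 1
    print(f"x={x:5.1f}  tanh={math.tanh(x):8.5f}  2*sigmoid(2x)-1={rescaled:8.5f}")

print("slope at 0: tanh =", tanh_grad(0.0), " sigmoid =", sigmoid_grad(0.0))
```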

<h3>Q3. Discuss the significance of activation functions in the hidden layers of a neural network.</h3>
<p><b>Answer.</b></p>
Activation functions in the hidden layers of a neural network are essential for enabling the network to learn complex patterns and perform well on non-linear tasks. Here’s a breakdown of their significance:

- Activation functions add non-linearity to the network, allowing it to learn and model complex patterns that a simple linear function cannot.
- Without non-linear activation functions, the entire network would act as a single linear transformation, no matter how many layers it has. This would significantly limit the network's ability to capture intricate relationships in data.
- Activation functions influence how gradients propagate back through the network during backpropagation, which is essential for updating weights.
- Some activation functions, like ReLU, mitigate issues like the vanishing gradient problem, allowing gradients to flow more effectively, which is crucial in training deeper networks.
- Certain activation functions, such as ReLU, output zero for negative values, which introduces sparsity into the network. Sparsity can act as a form of regularization, reducing overfitting by making the network focus on essential features rather than every input detail.
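
The gradient-flow point can be sketched with a deliberately simplified calculation. Backpropagation multiplies the gradient by the activation's derivative at every layer; the example below ignores the weights and assumes every layer sees the same hypothetical pre-activation `z`, so it isolates the activation's contribution only. The function name `gradient_scale` is illustrative.

```python
import math

def sigmoid_grad(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z: float) -> float:
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z: float) -> float:
    return 1.0 if z > 0 else 0.0

def gradient_scale(grad_fn, z: float, depth: int) -> float:
    """Product of `depth` activation derivatives, all evaluated at z.

    Simplifying assumption: weights are ignored and every layer sees the
    same pre-activation z.
    """
    return grad_fn(z) ** depth

z = 1.0  # hypothetical positive pre-activation
for depth in (1, 5, 10, 20):
    print(
        f"depth={depth:2d}  "
        f"sigmoid={gradient_scale(sigmoid_grad, z, depth):.3e}  "
        f"tanh={gradient_scale(tanh_grad, z, depth):.3e}  "
        f"relu={gradient_scale(relu_grad, z, depth):.3e}"
    )
```

Under these assumptions the Sigmoid and Tanh factors shrink geometrically with depth, while the ReLU factor stays at 1 for active units. In a real network the weights also enter the product, so this is an illustration of the trend rather than a prediction.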

<h3>Q4. Explain the choice of activation functions for different types of problems (e.g., classification,
regression) in the output layer.</h3>
<p><b>Answer: -</b></p>
<table border="1" cellpadding="10" cellspacing="0">
    <tr>
        <th>Problem Type</th>
        <th>Common Activation Function</th>
        <th>Reason for Choice</th>
    </tr>
    <tr>
        <td><strong>Binary Classification</strong></td>
        <td>Sigmoid</td>
        <td>Outputs a probability value between 0 and 1, making it ideal for binary classification tasks where we need a probability or binary output.</td>
    </tr>
    <tr>
        <td><strong>Multi-class Classification</strong></td>
        <td>Softmax</td>
        <td>Softmax provides a probability distribution across multiple classes, ensuring that the sum of probabilities for all classes is 1. It is ideal for tasks with multiple classes.</td>
    </tr>
    <tr>
        <td><strong>Regression (Single Output)</strong></td>
        <td>None or Linear</td>
        <td>No activation (or a linear function) is used for regression tasks with continuous outputs, as it allows unrestricted output values suitable for regression.</td>
    </tr>
    <tr>
        <td><strong>Regression (Multiple Outputs)</strong></td>
        <td>None or Linear</td>
        <td>Similar to single-output regression, no activation function is used to allow the network to output any real-valued numbers for each target variable.</td>
    </tr>
    <tr>
        <td><strong>Multi-label Classification</strong></td>
        <td>Sigmoid (per label)</td>
        <td>Sigmoid is applied independently to each label, outputting a probability for each. This is useful when each sample can belong to multiple labels.</td>
    </tr>
    <tr>
        <td><strong>Ordinal Regression</strong></td>
        <td>Softmax (with ordered classes)</td>
        <td>Softmax may be used to model probabilities of ordered classes, but with specific loss functions that handle order in the outputs.</td>
    </tr>
</table>
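
The difference between the output-layer choices in the table can be seen by applying them to the same raw outputs. This is a minimal standard-library sketch; the `logits` values are made up for illustration, and the helper functions are not taken from any library.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 1.0, -1.0]  # hypothetical raw outputs of the last layer

multi_class = softmax(logits)               # one distribution over all classes
multi_label = [sigmoid(v) for v in logits]  # independent probability per label
regression = list(logits)                   # linear output: values left unchanged

print("softmax:", [round(p, 3) for p in multi_class], "sum =", round(sum(multi_class), 6))
print("sigmoid per label:", [round(p, 3) for p in multi_label], "sum =", round(sum(multi_label), 3))
print("linear:", regression)
```

Softmax probabilities always sum to 1, which suits mutually exclusive classes. Per-label Sigmoid outputs are independent and need not sum to 1, which suits multi-label problems.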

<h3>Q5.  Experiment with different activation functions (e.g., ReLU, Sigmoid, Tanh) in a simple neural network
architecture. Compare their effects on convergence and performance.</h3>
<p><b>Answer: -</b></p>

In [None]:
import tensorflow as tf 
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist 
from keras.utils import to_categorical
import time
import warnings
warnings.filterwarnings('ignore')

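# load_data() returns two (images, labels) tuples: one for training, one for testing
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train, x_test = x_train/255.0, x_test/255.0 # Normalize pixel values to [0, 1]
y_train, y_test = to_categorical(y_train), to_categorical(y_test) # One-hot encode the labels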

def build_and_train_model(activation_function:str):
    model=Sequential([
        Flatten(input_shape=(28,28)),
        Dense(128, activation=activation_function),
        Dense(64, activation=activation_function),
        Dense(10, activation='softmax')
        # softmax is used for multi-class classification problem.
    ])
    
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    
    start_time=time.time()
    history=model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test), verbose=0)
    training_time = time.time()-start_time
    
    test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
    
    return training_time, test_accuracy, history


# Experiment with ReLU, Sigmoid, and Tanh activation functions

activation_functions=['relu', 'sigmoid', 'tanh']
results={}

for activation in activation_functions:
    print(f"Training with {activation} activation function...")
    training_time, test_accuracy, history = build_and_train_model(activation)
    results[activation] = {
        'Training Time': training_time,
        'Test Accuracy': test_accuracy,
        'History': history.history
    }
    print(f"{activation.capitalize()} - Training Time: {training_time:.2f}s, Test Accuracy: {test_accuracy:.4f}\n")

# Display results summary
for activation, metrics in results.items():
    print(f"Activation Function: {activation.capitalize()}")
    print(f"Training Time: {metrics['Training Time']:.2f}s")
    print(f"Test Accuracy: {metrics['Test Accuracy']:.4f}\n")

# Visualize convergence over epochs
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 8))
for activation in activation_functions:
    plt.plot(results[activation]['History']['val_accuracy'], label=f"{activation.capitalize()}")

plt.title("Validation Accuracy per Epoch for Different Activation Functions")
plt.xlabel("Epoch")
plt.ylabel("Validation Accuracy")
plt.legend()
plt.show()