 1)

Activation functions determine whether a neuron should be activated based on its input. They introduce non-linearity into the model, enabling neural networks to learn and model complex data patterns.

Linear Activation Functions:

These apply a linear transformation to the input, for example f(x) = x (the identity).

Limitation: Linear functions cannot model complex data since they do not introduce non-linearity. Regardless of the number of layers, the output remains a linear function of the inputs, so the whole network is equivalent to a single linear layer.

Nonlinear Activation Functions:

Examples include ReLU, Sigmoid, and Tanh.

They enable the model to approximate complex functions and solve problems like image classification or natural language processing.

Advantage: Nonlinearity allows deep networks to learn hierarchical features.

Hidden layers are responsible for extracting features from data. Without nonlinearity, the network's capability to model complex relationships is limited. Nonlinear functions allow multiple layers to work together in extracting and learning intricate patterns.
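
The claim that stacked linear layers collapse into one can be checked numerically. The following is a minimal sketch with hand-picked, hypothetical weights (`W1`, `W2`, `b1`, `b2` are illustrative and not taken from any trained model): two layers without an activation give the same output as a single combined layer, while inserting a ReLU between them changes the result.

```python
import numpy as np

# Hypothetical weights and input, chosen by hand for illustration
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
x = np.array([1.0, 0.0])

# Two stacked layers with no activation function
stacked_out = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: W = W2 W1, b = W2 b1 + b2
W = W2 @ W1
b = W2 @ b1 + b2
collapsed_out = W @ x + b

print(np.allclose(stacked_out, collapsed_out))  # True: the two layers collapse into one

# With a ReLU between the layers, the collapsed layer no longer reproduces the output
relu_out = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
print(stacked_out, relu_out)
```

Here the hidden pre-activations are 1 and −1; the linear stack sums them to 0, whereas ReLU zeroes the negative one first and yields 1.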

 2)

**Sigmoid Activation Function**

The Sigmoid activation function is defined as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Characteristics:

Range: The output of the sigmoid function lies between 0 and 1.

Shape: It has an "S-shaped" curve (also called a logistic curve).

Gradient: The derivative is largest near x = 0, but it decreases rapidly as x moves away from 0.

Common Usage:

Binary Classification: Frequently used in the output layer of binary classification tasks, where the output is interpreted as a probability (e.g., spam vs. not spam).

Rarely used in hidden layers due to the vanishing gradient problem, which makes training deep networks difficult.
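
To make the vanishing gradient concrete, here is a small NumPy sketch of the sigmoid and its derivative, which can be written as f(x)(1 − f(x)). The function names `sigmoid` and `sigmoid_grad` and the sample inputs are illustrative choices, not part of any library.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(xs))       # values stay strictly between 0 and 1
print(sigmoid_grad(xs))  # peaks at 0.25 for x = 0, close to 0 at the extremes

# The derivative never exceeds 0.25, so the activation derivatives alone
# scale a gradient passing through 10 sigmoid layers by at most 0.25 ** 10
print(0.25 ** 10)  # roughly 9.5e-07
```

Because each sigmoid layer multiplies the backpropagated gradient by a factor of at most 0.25, the signal reaching early layers shrinks quickly as depth grows.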

**Rectified Linear Unit (ReLU) Activation Function**

The ReLU activation function is defined as:

$$f(x) = \max(0, x)$$

Characteristics:

Outputs x if x > 0, otherwise outputs 0.

Non-linearity: Introduces non-linear behavior, enabling networks to model complex relationships.

Efficiency: ReLU is computationally efficient as it involves simple thresholding.

Advantages:

Mitigates Vanishing Gradient: The gradient is 1 for all positive inputs, so it does not shrink as it is backpropagated through active units.

Sparsity: Outputs zero for negative values, resulting in sparse activation, which improves computational efficiency.

Efficient Computation: Simple and fast to compute.

Challenges:

Dying ReLU Problem: If a neuron's pre-activation is negative for every input (for example, after a large weight update pushes its weights or bias strongly negative), it always outputs zero. Its gradient is then also zero, so its weights stop updating and the neuron "dies".

Not Zero-Centered: Outputs are always non-negative, so they are not centered around zero, which can slow down optimization.
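
The ReLU behaviour described above, including the dying ReLU case, can be sketched in a few lines of NumPy. The neuron weights `w`, bias `b`, and the sample inputs are hypothetical values picked so that every pre-activation is negative; the derivative at exactly 0 is taken as 0 by convention.

```python
import numpy as np

def relu(x):
    """ReLU: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, else 0 (0 at x = 0 by convention)."""
    return (x > 0).astype(float)

print(relu(np.array([-1.0, 2.0])))       # negative input is clipped to 0
print(relu_grad(np.array([-1.0, 2.0])))  # gradient passes only for positive input

# Hypothetical neuron whose large negative bias keeps it inactive
w = np.array([0.5, -0.3])
b = -10.0
inputs = np.array([[1.0, 2.0], [3.0, -1.0], [0.2, 0.1]])
pre_activation = inputs @ w + b

print(relu(pre_activation))       # all zeros: the neuron never fires
print(relu_grad(pre_activation))  # all zeros: no gradient reaches w or b
```

Since the gradient is zero for every sample, gradient descent cannot move `w` or `b` back into a region where the neuron becomes active again.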

**Tanh Activation Function**

The Tanh activation function (hyperbolic tangent) is defined as:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Characteristics:

Range: Outputs values between −1 and 1.

Shape: It is also an "S-shaped" curve but symmetric around the origin.

Zero-Centered: Tanh outputs are centered around zero (tanh(0) = 0 and the function is odd), which can help optimization by keeping the activations passed to the next layer close to zero mean.

Purpose:

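Used in hidden layers to keep the activations passed to the next layer bounded and centered around zero, especially when the input data is zero-centered.

Often trains more efficiently than Sigmoid because its gradient near zero is steeper (maximum derivative of 1 versus 0.25 for Sigmoid).

Comparison Between Tanh and Sigmoid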

Range:

Sigmoid: Outputs between 0 and 1.

Tanh: Outputs between −1 and 1.

Zero-Centered:

Sigmoid: Not zero-centered, which can lead to slower learning.

Tanh: Zero-centered, making it more suitable for zero-centered data.

Gradient Magnitude:

Tanh has steeper gradients than Sigmoid near zero (maximum derivative 1 versus 0.25), which helps propagate error signals during backpropagation. Both functions still saturate for large |x|, so their gradients vanish there.
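
The comparison can be verified numerically. The sketch below evaluates both derivatives on a small grid (the grid `xs` is an arbitrary choice) and also checks the identity tanh(x) = 2·sigmoid(2x) − 1, which shows that Tanh is a rescaled and shifted Sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-4.0, 4.0, 9)  # -4, -3, ..., 4

sigmoid_grad = sigmoid(xs) * (1.0 - sigmoid(xs))  # derivative of sigmoid
tanh_grad = 1.0 - np.tanh(xs) ** 2                # derivative of tanh

# Peak derivatives, both reached at x = 0
print(sigmoid_grad.max(), tanh_grad.max())

# Tanh is a rescaled, shifted sigmoid
print(np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0))  # True
```

Printing the full `sigmoid_grad` and `tanh_grad` arrays shows that Tanh's advantage holds around zero; far from zero both derivatives are small, and Tanh's actually falls off faster.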

3)

**Significance of Activation Functions in Hidden Layers**

Activation functions are crucial in hidden layers because:

They allow the network to model nonlinear relationships, making it capable of solving complex tasks like image recognition or natural language processing.

Without them, the network would behave like a linear model, no matter how deep the architecture is.

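Because activation functions are differentiable (at least almost everywhere), they supply the gradients that backpropagation uses to adjust weights in response to errors, which lets the network capture important patterns in data.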

4)

1. Classification Problems

Classification tasks involve predicting discrete categories or class labels. The activation function ensures the output is in a format suitable for classification (e.g., probabilities).

a) Binary Classification:

Activation Function: Sigmoid

Reason: Sigmoid squashes the output to a range between 0 and 1, making it interpretable as a probability for a single class.

Example: Predicting whether an email is spam or not spam.

Output: A single neuron outputs a probability, which can be thresholded (e.g., > 0.5 for one class, ≤ 0.5 for the other).
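
As a minimal illustration of the thresholding step, assuming some hypothetical raw outputs (`logits`) from the final neuron before the sigmoid is applied:

```python
import numpy as np

# Hypothetical raw outputs of the single output neuron for three emails
logits = np.array([-2.0, 0.0, 3.0])

# Sigmoid turns each raw output into a probability of the positive class
probs = 1.0 / (1.0 + np.exp(-logits))

# Threshold at 0.5: 1 = spam, 0 = not spam
labels = (probs > 0.5).astype(int)
print(probs)
print(labels)  # [0 0 1]
```

A probability of exactly 0.5 falls into the "≤ 0.5" class here; the threshold itself can be moved when false positives and false negatives have different costs.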

b) Multi-Class Classification:

Activation Function: Softmax

Reason: Softmax converts raw scores (logits) into a probability distribution across multiple classes, ensuring the probabilities sum to 1.

Example: Classifying images into categories like "cat," "dog," or "bird."

Output: Each output neuron corresponds to a class, and the neuron with the highest probability represents the predicted class.
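
The softmax computation can be sketched as follows. This is an illustrative NumPy version, not the Keras implementation; the `logits` values are made up, and subtracting the maximum before exponentiating is a common trick to avoid overflow that does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D array of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

classes = ["cat", "dog", "bird"]
logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores

probs = softmax(logits)
predicted = classes[int(np.argmax(probs))]

print(probs, probs.sum())
print(predicted)  # cat
```

The predicted class is the one with the largest logit, since softmax preserves the ordering of the scores.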

2. Regression Problems

Regression tasks involve predicting continuous numeric values, such as house prices or stock prices. Here, the output layer must produce raw, unbounded numeric values.

Activation Function: Linear (No Activation)

Reason: A linear activation function outputs raw numeric values without any transformation, which is ideal for continuous data.

Example: Predicting house prices, where the output is an exact dollar value.

Output: The output neuron directly represents the predicted value.

In [1]:
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1)
y = (y - y.min()) / (y.max() - y.min())  # Min-max normalize the target to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale input data (fit the scaler on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Build and compile a model whose hidden layers use the given activation
def build_model(activation):
    model = Sequential([
        Input(shape=(X_train.shape[1],)),  # Explicit input layer instead of input_shape on Dense
        Dense(32, activation=activation),
        Dense(16, activation=activation),
        Dense(1, activation='linear')  # Output layer for regression
    ])
    model.compile(optimizer=Adam(learning_rate=0.01), loss='mse', metrics=['mae'])
    return model

# Train and compare models
activations = ['relu', 'sigmoid', 'tanh']
for act in activations:
    model = build_model(act)
    print(f"\nTraining model with {act} activation:")
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1, validation_data=(X_test, y_test))
    loss, mae = model.evaluate(X_test, y_test, verbose=0)
    print(f"{act.capitalize()} Activation - Test Loss: {loss:.4f}, Test MAE: {mae:.4f}")





Training model with relu activation:
Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 3s 32ms/step - loss: 0.1780 - mae: 0.3142 - val_loss: 0.0226 - val_mae: 0.1198
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0237 - mae: 0.1128 - val_loss: 0.0108 - val_mae: 0.0810
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.0076 - mae: 0.0698 - val_loss: 0.0064 - val_mae: 0.0650
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 0.0046 - mae: 0.0523 - val_loss: 0.0042 - val_mae: 0.0512
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0032 - mae: 0.0449 - val_loss: 0.0031 - val_mae: 0.0437
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 0.0022 - mae: 0.0366 - val_loss: 0.0025 - val_mae: 0.0394
Epoch 7/10
25/25 ━━━━━━━━━━━━━━━━━━━━

5)

Comparing the effects of the three activations on convergence and performance:

ReLU: Fast convergence, generally better for deep networks.

Sigmoid: Slower convergence due to vanishing gradients.

Tanh: Performs better than Sigmoid but still slower than ReLU.