This notebook compares a neural network that uses an activation function in its hidden layer with one that does not.

Activation functions are a critical component in neural networks, enabling them to model complex patterns and learn non-linear relationships. They determine the output of a neuron, effectively deciding whether the neuron should be activated or not based on its input.


1. **Sigmoid Function**:
   - **Formula**: 
     \[
     \sigma(x) = \frac{1}{1 + e^{-x}}
     \]
   - **Range**: \( (0, 1) \)
   - **Characteristics**:
     - Converts any input to a value between 0 and 1.
     - Historically used in the hidden layers of early neural networks; still common at the output layer for binary classification.
   - **Drawbacks**:
     - **Vanishing Gradient Problem**: The gradient approaches zero for large positive or negative \( x \), which slows learning or stalls it entirely.
     - **Output Not Zero-Centered**: Outputs are always positive, so the gradients for all weights feeding a neuron in the next layer share the same sign, which can cause zig-zagging updates and slower convergence.


In [None]:
import torch
import torch.nn as nn

# Example input tensor
x = torch.tensor([-1.0, 0.0, 1.0])

# Using the Sigmoid activation function
sigmoid = nn.Sigmoid()
output = sigmoid(x)

print("Sigmoid Output:", output)
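
The vanishing-gradient drawback can be checked numerically. The sketch below uses autograd to evaluate the sigmoid's derivative, \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \), at one point near zero and two points deep in the saturated tails. The names `z` and `sigmoid_grad` are illustrative and chosen so they do not overwrite the notebook's `x`.

```python
import torch

# Probe points: one at zero, two far out in the saturated tails.
z = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

# Summing the outputs lets a single backward() call produce
# d(sigmoid)/dz for every element of z.
torch.sigmoid(z).sum().backward()
sigmoid_grad = z.grad

print("Sigmoid gradient:", sigmoid_grad)
```

The derivative peaks at 0.25 for \( x = 0 \) and is several orders of magnitude smaller at \( x = \pm 10 \), which is why saturated sigmoid units pass almost no gradient backward.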


2. **Tanh (Hyperbolic Tangent) Function**:
   - **Formula**: 
     \[
     \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
     \]
   - **Range**: \( (-1, 1) \)
   - **Characteristics**:
     - Similar to the sigmoid function but with outputs ranging between -1 and 1.
     - Zero-centered output, which tends to improve convergence during training.
   - **Drawbacks**:
     - Still suffers from the vanishing gradient problem, though to a lesser extent than the sigmoid function.


In [None]:
# Using the Tanh activation function
tanh = nn.Tanh()
output = tanh(x)

print("Tanh Output:", output)
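
The similarity to the sigmoid is exact: tanh is a rescaled and shifted sigmoid, \( \tanh(x) = 2\sigma(2x) - 1 \). The following sketch checks this identity on a small symmetric grid and confirms that the outputs average to roughly zero. The variable names are illustrative.

```python
import torch

# A symmetric grid of inputs around zero.
z = torch.linspace(-3.0, 3.0, steps=7)

tanh_direct = torch.tanh(z)
# Same function written in terms of the sigmoid.
tanh_from_sigmoid = 2 * torch.sigmoid(2 * z) - 1

print("Max difference:", (tanh_direct - tanh_from_sigmoid).abs().max().item())
print("Mean of tanh outputs:", tanh_direct.mean().item())
```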



3. **ReLU (Rectified Linear Unit) Function**:
   - **Formula**: 
     \[
     \text{ReLU}(x) = \max(0, x)
     \]
   - **Range**: \( [0, \infty) \)
   - **Characteristics**:
     - Computationally efficient and simple to implement.
     - Introduces non-linearity, allowing the model to learn complex patterns.
     - Widely used in modern neural networks, especially in deep learning.
   - **Drawbacks**:
     - **Dying ReLU Problem**: A neuron "dies" when its pre-activation is negative for every input, so it always outputs zero, receives zero gradient, and stops learning.
     - A learning rate that is too high makes this more likely, because a single large update can push a neuron's weights into that regime.

In [None]:
# Using the ReLU activation function
relu = nn.ReLU()
output = relu(x)

print("ReLU Output:", output)
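
To make the dying ReLU problem concrete, here is a minimal sketch of a single "dead" unit. The bias of -100 is an artificial assumption used to force every pre-activation negative; `dead_layer` and `batch` are hypothetical names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One linear unit whose bias is pushed far negative, so the
# pre-activation is below zero for every input in the batch.
dead_layer = nn.Linear(4, 1)
with torch.no_grad():
    dead_layer.bias.fill_(-100.0)

batch = torch.randn(8, 4)
dead_out = torch.relu(dead_layer(batch))
dead_out.sum().backward()

print("Outputs all zero:", bool((dead_out == 0).all()))
print("Weight gradient:", dead_layer.weight.grad)
```

Because the output is zero everywhere, the gradient with respect to the weights and bias is also zero, so gradient descent has no signal to revive the unit.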


4. **Leaky ReLU**:
   - **Formula**: 
     \[
     \text{Leaky ReLU}(x) = \begin{cases} 
     x & \text{if } x > 0 \\
     \alpha x & \text{if } x \leq 0 
     \end{cases}
     \]
     where \( \alpha \) is a small positive constant (e.g., 0.01).
   - **Range**: \( (-\infty, \infty) \)
   - **Characteristics**:
     - Addresses the dying ReLU problem by allowing a small, non-zero gradient when the input is negative.
     - Helps keep neurons active during training.
   - **Variants**:
     - **Parametric ReLU (PReLU)**: Similar to Leaky ReLU, but \( \alpha \) is a learnable parameter.


In [None]:
# Using the Leaky ReLU activation function
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
output = leaky_relu(x)

print("Leaky ReLU Output:", output)
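
The PReLU variant can be sketched with `nn.PReLU`, which stores \( \alpha \) as a learnable parameter named `weight`. The initial value 0.25 used here is PyTorch's default; the input values are illustrative.

```python
import torch
import torch.nn as nn

z = torch.tensor([-2.0, 0.0, 2.0])

# A single learnable slope shared across all inputs.
prelu = nn.PReLU(num_parameters=1, init=0.25)
prelu_out = prelu(z)

# Backpropagate to show that alpha itself receives a gradient,
# so an optimizer can adjust the negative slope during training.
prelu_out.sum().backward()

print("PReLU Output:", prelu_out)
print("Gradient w.r.t. alpha:", prelu.weight.grad)
```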


5. **ELU (Exponential Linear Unit)**:
   - **Formula**: 
     \[
     \text{ELU}(x) = \begin{cases} 
     x & \text{if } x > 0 \\
     \alpha (e^x - 1) & \text{if } x \leq 0 
     \end{cases}
     \]
   - **Range**: \( (-\alpha, \infty) \) where \( \alpha \) is a hyperparameter.
   - **Characteristics**:
     - ELU is designed to bring the average of the activations closer to zero, which can speed up learning.
     - Helps reduce the bias shift effect and maintains non-linear properties.
   - **Drawbacks**:
     - More computationally expensive than ReLU due to the exponential operation.


In [None]:
# Using the ELU activation function
elu = nn.ELU(alpha=1.0)
output = elu(x)

print("ELU Output:", output)
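
The lower bound of the range can be observed directly: as the input becomes more negative, the ELU output saturates toward \( -\alpha \). This is a small sketch with illustrative inputs; in float32 the most negative input rounds to exactly \( -\alpha \).

```python
import torch
import torch.nn as nn

alpha = 1.0
z = torch.tensor([-20.0, -5.0, -1.0, 0.0, 1.0])

elu_out = nn.ELU(alpha=alpha)(z)

print("ELU Output:", elu_out)
print("Smallest output:", elu_out.min().item())
```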


6. **Swish**:
   - **Formula**: 
     \[
     \text{Swish}(x) = x \cdot \sigma(x)
     \]
     where \( \sigma(x) \) is the sigmoid function.
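   - **Range**: approximately \( [-0.278, \infty) \); the function is bounded below and unbounded above.
   - **Characteristics**:
     - Smooth, non-monotonic non-linearity that has been reported to match or outperform ReLU in some deep models.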
     - Allows small negative values, which can help in learning.
   - **Drawbacks**:
     - More complex than ReLU, leading to slightly higher computational cost.


In [None]:
# Implementing Swish using PyTorch operations
def swish(x):
    return x * torch.sigmoid(x)

output = swish(x)

print("Swish Output:", output)
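
PyTorch also ships this function as `nn.SiLU` (Sigmoid Linear Unit), so a hand-written version is not strictly needed. The sketch below compares the two on a dense grid and locates the minimum of the function on that grid. The grid bounds and names are illustrative.

```python
import torch
import torch.nn as nn

# Dense grid with a step of 0.01.
z = torch.linspace(-10.0, 10.0, steps=2001)

swish_manual = z * torch.sigmoid(z)
swish_builtin = nn.SiLU()(z)

# Find the smallest value on the grid and where it occurs.
min_value, min_index = swish_manual.min(dim=0)

print("Matches nn.SiLU:", torch.allclose(swish_manual, swish_builtin, atol=1e-6))
print("Minimum value:", min_value.item(), "at z =", z[min_index].item())
```

The minimum sits near \( x \approx -1.28 \) with a value of about \( -0.278 \); for more negative inputs the output rises back toward zero.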


7. **Softmax Function**:
   - **Formula**: 
     \[
     \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
     \]
   - **Range**: \( (0, 1) \) for each element, with all outputs summing to 1.
   - **Characteristics**:
     - Used in the output layer of classification networks to represent probabilities over multiple classes.
   - **Drawbacks**:
     - Can be computationally intensive for a large number of classes due to the exponentiation and normalization operations.


In [None]:
# Example input tensor (logits for a classification task)
x = torch.tensor([2.0, 1.0, 0.1])

# Using the Softmax activation function
softmax = nn.Softmax(dim=0)  # x is 1-D, so dim=0 normalizes over its only axis
output = softmax(x)

print("Softmax Output:", output)
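
In practice, logits usually arrive as a batch of shape `(batch, classes)`, in which case softmax is applied along `dim=1`. The sketch below assumes such a batch and also writes out the max-subtraction form of the formula, which gives the same result while avoiding overflow in the exponentials. The second row uses deliberately large logits; all names are illustrative.

```python
import torch

# Two samples, three classes each. The second row would overflow
# a naive exp() in float32.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [1000.0, 999.0, 998.0]])

probs = torch.softmax(logits, dim=1)
row_sums = probs.sum(dim=1)

# Manual version: subtracting the row maximum leaves the result
# unchanged but keeps every exponent at or below zero.
shifted = logits - logits.max(dim=1, keepdim=True).values
manual = shifted.exp() / shifted.exp().sum(dim=1, keepdim=True)

print("Probabilities:", probs)
print("Row sums:", row_sums)
```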




### Choosing an Activation Function:
- **ReLU and its variants (Leaky ReLU, PReLU, ELU)** are typically the default choice for hidden layers due to their simplicity and effectiveness in practice.
- **Sigmoid** and **Tanh** are generally less favored for hidden layers because they tend to cause vanishing gradients, but they remain standard in specific places, such as the gates of recurrent cells (LSTM, GRU) and, for sigmoid, binary-classification output layers.
- **Softmax** is specifically used in multi-class classification problems at the output layer to produce a probability distribution over classes.
- **Swish** is a more recent innovation and might be chosen in some deep learning applications where its performance benefits are observed.

The choice of activation function can significantly impact the training dynamics and final performance of a neural network, so it’s important to consider the characteristics and drawbacks of each function when designing a model.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
INPUT_DIM = 100 # Number of features

In [None]:
# Generate synthetic data
X, y = make_classification(
    n_samples=5000,
    n_features=INPUT_DIM,
    n_informative=INPUT_DIM,
    n_redundant=0,
    random_state=7
)

In [None]:
X.shape, y.shape

((5000, 100), (5000,))

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
X_train.shape, X_test.shape

((4000, 100), (1000, 100))

In [None]:
# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

In [None]:
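class NeuralNetwork(nn.Module):
    def __init__(self, use_activation):
        super().__init__()
        self.fc1 = nn.Linear(INPUT_DIM, 16)
        self.fc2 = nn.Linear(16, 1)
        # When False, the hidden layer stays purely linear (no ReLU)
        self.use_activation = use_activation

    def forward(self, x):
        # Hidden layer: ReLU is applied only when use_activation is True
        x = F.relu(self.fc1(x)) if self.use_activation else self.fc1(x)
        # Output layer: sigmoid maps the logit to a probability for BCELoss
        x = torch.sigmoid(self.fc2(x))
        return x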
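
Without the ReLU, the two linear layers compose into a single linear map: \( W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) \). The network then behaves like logistic regression, no matter how wide the hidden layer is. The sketch below checks this on standalone layers that mirror the shapes used above (100 inputs, 16 hidden units); `merged` is a hypothetical name for the collapsed layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between.
fc1 = nn.Linear(100, 16)
fc2 = nn.Linear(16, 1)

with torch.no_grad():
    # Build one layer with W = W2 @ W1 and b = W2 @ b1 + b2.
    merged = nn.Linear(100, 1)
    merged.weight.copy_(fc2.weight @ fc1.weight)
    merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)

    batch = torch.randn(5, 100)
    stacked_out = fc2(fc1(batch))
    merged_out = merged(batch)

print("Max difference:", (stacked_out - merged_out).abs().max().item())
```

The two outputs agree up to floating-point rounding, so any accuracy gap between the two models in this notebook comes from the non-linearity and not from the extra parameters.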

In [None]:
# Function to train and evaluate the model
def train_and_evaluate(model):
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop: full-batch gradient descent for 100 epochs
    model.train()
    for epoch in range(100):
        optimizer.zero_grad()
        outputs = model(X_train_tensor)
        # BCELoss expects targets with the same shape as the outputs: (N, 1)
        loss = criterion(outputs, y_train_tensor.view(-1, 1))
        if epoch % 5 == 0:
            print(loss)  # log the training loss every 5 epochs
        loss.backward()
        optimizer.step()

    # Evaluate the model on the test set
    model.eval()
    with torch.no_grad():
        pred = model(X_test_tensor)
        # Threshold the predicted probabilities at 0.5 to get class labels
        predictions = (pred > 0.5).float().numpy().ravel()
        accuracy = accuracy_score(y_test_tensor.numpy(), predictions)

    return accuracy

In [None]:
# Create and train the model without activation functions
model_without_activation = NeuralNetwork(use_activation=False)
accuracy_without_activation = train_and_evaluate(model_without_activation)

tensor(0.9071, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.7036, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.5720, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.4863, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.4303, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3936, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3688, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3513, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3386, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3291, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3219, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3162, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3118, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3082, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3053, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3030, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3011, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2996, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2984, grad_fn=<Bina

In [None]:
# Create and train the model with activation functions
model_with_activation = NeuralNetwork(use_activation=True)
accuracy_with_activation = train_and_evaluate(model_with_activation)

tensor(1.0072, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.8534, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.7328, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.6412, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.5724, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.5201, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.4793, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.4465, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.4188, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3950, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3742, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3557, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3388, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3231, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.3085, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2947, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2817, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2695, grad_fn=<BinaryCrossEntropyBackward0>)
tensor(0.2578, grad_fn=<Bina

In [None]:
print("Accuracy without Activation Functions:", accuracy_without_activation)
print("Accuracy with Activation Functions:", accuracy_with_activation)

Accuracy without Activation Functions: 0.864
Accuracy with Activation Functions: 0.897


Note that adding an activation function does not always improve performance. The extra capacity it gives the model can lead to overfitting, in which case a regularizer such as dropout may help.
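
As a rough illustration of the dropout idea, the sketch below adds an `nn.Dropout` layer after the hidden ReLU. The class name `NeuralNetworkWithDropout` and the drop probability `p=0.5` are assumptions made for this example and were not tuned or evaluated on the data above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralNetworkWithDropout(nn.Module):
    """Hypothetical variant of the notebook's model with dropout."""

    def __init__(self, input_dim=100, hidden_dim=16, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.dropout = nn.Dropout(p=p)  # zeroes hidden units at random in train mode
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # becomes a no-op after model.eval()
        return torch.sigmoid(self.fc2(x))


dropout_model = NeuralNetworkWithDropout()
sample = torch.randn(4, 100)

# In eval mode dropout is disabled, so repeated passes are deterministic.
dropout_model.eval()
with torch.no_grad():
    eval_a = dropout_model(sample)
    eval_b = dropout_model(sample)

print("Output shape:", tuple(eval_a.shape))
print("Deterministic in eval mode:", torch.equal(eval_a, eval_b))
```

Since the input and output shapes match the original model, an instance of this class could be passed to `train_and_evaluate` to compare test accuracy.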