#**Activation Functions in Neural Networks**

https://www.mygreatlearning.com/blog/activation-functions/

Activation functions play a crucial role in neural networks by determining whether a neuron should be activated or not. They introduce non-linearity, allowing networks to learn complex patterns. Without activation functions, a neural network would behave like a simple linear model, limiting its ability to solve real-world problems.


##**What Are Activation Functions?**

An activation function is a mathematical function applied to a neuron’s input to decide its output. It transforms the weighted sum of inputs into an output signal that is passed to the next layer in a neural network. The function’s primary objective is to introduce non-linearity into the network, enabling it to learn complex representations.

Without activation functions, a neural network could only model linear relationships, making it ineffective for solving non-trivial problems such as image classification, speech recognition, and natural language processing.

##**Why Are Activation Functions Necessary?**

Neural networks consist of multiple layers where neurons process input signals and pass them to subsequent layers. If the activation functions are removed, each layer computes only a weighted sum, and a stack of such layers is equivalent to a single linear transformation. The network then cannot learn complex, non-linear features, no matter how many layers it has.

Key reasons why activation functions are necessary:

1. Introduce non-linearity: Real-world problems often involve complex, non-linear relationships. Activation functions enable neural networks to model these relationships.
2. Enable hierarchical feature learning: Deep networks extract multiple levels of features from raw data, making them more powerful for pattern recognition.

3. Prevent network collapse: Without activation functions, every layer would perform just a weighted sum, reducing the depth of the network into a single linear model.
4. Improve convergence during training: Some activation functions (for example, ReLU and its variants) preserve gradient flow better than others, which often leads to faster and more stable learning.
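
The "network collapse" point can be checked numerically. The following is a minimal sketch, assuming two small dense layers with arbitrary random weights (the shapes, the seed, and the variable names are illustrative, not from the original article): with no activation in between, the two layers give the same output as one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Two "layers" with no activation function (shapes are illustrative)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers, one after the other
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer_out = W_eq @ x + b_eq

collapsed = np.allclose(two_layer_out, one_layer_out)
print(collapsed)
```

Inserting any non-linear function between the two layers breaks this algebraic identity, which is what lets depth add expressive power.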

##**Types of Activation Functions**
1. Linear Activation Function
<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-1.png">

*Formula: f(x) = ax*

* The output is simply the input scaled by a constant a.
* Because it adds no non-linearity, a network that uses only linear activations cannot learn more than a linear model can.
* Rarely used in hidden layers, since the whole network then behaves like a linear regression model.
* Use case: Often used in the output layer of regression models, where the prediction is a continuous value.

2. Sigmoid Activation Function

<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-4.png">

<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-3.png">

* Outputs values between 0 and 1.
* Useful for probability-based models like binary classification.
* Advantages: Smooth gradient, well-defined range, and interpretable output as probabilities.
* Drawbacks: Prone to the vanishing gradient problem, leading to slow learning in deep networks. It is also computationally expensive due to the exponentiation operation.
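
To make the vanishing gradient drawback concrete, here is a small sketch. It relies on the fact that the sigmoid derivative is at most 0.25; the depth `n_layers` is a hypothetical value chosen for illustration, and the bound ignores the weight terms that also appear in backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_prime(0.0)         # the derivative peaks at z = 0
saturated_grad = sigmoid_prime(10.0)  # nearly flat far away from 0

# Upper bound on the product of activation derivatives across n_layers
n_layers = 10  # hypothetical depth
bound = max_grad ** n_layers

print(max_grad, saturated_grad, bound)
```

Even in the best case, ten stacked sigmoid layers scale the gradient by less than one millionth through the activation derivatives alone.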

3. Tanh (Hyperbolic Tangent) Activation Function
<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-6.png">

<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-5.png">

* Outputs values between -1 and 1.
* Outputs are centered around zero, which helps gradient flow.
* Advantages: Because its output is zero-centered and its gradient is steeper than sigmoid's, tanh usually propagates gradients better than sigmoid.
* Drawbacks: It still saturates for inputs of large magnitude, so gradients can vanish in deep networks, although less severely than with sigmoid.
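
The relationship between tanh and sigmoid can be verified directly. This sketch uses the standard identity tanh(z) = 2·sigmoid(2z) − 1 and compares the gradient at zero with the gradient at a saturated input; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)  # arbitrary sample points

# tanh is a rescaled, shifted sigmoid
rescaled_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0
identity_holds = np.allclose(np.tanh(z), rescaled_sigmoid)

# Gradient of tanh: largest at 0, close to zero once the unit saturates
tanh_grad_at_zero = 1.0 - np.tanh(0.0) ** 2
tanh_grad_saturated = 1.0 - np.tanh(5.0) ** 2

print(identity_holds, tanh_grad_at_zero, tanh_grad_saturated)
```

The peak gradient of 1.0 (versus 0.25 for sigmoid) is why tanh tends to train better, while the near-zero gradient at z = 5 shows that saturation remains a problem.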

4. ReLU (Rectified Linear Unit) Activation Function
<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-8.png">

<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-7.png">

* The most commonly used activation function in deep learning.
* Introduces non-linearity while mitigating the vanishing gradient problem, because the gradient is exactly 1 for positive inputs.
* Advantages: Computationally efficient and does not saturate for positive inputs.
* Drawbacks: Suffers from the “dying ReLU” problem: a neuron whose input is negative for every example outputs zero, receives zero gradient, and stops learning.
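
The dying ReLU effect can be reproduced with a single neuron. This is a sketch under stated assumptions: the inputs are random, and the weights and the large negative bias are hypothetical values picked so that the pre-activation is negative for every example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 hypothetical inputs with 3 features

w = np.array([0.5, -0.3, 0.2])  # hypothetical weights
b = -50.0                       # large negative bias pushes every input below 0

z = X @ w + b
activations = relu(z)

# Gradient of the summed output with respect to w (upstream gradient of 1)
grad_w = relu_prime(z) @ X

print(activations.max(), grad_w)
```

Every activation is zero and so is the gradient for the weights, which means gradient descent has no signal to move this neuron out of the dead region.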

5. Leaky ReLU Activation Function

<img src="https://dotnettutorials.net/wp-content/uploads/2022/06/word-image-27976-9.png">

<img src="https://www.mygreatlearning.com/blog/wp-content/uploads/2020/08/leaky-relu-formula.png.webp">

* A modified version of ReLU to allow small gradients for negative inputs.
* Helps to prevent dying neurons.
* Advantages: Maintains non-linearity while addressing ReLU’s limitation.
* Drawbacks: Choosing the best negative slope value is not always straightforward. Performance varies across different datasets.
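
A quick sketch of how the negative slope keeps gradients alive. The slope `alpha = 0.01` is a common choice but is an assumption here, as are the function names and sample values.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * z)

def leaky_relu_prime(z, alpha=0.01):
    z = np.asarray(z, dtype=float)
    # The slope at exactly z = 0 is taken to be alpha by convention here
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 1.5])
outputs = leaky_relu(z)
grads = leaky_relu_prime(z)

print(outputs)
print(grads)
```

Unlike plain ReLU, the gradient for negative inputs is small but non-zero, so a neuron that drifts into the negative region can still recover.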

6. ELU (Exponential Linear Unit) Activation Function
<img src="https://raw.githubusercontent.com/mmuratarat/mmuratarat.github.io/master/_posts/images/elu_plot.png">
<img src="https://www.mygreatlearning.com/blog/wp-content/uploads/2020/08/elu-activation-formula.png.webp">

* Mitigates the dying ReLU problem: negative inputs produce small negative outputs instead of zero, so the gradient does not vanish entirely.
* Advantages: Smooth around zero, gives mean activations closer to zero, and can speed up learning.
* Drawbacks: Computationally more expensive than ReLU, which can be an issue in large-scale applications.
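
For reference, a minimal NumPy sketch of ELU using its usual definition: f(x) = x for x > 0 and alpha·(exp(x) − 1) otherwise. The default `alpha = 1.0` and the function names are assumptions made for this example.

```python
import numpy as np

def elu(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive z,
    # whose branch is discarded by np.where anyway
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def elu_prime(z, alpha=1.0):
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

z = np.array([-1000.0, -1.0, 0.0, 2.0])
outputs = elu(z)
grads = elu_prime(z)

print(outputs)
print(grads)
```

Outputs for very negative inputs level off at −alpha rather than growing without bound, and the gradient there decays smoothly toward zero instead of switching off abruptly.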


7. Softmax Activation Function

<img src="https://images.contentstack.io/v3/assets/bltac01ee6daa3a1e14/blte5e1674e3883fab3/65ef8ba4039fdd4df8335b7c/img_blog_image1_inline_(2).png?width=1024&disable=upscale&auto=webp">

<img src="https://velog.velcdn.com/images/chiroya/post/b520fcca-ce29-4b02-9392-5de67767e6b4/image.png">

<img src="https://www.mygreatlearning.com/blog/wp-content/uploads/2020/08/softmax-activation-function-formula.png.webp">

* Used in multi-class classification problems.
* Converts logits into probabilities.
* Advantages: Ensures sum of probabilities equals 1, making it interpretable for classification tasks.
* Drawbacks: Computationally expensive and sensitive to outliers, as large input values can dominate the output.
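
The sensitivity to large inputs is easy to see numerically. The sketch below uses a max-shifted softmax (a standard trick for numerical stability, not something the original article specifies) and made-up score vectors.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    # Shifting by the max does not change the result but avoids overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

balanced = softmax([1.0, 2.0, 3.0])
dominated = softmax([1.0, 2.0, 10.0])           # one outlier score
shifted = softmax([1000.0, 1001.0, 1002.0])     # would overflow without the shift

print(balanced)
print(dominated)
print(shifted)
```

A single large logit takes almost all of the probability mass, and because softmax depends only on differences between scores, the shifted vector gives the same probabilities as the balanced one.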


<img src="https://www.mygreatlearning.com/blog/wp-content/uploads/2020/08/choosing-activation-function-1.png.webp">

In [None]:
import numpy as np


In [None]:
# sigmoid function
def sigmoid(z):
  return 1.0 / (1 + np.exp(-z))
# Derivative of sigmoid function
def sigmoid_prime(z):
  return sigmoid(z) * (1-sigmoid(z))

In [None]:
# tanh activation function
# Definition: (exp(z) - exp(-z)) / (exp(z) + exp(-z))
def tanh(z):
    # np.tanh computes the same value without overflowing for large |z|
    return np.tanh(z)
# Derivative of tanh activation function
def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)

In [None]:
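# ReLU activation function
def relu(z):
  # np.maximum works element-wise, so z can be a scalar or a NumPy array
  return np.maximum(0, z)
# Derivative of ReLU activation function (taken as 0 at z = 0)
def relu_prime(z):
  return np.where(np.asarray(z) > 0, 1.0, 0.0)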


In [None]:
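# Leaky ReLU activation function (alpha is the slope for negative inputs)
def leakyrelu(z, alpha):
    z = np.asarray(z)
    # np.where works element-wise, so z can be a scalar or a NumPy array
    return np.where(z > 0, z, alpha * z)
# Derivative of Leaky ReLU activation function (taken as alpha at z = 0)
def leakyrelu_prime(z, alpha):
    return np.where(np.asarray(z) > 0, 1.0, alpha)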

In [None]:
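# softmax
def softmax(x):
    """Compute softmax values for the scores in x."""
    x = np.asarray(x, dtype=float)
    # Subtracting the max leaves the result unchanged but prevents overflow in exp
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

scores = [3.0, 1.0, 0.2]
print(softmax(scores))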

[0.8360188  0.11314284 0.05083836]


###**Activation functions using TensorFlow**

In [None]:
import tensorflow as tf

# ReLU activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])



In [None]:
import tensorflow as tf

# Sigmoid activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='sigmoid', input_shape=(32,)),
    tf.keras.layers.Dense(1, activation='sigmoid')  # Used for binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
import tensorflow as tf

# Tanh activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='tanh', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
import tensorflow as tf

# ELU activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='elu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
import tensorflow as tf

# Leaky ReLU activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(32,)),
    tf.keras.layers.LeakyReLU(alpha=0.3),  # Allows a small slope for negative values
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])



In [None]:
import tensorflow as tf

# Softmax activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')  # For multi-class classification
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
import tensorflow as tf

# Swish activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='swish', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
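
Swish appears only in the Keras example above, so here is a rough NumPy counterpart to the hand-written functions earlier in the notebook. It assumes the common definition swish(z) = z · sigmoid(z); the function name and sample values are illustrative.

```python
import numpy as np

def swish(z):
    z = np.asarray(z, dtype=float)
    # z * sigmoid(z), written as a single division
    return z / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.0, 1.0])
outputs = swish(z)

print(outputs)
```

Unlike ReLU, Swish is smooth and lets small negative values through, which is visible in the negative output for z = -1.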