#  Activation Functions Crash Course in 75 Minutes | Deep Learning with Tensorflow

Written by: M.Danish Azeem\
Date: 02.10.2024\
Email: danishazeem365@gmail.com

Assignment #(1)  
 
Time stamp   58:05

# What is the difference between the SoftMax and Sigmoid functions?


Both SoftMax and Sigmoid functions are activation functions used in neural networks, but they have distinct functionalities and applications:

**Sigmoid Function:**

* **Output:** Values between 0 and 1, representing probabilities for a **single class**.
* **Applications:** Useful for binary classification problems (e.g., predicting if an image is a cat or not a cat).
* **Interpretation:** Output closer to 1 indicates higher probability of belonging to the positive class, closer to 0 indicates lower probability.

**SoftMax Function:**

* **Output:** Values between 0 and 1, representing **probabilities for all possible classes**, where values sum to 1.
* **Applications:** Used for **multi-class classification problems** (e.g., predicting what type of clothing is in an image).
* **Interpretation:** Highest output value corresponds to the class with the highest probability.

**Key Differences:**

| Feature | Sigmoid | SoftMax |
|---|---|---|
| Output range | 0-1 | 0-1, sum to 1 |
| Number of classes | Single | Multiple |
| Applications | Binary classification | Multi-class classification |
| Interpretation | Probability for one class | Probabilities for all classes relative to each other |
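
The comparison above can be made concrete with a small sketch in plain Python. The input scores are hypothetical values chosen only for illustration; the point is that Sigmoid turns one score into one probability, while SoftMax turns a list of scores into probabilities that sum to 1.

```python
import math

def sigmoid(x):
    """Map a single score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits), chosen only for illustration.
spam_score = 0.8                # one score for "spam or not"
class_scores = [2.0, 1.0, 0.1]  # e.g. cat, dog, bird

p_spam = sigmoid(spam_score)
class_probs = softmax(class_scores)

print(f"P(spam) = {p_spam:.3f}")
print("class probabilities:", [round(p, 3) for p in class_probs])
print("sum of class probabilities:", round(sum(class_probs), 6))
```

The largest score (the first one here) receives the largest SoftMax probability, which matches the interpretation given in the table.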

**Choosing the right function:**

* Use Sigmoid for a yes/no decision about a single class (e.g., is it spam or not?).
* Use SoftMax for predicting one class from multiple options (e.g., cat, dog, bird).

**Additional notes:**

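* Sigmoid can also be applied independently to each output neuron. This suits multi-label problems, where several classes can be true at once. For mutually exclusive classes, SoftMax is generally preferred because its outputs form a single probability distribution, which pairs naturally with categorical cross-entropy loss.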
* Other activation functions exist, each with its own strengths and weaknesses depending on the specific task and network architecture.
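
As a rough illustration of the first note, the sketch below applies Sigmoid independently to each output and compares the result with SoftMax on the same scores. The scores are hypothetical. The independent Sigmoid outputs are not constrained to sum to 1, while the SoftMax outputs are.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three output neurons.
scores = [2.0, 1.0, 0.1]

independent = [sigmoid(s) for s in scores]  # each output judged on its own
joint = softmax(scores)                     # outputs compete with each other

print("independent sigmoids:", [round(p, 3) for p in independent],
      "sum =", round(sum(independent), 3))
print("softmax:             ", [round(p, 3) for p in joint],
      "sum =", round(sum(joint), 3))
```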



Assignment #(2)  
 
Time stamp   58:05

# How do we know which activation function to use in the hidden or output layers?


Choosing the right activation function for hidden and output layers in neural networks is an important decision that can significantly impact your model's performance. Here are some key factors to consider:

**Hidden Layers:**

* **Non-linearity:** In most cases, you want non-linear activation functions in hidden layers. This allows the network to learn complex relationships and patterns in the data, as a linear model (with linear activations) can only represent linear relationships.
* **Common choices:** ReLU (Rectified Linear Unit) is the most popular choice because it is cheap to compute and mitigates the vanishing gradient problem. Other options include Leaky ReLU, ELU (Exponential Linear Unit), and Tanh (Hyperbolic Tangent).
* **Considerations:** ReLU neurons can "die" (output 0 for every input) when their inputs stay negative; Leaky ReLU addresses this by keeping a small slope for negative inputs. Tanh is bounded between -1 and 1, while ELU is bounded only from below and is unbounded for positive inputs. Experimentation is often necessary to find the best option for your specific problem.
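
The non-linearity point can be checked with a tiny sketch. It assumes two one-unit layers with hypothetical weights and biases. Without an activation between them, the two layers compute exactly the same mapping as a single linear layer; inserting ReLU between them changes that.

```python
def linear(x, w, b):
    """A one-unit linear layer: w * x + b."""
    return w * x + b

def relu(x):
    return max(0.0, x)

# Hypothetical weights and biases for two one-unit layers.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    return linear(linear(x, w1, b1), w2, b2)

def collapsed_layer(x):
    # The same mapping rewritten as one linear layer.
    return linear(x, w2 * w1, w2 * b1 + b2)

def with_relu(x):
    # ReLU between the layers breaks the equivalence.
    return linear(relu(linear(x, w1, b1)), w2, b2)

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
for x in inputs:
    print(x, two_linear_layers(x), collapsed_layer(x), with_relu(x))
```

For inputs where the first layer's output is negative, the ReLU version differs from the purely linear one, which is what lets deeper networks represent non-linear relationships.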

**Output Layer:**

* **Output range:** The activation function should match the expected output range of your problem.
* **Binary classification:** Use Sigmoid when the output is a single probability between 0 and 1 (e.g., is it spam or not?).
* **Multi-class classification:** Use SoftMax when you have multiple mutually exclusive classes, and the outputs represent probabilities for each class, summing to 1 (e.g., predicting the type of clothing in an image).
* **Regression:** Use a linear activation function (no activation) for continuous outputs (e.g., predicting house prices).
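
The output-layer rules above can be summarised in a short helper. This is only a sketch: the function name `output_activation` and the task labels `"binary"`, `"multiclass"`, and `"regression"` are hypothetical names introduced for illustration, not part of any library.

```python
import math

def output_activation(task, logits):
    """Apply the output activation suggested for a task (hypothetical helper)."""
    if task == "binary":
        # Sigmoid on a single logit gives one probability.
        return [1.0 / (1.0 + math.exp(-logits[0]))]
    if task == "multiclass":
        # SoftMax over all logits gives probabilities that sum to 1.
        exps = [math.exp(z) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]
    if task == "regression":
        # Linear (no activation): values pass through unchanged.
        return list(logits)
    raise ValueError(f"unknown task: {task}")

print(output_activation("binary", [0.3]))
print(output_activation("multiclass", [1.0, 2.0, 3.0]))
print(output_activation("regression", [250000.0]))
```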

**Additional Tips:**

* Consider the activation function's impact on the loss function used during training, as some combinations work better together.
* Experiment with different options and evaluate your model's performance on both training and validation data to find the best fit.
* Pay attention to potential issues like vanishing/exploding gradients with certain activation functions and choose accordingly.

Remember, there's no one-size-fits-all answer. The optimal activation function depends on your specific task, dataset, and network architecture. Be prepared to experiment and compare different options to achieve the best results for your model!



Assignment #(3)  
 
Time stamp   01:10:41

# What do mutually exclusive and mutually inclusive mean?


Mutually exclusive and mutually inclusive are terms used to describe the relationship between events or sets. Here's a breakdown:

**Mutually exclusive:**

* Events are **mutually exclusive** if they cannot occur **at the same time**; sets are mutually exclusive if they share no elements. They are like disjoint circles that never overlap.
* Examples:
    * Rolling a 6 and rolling a 1 on the same roll of a die are mutually exclusive (they can't happen simultaneously).
    * Being a mammal and being a bird are mutually exclusive (you can't be both).
    * Sets {even numbers} and {odd numbers} are mutually exclusive (no number can be both even and odd).

**Mutually inclusive:**

* Events or sets are **mutually inclusive** if they are not exclusive, meaning they **can occur together, overlap**, or even be identical.
* Examples:
    * Picking a red card and picking a diamond from a deck are not mutually exclusive (every diamond is also a red card).
    * Enjoying music and enjoying movies are not mutually exclusive (someone can enjoy both).
    * Sets {numbers less than 5} and {even numbers} are mutually inclusive (some numbers like 2 are in both sets).

**Key differences:**

| Feature | Mutually exclusive | Mutually inclusive |
|---|---|---|
| Overlap | No overlap | Can overlap |
| Occurrence | Cannot occur together | Can occur together |
| Example | Rolling even vs. odd | Enjoying music vs. movies |
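
Python sets give a quick way to check the two cases. The sketch below restricts the number sets from the examples to the range 1 to 10, which is an assumption made only to keep the sets finite.

```python
# Restrict the examples to the numbers 1..10 (an assumption for illustration).
numbers = range(1, 11)

evens = {n for n in numbers if n % 2 == 0}
odds = {n for n in numbers if n % 2 == 1}
less_than_5 = {n for n in numbers if n < 5}

# Mutually exclusive: no shared elements.
print("evens and odds disjoint:", evens.isdisjoint(odds))

# Mutually inclusive: the sets overlap.
overlap = evens & less_than_5
print("evens and numbers less than 5 share:", sorted(overlap))
```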

**Importance:**

* Understanding these concepts is crucial in various fields, including probability, statistics, logic, and machine learning.
* They help analyze events, classify data, and make informed decisions based on potential occurrences.



Assignment #(4)  
 
Time stamp   01:13:35

# Write down the mathematical formula, advantages and disadvantages of each activation function.



## Activation Functions in Neural Networks: Formulas, Advantages, and Disadvantages

Here's a breakdown of the common activation functions, including their formulas, advantages, and disadvantages:

**1. Linear Activation Function:**

* **Formula:** `f(x) = x`
* **Advantages:**
    * Simple and computationally efficient.
    * Preserves all information from the input.
* **Disadvantages:**
    * Limited expressiveness due to linearity.
    * Stacking several linear layers collapses into a single linear layer, so extra depth adds no expressive power.
    * Not suitable for non-linear problems.

**2. Step/Binary Activation Function:**

* **Formula:** `f(x) = { 1 if x >= 0, 0 if x < 0 }`
* **Advantages:**
    * Simple and interpretable.
    * Useful for binary classification tasks.
* **Disadvantages:**
    * Creates hard boundaries, potentially losing information.
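    * Its gradient is zero everywhere except at the threshold (where it is undefined), so it cannot be trained with gradient-based backpropagation or used where smooth transitions are required.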
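
A minimal sketch of the step function shows the hard boundary in practice. The helper `numerical_slope` is a hypothetical name for a simple central-difference estimate. Inputs just above the threshold and far above it give the same output, and away from the threshold the estimated slope is zero.

```python
def step(x):
    """Binary step: 1 for x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def numerical_slope(f, x, h=1e-3):
    """Central-difference estimate of the slope of f at x (hypothetical helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [-5.0, -0.01, 0.01, 5.0]:
    print(f"x = {x:6.2f}  step = {step(x)}  slope = {numerical_slope(step, x)}")
```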

**3. Sigmoid Function:**

* **Formula:** `f(x) = 1 / (1 + exp(-x))`
* **Advantages:**
    * Outputs values between 0 and 1, suitable for probability interpretation.
    * Smooth and differentiable.
* **Disadvantages:**
    * Computationally expensive compared to ReLU.
    * Suffers from the vanishing gradient problem for inputs of large magnitude, both positive and negative.
    * Outputs saturate near 0 and 1, where the gradient is close to zero, which slows learning in deeper layers.
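
The saturation behaviour can be seen by evaluating the Sigmoid and its derivative, `s(x) * (1 - s(x))`, at a few points. This is a sketch with illustrative inputs; the last lines ignore the weights between layers, which is a simplifying assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.5f}")

# Backpropagation multiplies one such factor per layer (weights ignored here).
# Even the largest possible factor, 0.25 at x = 0, shrinks quickly with depth.
best_case_after_10_layers = 0.25 ** 10
print("best-case gradient factor after 10 layers:", best_case_after_10_layers)
```

The gradient peaks at 0.25 for an input of 0 and falls towards zero on both sides.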

**4. Tanh Function:**

* **Formula:** `f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))`
* **Advantages:**
    * Outputs values between -1 and 1, useful for bipolar problems.
    * Smooth and differentiable.
    * Zero-centered outputs, which often make optimization easier than with Sigmoid; the gradient is close to 1 for inputs near zero.
* **Disadvantages:**
    * Similar computational cost to Sigmoid.
    * Outputs still saturate near -1 and 1, limiting learning capacity.
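
The Tanh formula above can be written directly and compared with Python's built-in `math.tanh`. The sketch also evaluates the derivative, `1 - tanh(x)^2`, to show both the zero-centered shape and the saturation for larger inputs. The sample inputs are illustrative.

```python
import math

def tanh_from_formula(x):
    """Tanh written out from its exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in [-5.0, -0.5, 0.0, 0.5, 5.0]:
    print(f"x = {x:5.1f}  tanh = {tanh_from_formula(x):8.5f}  gradient = {tanh_grad(x):.5f}")
```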

**5. ReLU Function:**

* **Formula:** `f(x) = max(0, x)`
* **Advantages:**
    * Simple and computationally efficient.
    * Gradient is exactly 1 for positive inputs, which mitigates the vanishing gradient problem.
    * Widely used due to its effectiveness in many tasks.
* **Disadvantages:**
    * Outputs 0 with zero gradient for all negative inputs, so that information is discarded.
    * May lead to the "dead ReLU" problem: a neuron whose inputs are always negative stops updating and outputs 0 permanently.
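
A short sketch of ReLU and its gradient illustrates both sides. The gradient at exactly 0 is mathematically undefined; treating it as 0 here is an assumed convention.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is taken as 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in [-3.0, -0.1, 0.0, 0.1, 2.5]:
    print(f"x = {x:5.1f}  relu = {relu(x):4.1f}  gradient = {relu_grad(x)}")
```

Negative inputs produce an output of 0 and a gradient of 0, which is why a neuron stuck in that region stops learning.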

**6. Extended ReLU Functions:**

* **Leaky ReLU:** `f(x) = max(alpha * x, x)` (where alpha is a small positive value)
* **Parametric ReLU:** `f(x) = max(0, x) + a * min(0, x)` (where a is a learned parameter)
* **Advantages:**
    * Address the "dead ReLU" problem by allowing small positive gradients for negative inputs.
    * Parametric ReLU offers more flexibility in learning the activation function.
* **Disadvantages:**
    * Leaky ReLU adds a hyperparameter (alpha) that needs tuning, and Parametric ReLU adds extra parameters that must be learned.
    * May be slightly less computationally efficient than ReLU.
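
Leaky ReLU can be sketched as follows. The default `alpha=0.01` is an assumed, commonly used value rather than something fixed by the formula.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: max(alpha * x, x) for a small positive alpha below 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU; the value at x == 0 is taken as alpha by convention."""
    return 1.0 if x > 0 else alpha

for x in [-2.0, -0.5, 0.5, 3.0]:
    print(f"x = {x:5.1f}  leaky_relu = {leaky_relu(x):7.3f}  gradient = {leaky_relu_grad(x)}")
```

Unlike plain ReLU, the gradient for negative inputs is small but not zero, so the neuron can still update.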

**7. Softmax Function:**

* **Formula:** `f_i(x) = exp(x_i) / sum_j exp(x_j)` (for each output i, with the sum taken over all classes j)
* **Advantages:**
    * Outputs a probability distribution for multi-class classification.
    * Ensures outputs sum to 1, representing mutually exclusive classes.
* **Disadvantages:**
    * Computationally expensive compared to ReLU.
    * Numerically unstable for large inputs, requiring careful implementation.
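
The numerical stability issue can be reproduced in plain Python. One common remedy, sketched here, is to subtract the largest score before exponentiating. This does not change the result mathematically, because the shift cancels in the ratio. The function names and the large scores are hypothetical and chosen for illustration.

```python
import math

def naive_softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores):
    # Shifting by the maximum keeps every exponent at or below zero,
    # so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

large_scores = [1000.0, 1001.0, 1002.0]  # hypothetical large logits

try:
    naive_softmax(large_scores)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive version overflowed:", naive_overflowed)
print("stable version:", [round(p, 4) for p in stable_softmax(large_scores)])
```

For small scores the two versions agree, so the stable form can be used everywhere.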

**Choosing the right activation function depends on your specific task, dataset, and network architecture.** Experiment with different options and evaluate your model's performance to find the best fit!