# Parallelising Neural Network Training with Keras

## Choosing activation functions for multilayer networks

* Technically, one can use **any function** as an activation function in multilayer neural networks as long as it is **differentiable**
* One can even use **linear** activation functions, as in Adaline, but ...


- it would **not** be very useful to use linear activation functions for both hidden and output layers
- to tackle complex problems, one needs to introduce **non-linearity**
- a **composition (or sum) of linear functions** yields only **another** linear function, so stacked linear layers collapse into a single linear layer
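
The last point can be checked numerically. The following is a minimal sketch, assuming two fully connected layers with identity (linear) activations; the layer sizes, the random seed, and the names `W1`, `b1`, `W2`, `b2` are illustrative choices, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with linear (identity) activations; shapes are illustrative.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output layer: 4 -> 2

x = rng.normal(size=3)  # one input sample

# Forward pass through both linear layers
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer_out = W @ x + b

print(np.allclose(two_layer_out, one_layer_out))  # True
```

However many linear layers are stacked, the result can be rewritten as one weight matrix and one bias vector, so depth adds no expressive power without a non-linear activation.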


### Activation Functions: Pros and Cons

| **Activation Function** | **Pros**                               | **Cons**                                    |
|-------------------------|----------------------------------------|---------------------------------------------|
| **Linear**              | Simple, good for regression            | No non-linearity, cannot handle complex data|
| **Unit Step**           | Simple, useful in binary classification| Not differentiable, poor for gradient-based learning |
| **Sign**                | Easy to compute                        | Non-differentiable, limits model complexity |
| **Piece-wise Linear**    | Used in specific algorithms (SVMs)     | More complex, limited use in neural networks|
| **Logistic (Sigmoid)**  | Smooth output, good for probabilities  | Vanishing gradient problem, slow convergence|
| **Tanh (Hyperbolic Tangent)** | Zero-centered output, stronger gradient | Suffers from vanishing gradients          |
| **ReLU**                | Efficient, fast convergence, avoids vanishing gradients | Can "die" for negative inputs, not zero-centered |


<img src="./images/overview_actfunc.png" width="600"/>

* The **logistic activation function** (often called the *sigmoid function*) mimics the concept of a neuron in a brain most closely - think of it as the probability of whether a neuron fires or not
* However, logistic activation functions can be problematic
    - When the net input $\textbf{z}$ is **highly negative**, $\phi{(\textbf{z})}$ is close to zero
    - When $\phi{(\textbf{z})}$ is close to zero (or close to one), its derivative $\phi{(\textbf{z})}(1 - \phi{(\textbf{z})})$ is also close to zero, so the neural network learns **very slowly**
    - Slower learning makes it more likely that the neural network **gets trapped in local minima** during training
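
To make the saturation effect concrete, here is a small sketch that evaluates the logistic function and its derivative at a few net inputs. The helper names `logistic` and `logistic_grad` and the chosen values of `z` are assumptions made for illustration.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: phi(z) * (1 - phi(z))."""
    s = logistic(z)
    return s * (1.0 - s)

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

print("phi(z):  ", logistic(z))
print("phi'(z): ", logistic_grad(z))
```

The derivative peaks at 0.25 for $z = 0$ and drops to roughly $4.5 \times 10^{-5}$ at $z = \pm 10$, which is why weight updates become tiny once a unit saturates.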

### Estimating class probabilities in multiclass classification via the ``softmax`` function

* In previous sections: obtain a class label using the ``argmax`` function
* The ``softmax`` function is in fact a soft form of the ``argmax`` function; instead of giving a single class index, it provides the **probability of each class**
* The ``softmax`` function allows for computation of **meaningful** class probabilities in **multiclass** settings (multinomial logistic regression)

In ``softmax``, the probability that a sample with net input $z$ belongs to the $i$th class is the exponential of that class's net input $z_i$ divided by a normalization term, namely the sum of the exponentials of all $M$ net inputs.

We do not use ``softmax`` in hidden layers. The neurons in a hidden layer are supposed to activate independently, each responding to a different pattern in the input. Softmax forces the neurons to compete with each other because their outputs must add up to 1, which interferes with learning distinct features and limits the flexibility of the network.


 <img src="./images/softmax_eq.png" width="500"/>
 
 - $p(y = i \mid z)$  
  → "The probability that the output $y$ equals class $i$, given the input vector $z$."

- $e^{z_i}$  
  → "Take the exponential of the score (logit) $z_i$ for class $i$."

- $\sum_{j=1}^{M} e^{z_j}$  
  → "Compute the sum of exponentials of all class scores $z_j$, where $M$ is the total number of classes."

- $\frac{e^{z_i}}{\sum_{j=1}^{M} e^{z_j}}$  
  → "Divide the exponential of class $i$ by the sum of exponentials of all classes, to normalize into a probability."

The ``softmax`` function coded in Python:

In [1]:
import numpy as np

def softmax(z):
    # subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example input vector (logits)
Z = np.array([2.0, 1.0, 0.1])

y_probas = softmax(Z)
print("Probabilities:\n", y_probas)
print("Sum of probabilities:", np.sum(y_probas))


Probabilities:
 [0.65900114 0.24243297 0.09856589]
Sum of probabilities: 1.0


In [5]:
np.sum(y_probas)

1.0

In [6]:
np.argmax(y_probas)

0
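
The code comment about subtracting the maximum deserves a short illustration. The sketch below compares a naive softmax with the shifted version on large logits; the function names `softmax_naive` and `softmax_stable` and the input values are hypothetical and chosen only to trigger overflow.

```python
import numpy as np

def softmax_naive(z):
    # Direct translation of the formula; exp() overflows for large z
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_stable(z):
    # Subtracting max(z) leaves the result unchanged mathematically,
    # because the common factor exp(-max(z)) cancels in the ratio
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

big = np.array([1000.0, 999.0, 998.0])

# Silence the overflow/invalid warnings raised by the naive version
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)
stable = softmax_stable(big)

print("naive: ", naive)    # all nan, since exp(1000) overflows to inf
print("stable:", stable)
```

Because only the differences between logits matter, the stable version returns the same probabilities for `[1000, 999, 998]` as for `[2, 1, 0]`.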

##  Example: Softmax with 3 values

We want to compute the softmax for:

$z = [2.0, \; 1.0, \; 0.1]$

---

### Step 1: Exponentiate each value

$e^{2.0} \approx 7.389$  
$e^{1.0} \approx 2.718$  
$e^{0.1} \approx 1.105$

So:  

$e^z = [7.389, \; 2.718, \; 1.105]$

---

### Step 2: Compute denominator (sum of exponentials)

$\text{denominator} = 7.389 + 2.718 + 1.105 = 11.212$



---

### Step 3: Divide each exponential by denominator

$\text{softmax}(z)_1 = \frac{7.389}{11.212} \approx 0.659$  

$\text{softmax}(z)_2 = \frac{2.718}{11.212} \approx 0.242$  

$\text{softmax}(z)_3 = \frac{1.105}{11.212} \approx 0.099$

---

###  Final Result

$\text{softmax}(z) = [0.659, \; 0.242, \; 0.099]$

Notice:  

$0.659 + 0.242 + 0.099 \approx 1.0$


### Broadening the output spectrum using a hyperbolic tangent

* Another *sigmoid function* that is often used in the **hidden layers** of artificial neural networks is the **hyperbolic tangent** (commonly known as ``tanh``)
* ``tanh`` can be interpreted as a rescaled version of the logistic function: $\tanh(z) = 2\,\phi_{\text{logistic}}(2z) - 1$

<img src="./images/log_&_tanh.png" width="700"/>


**Advantage of the hyperbolic tangent over the logistic function**

* It has a **broader output spectrum** and ranges in the **open interval** (-1, 1)
* This can improve the convergence of the backpropagation algorithm ([C. M. Bishop, *Neural Networks for Pattern Recognition*, Oxford University Press, 1995, pp. 500-501](https://www.microsoft.com/en-us/research/wp-content/uploads/1996/01/neural_networks_pattern_recognition.pdf))
* In contrast, the logistic function returns an output signal that ranges in the open interval (0, 1)
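
The relationship between the two functions can be verified numerically. This is a minimal sketch: the helper `logistic` and the sample grid `z` are assumptions, and the identity checked is the standard one, $\tanh(z) = 2\,\phi_{\text{logistic}}(2z) - 1$.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)

# tanh expressed as a shifted and rescaled logistic function
rescaled = 2.0 * logistic(2.0 * z) - 1.0
print(np.allclose(np.tanh(z), rescaled))  # True

# Output ranges on this grid: logistic stays in (0, 1), tanh in (-1, 1)
print("logistic:", logistic(z).min(), logistic(z).max())
print("tanh:    ", np.tanh(z).min(), np.tanh(z).max())
```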

### Rectified linear unit activation

* ``tanh`` and ``logistic`` activations suffer from the **vanishing gradient problem**
* This means the **derivative of the activation** with respect to the net input **diminishes** as $|z|$ becomes large
* As a result, **learning the weights during the training phase** becomes **very slow** because the gradient terms may be **very close to zero**
* ReLU activation addresses this issue

Mathematically, ReLU is defined as follows:

<img src="./images/relu.png" width="400"/>

* ReLU is still a nonlinear function that is good for learning complex functions with neural networks
* Besides this, the derivative of ReLU with respect to its input is always 1 for positive input values (and 0 for negative ones)
* Therefore, it largely **mitigates** the problem of vanishing gradients for active units, making it **suitable for deep neural networks**
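
A rough way to see the difference is to multiply the activation derivative along a chain of layers. The sketch below deliberately ignores the weights and uses one fixed positive net input, so it is an illustration under simplifying assumptions rather than a full backpropagation; `relu`, `relu_grad`, `logistic_grad`, `z`, and `n_layers` are hypothetical names and values.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for z > 0, 0 for z < 0; returning 0 at z == 0 is a convention
    return (z > 0).astype(float)

def logistic_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.array([2.0])   # illustrative positive net input
n_layers = 10         # illustrative network depth

# Product of activation derivatives across the layers (weights ignored)
print("ReLU:    ", relu_grad(z) ** n_layers)
print("logistic:", logistic_grad(z) ** n_layers)
```

With these values the ReLU factor stays at 1, while the logistic factor shrinks to the order of $10^{-10}$. For negative inputs the ReLU derivative is 0, which is the "dying ReLU" issue noted in the table above.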

## Other Types of ReLU

For a good overview of ReLU variants (Leaky ReLU, PReLU, ELU, SELU, etc.), see:

[ReLU Activation Function and Its Variants (PythonKitchen)](https://www.pythonkitchen.com/relu-activation-function-and-its-variants/)
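
As one example of these variants, Leaky ReLU keeps a small slope for negative inputs so that the gradient never becomes exactly zero. The sketch below is a plain NumPy version; the function name `leaky_relu` and the slope `alpha=0.01` are assumptions (0.01 is a commonly used value, not one prescribed by the text).

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive inputs, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))
```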


## Loss function: Cross entropy

* Cross entropy loss, or log loss, measures the **performance of a classification model** whose output represents a **probability**, that is, a value between 0 and 1. 
* Cross entropy loss **increases** as the predicted probability **diverges** from the actual label. For example, predicting a probability of 0.017 when the actual label is 1 results in a **high** loss value.
* A perfect model would have a log loss of 0. 
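
The behaviour described above can be reproduced for a single prediction. This is a minimal sketch using the standard log loss formula for one sample; the function name `log_loss_single` and the list of probabilities are illustrative assumptions.

```python
import math

def log_loss_single(y_true, p):
    """Log loss of one sample with label y_true (0 or 1) and predicted
    probability p of class 1. Requires 0 < p < 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# True label is 1; the loss shrinks as p approaches 1
for p in (0.017, 0.5, 0.9, 0.999):
    print(p, round(log_loss_single(1, p), 3))
```

For the example in the text, $-\ln(0.017) \approx 4.07$, whereas a confident correct prediction such as $p = 0.999$ gives a loss close to 0. The probabilities $p = 0$ and $p = 1$ are excluded here because $\ln(0)$ is undefined.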



**Choice of cross entropy in Keras**

* Binary classification problems: use binary cross entropy (``binary_crossentropy`` in Keras)
* Multi-class classification problems: use categorical cross entropy (``categorical_crossentropy`` in Keras)

### Binary cross entropy

<img src="./images/binary_logloss.png" width="700"/>

<img src="./images/cross_entropy.png" width="900"/>

<img src="./images/categorical_logloss.png" width="600"/>

### Summary of Differences for Cross Entropy
- **Binary Classification**:
  - Two classes (0 or 1).
  - One predicted probability for class 1.
  - Loss is calculated for this single probability.

- **Multiclass Classification**:
  - More than two classes.
  - Multiple predicted probabilities (one for each class).
  - Loss is calculated over all classes; with one-hot labels, only the predicted probability of the true class contributes, so the model is penalized when that probability is low.
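
The two variants can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the Keras source: the function names, the clipping constant `eps`, the averaging over samples, and the example arrays are all choices made for this sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted P(class 1)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_cross_entropy(y_onehot, probas, eps=1e-12):
    """Mean log loss for one-hot labels and per-class probabilities."""
    probas = np.clip(probas, eps, 1.0)  # avoid log(0)
    # The one-hot mask keeps only the log-probability of the true class
    return -np.mean(np.sum(y_onehot * np.log(probas), axis=1))

# Binary case: one probability per sample
y_bin = np.array([1.0, 0.0, 1.0])
p_bin = np.array([0.9, 0.2, 0.6])
print("binary:     ", binary_cross_entropy(y_bin, p_bin))

# Multiclass case: one probability per class and sample
y_cat = np.array([[1, 0, 0],
                  [0, 1, 0]])
p_cat = np.array([[0.659, 0.242, 0.099],
                  [0.100, 0.800, 0.100]])
print("categorical:", categorical_cross_entropy(y_cat, p_cat))
```

With two classes the two functions agree: writing the binary problem as one-hot labels with columns $(1 - p, p)$ gives the same loss value.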