# Activation Functions

## Our Sources

* https://keras.io/api/layers/activations/
* https://www.statlearning.com/ (Chapter 10)
* https://en.wikipedia.org/wiki/Activation_function

## Introduction

Recall the equation given for a neural network layer (see Basics of Neural Networks for a reminder of its interpretation): 

$f(X) =\beta_0+\sum_{k=1}^{K}\beta_kh_k(X)$
<br>$ f(X)=\beta_0+\sum_{k=1}^{K}\beta_kg(w_{k0}+\sum_{j=1}^{p}w_{kj}X_j)$

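Recall that the activations are the terms $A_k$, obtained by applying the activation function $g$ to each hidden unit's linear combination of the inputs: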

$A_k = h_k(X) = g(w_{k0}+\sum_{j=1}^{p}w_{kj}X_{j})=g(z)$

The commonly used shorthand is $g(z)$.
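As a minimal NumPy sketch of the formulas above, each activation $A_k$ is computed from one row of weights and then combined into $f(X)$. All weights, biases, and inputs here are made-up illustration values, and ReLU is assumed for $g$:

```python
import numpy as np

def g(z):
    # ReLU is assumed here as the activation function g
    return np.maximum(0.0, z)

# Hypothetical example with p = 3 inputs and K = 2 hidden units
X = np.array([1.0, 2.0, 3.0])        # inputs X_1..X_p
W = np.array([[0.5, -1.0, 0.5],      # w_{1j}: weights for hidden unit k = 1
              [-0.5, 0.5, -1.0]])    # w_{2j}: weights for hidden unit k = 2
w0 = np.array([0.5, 1.0])            # biases w_{k0}

z = w0 + W @ X                       # z_k = w_{k0} + sum_j w_{kj} X_j
A = g(z)                             # A_k = g(z_k)

beta0 = 0.1                          # output intercept beta_0
beta = np.array([2.0, -3.0])         # output weights beta_k
f = beta0 + beta @ A                 # f(X) = beta_0 + sum_k beta_k A_k

print(z, A, f)
```

In this example the second hidden unit has a negative $z$, so its activation is 0 and it contributes nothing to $f(X)$.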

Activation functions are critical to neural networks. They serve two primary purposes:

1. They introduce nonlinearity; without them, the model would reduce to a linear function of the inputs.
2. They allow the model to capture complex nonlinear relationships and interactions between the variables.

In short, activation functions are what allow neural networks to produce nonlinear output. Activation functions can bring other properties as well, but these differ depending on which function is used.
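To illustrate why the nonlinearity matters, the sketch below uses small hypothetical weight matrices (biases omitted for brevity). Stacking two layers without an activation is equivalent to a single linear map, while placing ReLU between them gives a function that no single linear map can reproduce:

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # hypothetical hidden-layer weights
W2 = np.array([[1.0, 1.0]])    # hypothetical output-layer weights
x = np.array([2.0, 1.0])

# No activation: two stacked layers collapse to the single linear map W2 @ W1
no_activation = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x

# ReLU between the layers: the network now computes |x_1 - x_2|, which is nonlinear
with_relu = W2 @ np.maximum(0.0, W1 @ x)

print(no_activation, collapsed, with_relu)
```

Here both linear versions evaluate to 0, while the ReLU version evaluates to 1.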

## Activation Functions provided by `keras` (1 point)

There are a few major "families" of activation functions with desirable properties, including:

* \*-LU: Linear Unit (such as ReLU)
* Sigmoid
* Hyperbolic tangent (tanh)
* Softmax

There are certainly others, as activation functions are still an active area of research; however, these are the most common varieties you will see in use today.

`keras` provides all of these, in several varieties. Below we list some of the most commonly used activation functions and their use cases:

* ReLU (Rectified Linear Unit): Most commonly used. Good general purpose activation function. Can be computed and stored more efficiently than sigmoid (see https://dl.acm.org/doi/pdf/10.1145/3065386, under section 4.1).
* Sigmoid: Was the most common choice before ReLU. Also used in logistic regression to convert a linear function into a probability between 0 and 1.
* Softmax: Also used for (multinomial) logistic regression. Used when the output is categorical and you want a probability for each category. A classic example is using a neural network to classify images of animals: softmax produces a set of probabilities that sum to 1, one for each animal (label) the given test image might show.
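For intuition, the three functions listed above can be sketched in plain NumPy. These are illustrative re-implementations, not the `keras` source, and the input scores are made up:

```python
import numpy as np

def relu(z):
    # negative values become 0, non-negative values pass through
    return np.maximum(0.0, z)

def sigmoid(z):
    # squashes each value independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtracting the max does not change the result but avoids overflow in exp
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.0])  # hypothetical raw scores for 3 classes

print(relu(scores))
print(sigmoid(scores))
print(softmax(scores))
```

Note the difference between the last two: sigmoid treats each score on its own, while softmax normalizes across all scores so that the outputs sum to 1.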

In general, it's not a bad idea to stick to the Linear Unit family (ReLU, GELU, etc.) of activation functions (ReLU in particular) for hidden layers. When the output is categorical, softmax is the usual choice for the output layer. Despite the similar name, softplus is a smooth approximation of ReLU, not a softmax variant.

## Implementing Activation Functions in `keras` (2 points)

Adding activation functions in `keras` is a breeze. Below is a simple implementation of a neural network with a single hidden layer **that will have ReLU in the hidden layer and softmax for the output layer**. Notice that the output layer has 10 neurons; this means that the softmax activation function will produce probabilities for 10 categories (which sum to 1).

**For the cell below, type in the `keras` syntax for both of the activation functions**. Verify it's correct with https://keras.io/api/layers/activations/ if needed.

In [1]:
from keras.models import Sequential
from keras.layers import Input, Dense

In [7]:
hidden_layer_1_activation = 'relu'  # ReLU for the hidden layer
output_layer_activation = 'softmax'  # softmax for the 10-category output layer

In [None]:
#check the strings on these for points

In [8]:
model = Sequential()
model.add(Input(shape=(82,)))  # input layer: 82 input features
model.add(Dense(41, activation=hidden_layer_1_activation))  # hidden layer 1: 41 units
model.add(Dense(10, activation=output_layer_activation))  # output layer: 10 categories

## ReLU: Deep Dive (1 point)

The ReLU activation function is the most commonly used because of its efficient computational properties (relative to the previous standard, sigmoid). 

The piecewise function is defined as

$ g(z) = \left\{\begin{array}{ll}0 & \text{if} \ z<0 \\z & \text{otherwise} \\\end{array} \right. $

It's a rather simple function that works incredibly well. One thing to point out is that it thresholds at $z=0$; it's important to recall that the bias $w_{k0}$ still shifts $z$, so the threshold in terms of the inputs $X_j$ need not be at 0. Recall the equation for a neural network with a single hidden layer:

$ f(X)=\beta_0+\sum_{k=1}^{K}\beta_kg(w_{k0}+\sum_{j=1}^{p}w_{kj}X_j)$

We pointed out in the introduction that 

$g(w_{k0}+\sum_{j=1}^{p}w_{kj}X_{j})=g(z)$

So $z$ here is the value of the expression $w_{k0}+\sum_{j=1}^{p}w_{kj}X_{j}$. Recall that $w_{kj}$ is the weight connecting input $j$ to hidden unit $k$, and $p$ is the total number of predictors/inputs. $w_{k0}$ is the bias (intercept) for hidden unit $k$, and the summation is the weighted sum of all the inputs.
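To make the role of the bias concrete, here is a small sketch with a single input and a single hidden unit (the weight and bias values are hypothetical). With $w_{k1}=1$ and $w_{k0}=-2$, the unit stays at 0 until the input exceeds 2, even though ReLU itself thresholds at $z=0$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

w_k0 = -2.0   # hypothetical bias for the hidden unit
w_k1 = 1.0    # hypothetical weight on the single input X_1

X1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = w_k0 + w_k1 * X1   # z = w_{k0} + w_{k1} X_1
A = relu(z)            # activation of the hidden unit

print(z)
print(A)
```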

There are many other varieties of the Linear Unit family of functions, such as GELU ("Gaussian Error Linear Units"; source: https://arxiv.org/pdf/1606.08415.pdf), which is used in models such as BERT and the GPT family. ReLU is the "tried and true" gold standard, and the other Linear Unit functions often attempt to improve on it.

Implement ReLU with options (see https://keras.io/api/layers/activations/#relu-function).

In [None]:
relu_model = Sequential()
relu_model.add(Input(shape=(129,))) #input layer
relu_model.add(Dense(13, activation=hidden_layer_1_activation)) #hidden layer 1
relu_model.add(Dense(1, activation='sigmoid')) #output layer
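
The linked `keras` documentation describes optional parameters that generalize ReLU. The sketch below re-implements that behavior in NumPy for illustration only. The function name `relu_with_options` is hypothetical, and the parameter names (`negative_slope`, `max_value`, `threshold`) are assumed to mirror the `keras` options; they may differ between `keras` versions, so check the linked page before relying on them.

```python
import numpy as np

def relu_with_options(z, negative_slope=0.0, max_value=None, threshold=0.0):
    """Illustrative generalized ReLU (a NumPy sketch, not the keras implementation)."""
    z = np.asarray(z, dtype=float)
    # at or above the threshold: pass z through; below it: scaled linear piece
    out = np.where(z >= threshold, z, negative_slope * (z - threshold))
    if max_value is not None:
        # cap large activations at max_value
        out = np.minimum(out, max_value)
    return out

z = np.array([-2.0, -0.5, 0.0, 3.0, 10.0])
standard = relu_with_options(z)                    # plain ReLU
leaky = relu_with_options(z, negative_slope=0.1)   # small slope for z < 0
capped = relu_with_options(z, max_value=6.0)       # outputs clipped at 6
```

With the defaults the function reduces to the piecewise definition given earlier; a nonzero `negative_slope` lets a small signal through for negative $z$, and `max_value` bounds the activation from above.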

## Softmax: Deep Dive (1 point)

Explore the ins and outs of softmax, when it gets used, etc.