# Explore Activation Functions

In this notebook, we'll use the MNIST dataset to explore activation functions.

In [None]:
# Keras helper for loading the MNIST data
# NumPy for array handling
from keras.datasets import mnist
import numpy as np

## Let's start with the MNIST Dataset again!

In [None]:
## mnist.load_data() will automatically download the dataset if you don't have it
(MNIST_train_X, MNIST_train_y), (MNIST_test_X, MNIST_test_y) = mnist.load_data()

### A bit of preprocessing on the data before we can train (teach) our network

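For today, you can skip the details: we flatten each 28x28 image into a vector of 784 values, scale the pixels to the range 0 to 1, and one-hot encode the labels. Consider it a "necessary evil". It's really not evil, but it is necessary.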

In [None]:
MNIST_train_X = MNIST_train_X.reshape((60000, 28 * 28))
MNIST_train_X = MNIST_train_X.astype('float32') / 255

MNIST_test_X = MNIST_test_X.reshape((10000, 28 * 28))
MNIST_test_X = MNIST_test_X.astype('float32') / 255

from keras.utils import to_categorical

MNIST_train_y = to_categorical(MNIST_train_y)
MNIST_test_y = to_categorical(MNIST_test_y)
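
If you are curious what the preprocessing does, here is a minimal NumPy-only sketch on a made-up batch. The names `fake_images` and `fake_labels` are hypothetical stand-ins for the real MNIST arrays, and the `np.eye` trick only approximates what `to_categorical` produces.

```python
import numpy as np

# Hypothetical stand-in for a tiny batch of MNIST:
# 3 grayscale images of 28x28 pixels, stored as integers from 0 to 255.
fake_images = np.random.randint(0, 256, size=(3, 28, 28), dtype=np.uint8)
fake_labels = np.array([5, 0, 9])

# Flatten each 28x28 image into one row of 784 values.
flat = fake_images.reshape((3, 28 * 28))

# Rescale pixel values from the range 0..255 to the range 0..1.
scaled = flat.astype('float32') / 255

# One-hot encode the labels by hand: row i has a 1 in column fake_labels[i].
one_hot = np.eye(10, dtype='float32')[fake_labels]

print(scaled.shape)  # (3, 784)
print(one_hot[0])    # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```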

In [None]:
from keras import models
from keras import layers

# Same old topology, minus some activation functions

So, now we can mess around with the activation functions. We know that `relu` introduces nonlinearities.
Let's do a few simple experiments:

* try *removing* the `relu` function on the first layer (set `activation=None`)
* remove the `softmax` function on the final layer
* try `sigmoid` instead of `relu` (with `softmax`)
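
To see why removing the activation matters, here is a small NumPy sketch (not Keras). It uses tiny made-up shapes instead of 784 -> 512 -> 10, and the weights are random values chosen only for illustration. It shows that two stacked layers without an activation compute the same thing as a single linear layer, while putting `relu` in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small shapes standing in for the real network.
x = rng.normal(size=(4, 6))    # 4 samples, 6 features each
W1 = rng.normal(size=(6, 5))   # first layer weights
b1 = rng.normal(size=5)        # first layer bias
W2 = rng.normal(size=(5, 3))   # second layer weights
b2 = rng.normal(size=3)        # second layer bias

# Two Dense layers with activation=None (purely linear).
two_layers = (x @ W1 + b1) @ W2 + b2

# A single linear layer whose weights and bias are the merged versions.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True

# With relu between the layers, the network is no longer one linear map.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```

In other words, without a nonlinearity the extra layer adds parameters but no extra expressive power.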



In [None]:
network = models.Sequential() #we'll stick to sequential for this course

# we can adjust the activation functions below.
network.add(layers.Dense(512, activation='sigmoid', input_shape=(784,)))  # Dense is the same as fully connected.
network.add(layers.Dense(10, activation=None))  # None means no activation (raw linear output); try 'softmax' here

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',# change this parameter here.
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)


test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

# Record your results here

None + Softmax : 

None + None : 

Sig + Softmax : 

Sig + None  : 

# Why does None + Softmax do so much better than None + None?

your answer here:

# Regression - no final layer activation.

In the next section of the course we will begin working on a regression problem. Regression is when we want to predict a continuous value, like the price of a house. In this case we would not want to use a final layer activation like sigmoid or softmax, because it would "squish" every output into the range between 0 and 1. So keep this in mind: when your objective changes, you need to change your topology to match it.
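
Here is a quick NumPy sketch of that "squishing". The `sigmoid` and `softmax` functions below are hand-written for illustration (they are not the Keras implementations), and `raw` is a made-up vector of final-layer outputs.

```python
import numpy as np

def sigmoid(z):
    # Maps every value into the range between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max first for numerical stability; the result is unchanged.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

# Hypothetical raw outputs of a final layer with no activation.
raw = np.array([-3.0, 0.0, 2.5, 40.0])

print(sigmoid(raw))        # every value lies between 0 and 1
print(softmax(raw))        # every value lies between 0 and 1
print(softmax(raw).sum())  # and together they sum to 1 (up to rounding)
```

A house price like 40.0 (in whatever units) can never come out of either function, which is why a regression network leaves the final layer without an activation.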

# Further Experiment

Keras has a number of different activation functions.  Go ahead and try some others:

https://keras.io/activations/

And try mixing it up with multiple layers.  Can you find a way to get a higher accuracy score?
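
Before picking one, it can help to look at what a few of them do to the same inputs. This is a NumPy sketch of four activations that Keras accepts by name (`'relu'`, `'tanh'`, `'elu'`, `'softplus'`). The formulas are written by hand for illustration, with `elu` assuming alpha = 1, and `z` is a made-up input vector.

```python
import numpy as np

# Hypothetical pre-activation values, from negative to positive.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

activations = {
    'relu': np.maximum(z, 0),                    # zero for negatives, identity otherwise
    'tanh': np.tanh(z),                          # squashes into the range -1 to 1
    'elu': np.where(z > 0, z, np.exp(z) - 1),    # smooth negative tail (alpha = 1 assumed)
    'softplus': np.log1p(np.exp(z)),             # smooth approximation of relu
}

for name, values in activations.items():
    print(name, np.round(values, 3))
```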

In [None]:
network = models.Sequential() #we'll stick to sequential for this course

#input layer
network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  # Dense is the same as fully connected.

# let's add more layers -- start with 'relu', then swap in other activation functions
network.add(layers.Dense(512, activation='relu'))
network.add(layers.Dense(284, activation='relu'))
network.add(layers.Dense(128, activation='relu'))

#output layer
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)

test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

# Record your results 
