# Artificial Neural Networks with Keras

FINALLY - lets goooo - Neural Networks, I've come to bargain!


In [1]:
import numpy as np
import pandas as pd
 
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

np.random.seed(24)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

ModuleNotFoundError: No module named 'pandas'

Things to look into:

- Callbacks
- Early Stopping
- TensorBoard

## From biological to artificial neurons

Why this time ANNs might be good:
- huge quantity of data available, and ANNs outperform other ML algorithms on complex problems
- huge increase in computing power since the 90s - Moore's law, plus the gaming industry driving powerful GPU cards
- better training algorithms
- theoretical limitations of ANNs have turned out to be benign in practice
- virtuous circle of funding and progress


#### The perceptron

Based on a threshold logic unit (TLU) - inputs and outputs are numbers (instead of binary on/off values) and each input connection has a weight.

The TLU computes a weighted sum of its inputs, then applies a step function to that sum:
$h_{\mathbf{w}}(\mathbf{x}) = \operatorname{step}(z)$, where $z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = \mathbf{x}^\top \mathbf{w}$
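
To make the weighted sum and step function concrete, here is a minimal plain-Python sketch of a single TLU. The bias term `b` and the AND-gate weights are assumptions added for illustration; they are not from the notes above.

```python
def step(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def tlu(x, w, b=0.0):
    """Threshold logic unit: step function applied to the weighted sum of the inputs."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # z = x^T w (+ assumed bias)
    return step(z)

# Hypothetical weights and bias that make the unit behave like a logical AND
w_and, b_and = [1.0, 1.0], -1.5
outputs = [tlu([x1, x2], w_and, b_and) for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) pushes the weighted sum above the threshold, so only that input yields the positive class.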

A TLU is similar to logistic regression or a linear SVM: it computes a linear combination of the inputs and, if the result exceeds a threshold, outputs the positive class, otherwise the negative class. A single TLU can be used for simple linear binary classification.

A perceptron has a single layer of TLUs, each connected to all the inputs. When all neurons in a layer are connected to every neuron in the previous layer, it's called a fully connected layer or dense layer.

#### How to train perceptron:

Inspired by Hebb's rule - the connection weight between two neurons tends to increase when they fire simultaneously. Perceptron training uses a variant of this rule that takes the network's prediction error into account and reinforces the connections that help reduce it. The perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that made a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.
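
The update described above is usually written as w_i ← w_i + η (y − ŷ) x_i. Below is a minimal plain-Python sketch of that rule, assuming a single output neuron, a bias term, a learning rate `eta` of 1, and a toy AND dataset. All names here are illustrative, and this is not scikit-learn's `Perceptron` implementation.

```python
def predict(x, w, b):
    """Step function applied to the weighted sum of the inputs plus the bias."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else 0

def train_perceptron(data, n_inputs, eta=1.0, n_epochs=20):
    """Perceptron learning rule: one instance at a time, update only on errors."""
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(n_epochs):
        for x, y in data:
            error = y - predict(x, w, b)  # 0 if correct, +1 or -1 if wrong
            w = [w_i + eta * error * x_i for w_i, x_i in zip(w, x)]
            b += eta * error
    return w, b

# Toy linearly separable dataset (logical AND), chosen only for illustration
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_data, n_inputs=2)
predictions = [predict(x, w, b) for x, _ in and_data]
print(predictions)  # [0, 0, 0, 1]
```

When a prediction is correct the error is 0 and nothing changes. When it is wrong, only the weights of active inputs (x_i ≠ 0) are nudged toward the correct answer.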

Output neuron decision boundaries are linear, so perceptrons can't learn complex patterns (just like logistic regression classifiers).


Many had high expectations for perceptrons. The perceptron learning algorithm strongly resembles stochastic gradient descent, but perceptrons are incapable of solving some trivial problems (the classic example is XOR), as is true of any other linear classification model. Many, disappointed, dropped neural networks altogether.

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

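iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
# np.int was removed in NumPy 1.24; the built-in int does the same job
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)

# Predict the class of a flower with petal length 2 and petal width 0.5
per_clf.predict([[2, 0.5]])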

### Multi-layer Perceptron and Backpropagation

An MLP contains one input layer, one or more layers of TLUs (hidden layers), and one final layer of TLUs called the output layer. Every layer except the output layer contains a bias neuron and is fully connected to the next layer.

Signal flows in one direction - example of feedforward neural network (FNN)

Backpropagation (1986) is gradient descent using an efficient technique for computing the gradients automatically. It can find out how each connection weight and bias term should be tweaked in order to reduce the error. Once the gradients are found, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

Computing gradients automatically is called automatic differentiation, or autodiff. Backpropagation uses reverse-mode autodiff, which is fast and precise and well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss).


<br>
<font color='green'><b>BACKPROPAGATION:</b></font>

<font color='blue'> 
<br>
    
- Handles one mini-batch at a time and goes through the full training set multiple times - each pass is an epoch
- Each mini-batch is passed from one layer to the next - the forward pass - the same as making predictions, except intermediate results are kept for the backward pass
- The algorithm measures the output error - it uses a loss function comparing the desired output with the actual output
- It computes how much each output connection contributed to that error - by applying the chain rule
- It measures how much of these error contributions came from each connection in the layer below - using the chain rule again, until it reaches the input layer. This measures the error gradient across all connection weights by propagating the error gradient backward through the network.
- The algorithm performs a Gradient Descent step, tweaking all connection weights in the network, using the error gradients just computed.
</font>

<br>
Summary:

- Prediction for each training instance
- Measures error
- Goes back to layers in reverse measuring the error contribution from each connection
- Tweaks the connection weights to reduce error
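
The four steps in the summary can be traced by hand on a tiny network. The sketch below assumes one input, one hidden neuron with a logistic activation, one linear output, and a squared-error loss; the parameter values are arbitrary. It computes the gradients with the chain rule, compares them with finite differences, then takes one Gradient Descent step. It illustrates the idea only and is not how Keras implements it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: returns the hidden activation and the prediction."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = w2 * h + b2
    return h, y_hat

def loss_fn(params, x, y):
    _, y_hat = forward(params, x)
    return (y_hat - y) ** 2

def backward(params, x, y):
    """Backward pass: apply the chain rule from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = 2.0 * (y_hat - y)   # dL/dy_hat
    d_w2 = d_yhat * h            # output connection's contribution
    d_b2 = d_yhat
    d_h = d_yhat * w2            # error propagated to the hidden neuron
    d_z = d_h * h * (1.0 - h)    # derivative of the logistic function is h * (1 - h)
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, -0.3, 0.8, 0.1]   # arbitrary starting values
x, y = 1.5, 2.0                  # a single made-up training instance
grads = backward(params, x, y)

# Sanity check: central finite differences should agree with the chain rule
eps = 1e-6
numeric = []
for i in range(len(params)):
    plus, minus = params.copy(), params.copy()
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss_fn(plus, x, y) - loss_fn(minus, x, y)) / (2 * eps))

# One Gradient Descent step using the gradients just computed
learning_rate = 0.1
new_params = [p - learning_rate * g for p, g in zip(params, grads)]
print(loss_fn(params, x, y), "->", loss_fn(new_params, x, y))
```

The finite-difference check is a common way to validate a hand-written backward pass. Real frameworks rely on reverse-mode autodiff rather than numerical differences.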


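<b>Must randomly initialize the weights.</b> If all weights start out equal, every neuron in a layer receives the same updates and they all stay identical.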

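For backpropagation to work, the step function of the MLP was replaced with the logistic function (one kind of sigmoid function). The step function contains only flat segments, so it has no gradient to work with, while the logistic function has a well-defined nonzero derivative everywhere.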

Other activation functions:
- Hyperbolic tangent function - S-shaped, continuous and differentiable - output ranges from -1 to 1, which tends to center each layer's output around 0 at the start of training and often speeds up convergence.
- Rectified Linear Unit (ReLU) function - continuous but not differentiable at 0 - fast to compute - has no maximum output value, which helps reduce some issues during Gradient Descent.
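
For reference, the three activation functions mentioned so far can be written in a few lines of plain Python. This is only a sketch for comparing their output ranges; in practice Keras is told which one to use by name (e.g. `activation="relu"`).

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: S-shaped like the logistic, but output in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  logistic={logistic(z):.3f}  tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```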


Why we have activation functions:

If a chain contains only linear transformations, like f(x) = 3x + 2 and g(x) = 10x, then the result is still linear: f(g(x)) = 30x + 2. Without non-linearity between layers, a stack of layers is equivalent to a single layer.
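
A quick check of this claim, using the same f and g as in the text. The ReLU variant at the end is an added illustration of how inserting a non-linearity breaks the equivalence.

```python
def f(x):
    return 3 * x + 2

def g(x):
    return 10 * x

def chained(x):
    """Two linear 'layers' stacked on top of each other."""
    return f(g(x))

def single(x):
    """The one linear 'layer' the stack collapses to."""
    return 30 * x + 2

def relu(z):
    return max(0, z)

def chained_with_relu(x):
    """Same stack, but with a ReLU between the two layers."""
    return f(relu(g(x)))

print([chained(x) for x in (-1, 0, 1)])            # [-28, 2, 32]
print([single(x) for x in (-1, 0, 1)])             # [-28, 2, 32]
print([chained_with_relu(x) for x in (-1, 0, 1)])  # [2, 2, 32]
```

Any function of the form ax + b satisfies h(-1) + h(1) = 2·h(0). The ReLU version gives 34 on the left and 4 on the right, so it can no longer be replaced by a single linear layer.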

### Regression MLPs

A single prediction only needs a single output neuron. For multivariate regression (predicting multiple values at once), one output neuron is required per output dimension. E.g., locating the centre of an object in an image requires 2 output neurons (x, y coordinates).

For regression, the output neurons usually need no activation function, so they are free to output any range of values. The softplus activation function makes sure outputs are always positive, and the logistic or hyperbolic tangent function guarantees that outputs fall within a given range, provided the labels are scaled to the appropriate range (0 to 1 for logistic, -1 to 1 for tanh).

The loss function is typically the mean squared error (MSE). The mean absolute error (MAE) is preferred when there are lots of outliers. The Huber loss is a combination of both: quadratic for small errors, linear for large ones.
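
As a rough illustration of how the three losses treat an outlier, here is a plain-Python sketch for a single prediction. The threshold `delta=1.0` is an assumed default for this sketch, and the error values are made up.

```python
def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2

def absolute_error(y_true, y_pred):
    return abs(y_true - y_pred)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic when |error| <= delta, linear beyond it."""
    error = abs(y_true - y_pred)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

# A small error and an outlier-sized error
for err in (0.5, 3.0):
    print(err, squared_error(0.0, err), absolute_error(0.0, err), huber(0.0, err))
# 0.5 -> 0.25, 0.5, 0.125
# 3.0 -> 9.0, 3.0, 2.5
```

For the large error the Huber loss grows linearly like the MAE instead of quadratically like the MSE, which is why it is less sensitive to outliers.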

In the tables below, # means "number of".

| Hyperparameter | Typical Value |
| :-- | :-- |
| # input neurons | One per input feature (e.g., 28 x 28 = 784 for MNIST) |
| # hidden layers | Depends on the problem. Typically 1 to 5. |
| # neurons per hidden layer | Depends on the problem. Typically 10 to 100. |
| # output neurons | 1 per prediction dimension |
| Hidden activation | ReLU (or SELU, see Chapter 11) |
| Output activation | None or ReLU/Softplus (if positive outputs) or Logistic/Tanh (if bounded outputs) |
| Loss function | MSE or MAE/Huber (if outliers) |


### Classification MLPs

For binary classification, a single output neuron using the logistic activation function is enough: its output estimates the probability of the positive class. MLPs also handle multilabel binary classification tasks well, just by using one output neuron per label.

Multiclass classification is when each instance belongs to exactly one class. Use one output neuron per class and the softmax activation function for the whole output layer, so the estimated probabilities sum to 1.

Since the model predicts probability distributions, cross-entropy is generally a good choice for the loss function. <br>


| Hyperparameter | Binary classification | Multilabel binary classification | Multiclass classification |
| :-- | :-- | :-- | :-- |
| Input and hidden layers | Same as regression | Same as regression | Same as regression |
| # output neurons | 1 | 1 per label | 1 per class |
| Output layer activation | Logistic | Logistic | Softmax |
| Loss function | Cross-Entropy | Cross-Entropy | Cross-Entropy |
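
To see how softmax and cross-entropy fit together for the multiclass case, here is a small plain-Python sketch. The raw scores (logits) are made up, and the functions are simplified stand-ins for what the Keras output layer and loss compute.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Cross-entropy for one instance with a sparse (class index) label."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
print([round(p, 3) for p in probs])
print(cross_entropy(probs, 0))  # small loss: the true class got the highest probability
print(cross_entropy(probs, 2))  # larger loss: the true class got the lowest probability
```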

# Implementing MLPs with Keras

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
keras.backend.clear_session()
np.random.seed(24)
tf.random.set_seed(24)

## Image Classifier - Sequential API

In [None]:
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Image represented as 28*28 array instead of 1D array of 784
X_train_full.shape

In [None]:
# We are using Gradient Descent, so must scale input features. For simplicity scaling the pixel intensities down
# to 0~1 range by division by 255.0.

# Validation Sets
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']

class_names[y_train[0]]

In [None]:
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(class_names[y_train[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

In [None]:
# Creating model using the Sequential API

# A Sequential model is the simplest kind: a single stack of layers connected sequentially
model = keras.models.Sequential()
# Input: Flatten converts each 28x28 image into a 1D array (simple preprocessing);
# keras.layers.InputLayer could also be added as the first layer
model.add(keras.layers.Flatten(input_shape=[28, 28]))
# Two hidden layers
model.add(keras.layers.Dense(300, activation='relu'))
model.add(keras.layers.Dense(100, activation='relu'))
# Output: 10 neurons, one per class, with softmax because the classes are exclusive
model.add(keras.layers.Dense(10, activation='softmax'))

# Alternative method

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

# If using standalone Keras (keras.io) instead of tf.keras, the imports need to change

In [None]:
model.summary()

The first hidden layer has 784 * 300 connection weights plus 300 bias terms, adding up to 235,500 parameters. That gives the model a lot of flexibility to fit the training data, but it also means the model risks overfitting, especially when there's not much training data.
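
The same arithmetic extends to every Dense layer in this model. A small sketch (the helper name `dense_params` is made up) that should match the per-layer counts `model.summary()` reports for the 784-300-100-10 architecture:

```python
def dense_params(n_inputs, n_units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [28 * 28, 300, 100, 10]  # flattened input, two hidden layers, output
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [235500, 30100, 1010]
print(sum(counts))  # 266610
```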

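It is best to specify the input shape up front. Otherwise Keras waits until it knows the input shape (when it sees actual data or when build() is called) before creating the weights, and until then things like summary() or saving the model are not possible.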

In [None]:
model.layers

In [None]:
model.layers[1].name

In [None]:
model.get_layer('dense_3').name

In [None]:
weights, biases = model.layers[1].get_weights()

print(weights.shape)
weights

In [None]:
print(biases.shape)
biases

The Dense layer initialized the connection weights randomly (as required to break symmetry), and the biases were initialized to zeros. To use a different initialization method, set kernel_initializer or bias_initializer when creating the layer.

In [None]:
### Compiling the Model

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy']) # Can also specify extra metrics to compute during training/evaluation

# Full list of losses, optimizers and metrics
# https://keras.io/api/losses/
# https://keras.io/api/optimizers/
# https://keras.io/api/metrics/

We use the sparse categorical cross-entropy loss since we have sparse labels (for each instance there is just a target class index, from 0 to 9) and the classes are exclusive. We can use keras.utils.to_categorical to convert sparse labels to one-hot vector labels. To go the other way, use np.argmax() with axis=1.
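
The difference between the two label formats is easy to see on a toy example. The helpers below are plain-Python stand-ins written for this note. They mimic what `keras.utils.to_categorical` and `np.argmax(..., axis=1)` do on small lists, and are not those functions themselves.

```python
def to_one_hot(labels, num_classes):
    """Sparse class indices -> one-hot rows."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

def to_sparse(one_hot_rows):
    """One-hot rows -> sparse class indices (index of the largest value per row)."""
    return [row.index(max(row)) for row in one_hot_rows]

sparse_labels = [2, 0, 1]
one_hot = to_one_hot(sparse_labels, num_classes=3)
print(one_hot)             # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(to_sparse(one_hot))  # [2, 0, 1]
```

With one-hot labels the matching Keras loss would be `categorical_crossentropy` rather than `sparse_categorical_crossentropy`.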

For the optimizer, sgd means we train the model using simple Stochastic Gradient Descent - Keras will perform backpropagation (reverse-mode autodiff + Gradient Descent). There are more efficient optimizers.

Since it's a classifier, it's useful to measure accuracy during training and evaluation.

In [None]:
history = model.fit(X_train, y_train, epochs=30,
                    validation_data = (X_valid, y_valid))

# Here 1719 is not the number of training samples but the number of batches per epoch. The batch size defaults to 32, and 55000 / 32 = 1718.75, rounded up to 1719.

NEURAL NETWORK TRAINED!

At each epoch Keras displays the progress so far, the mean training time, and the loss and accuracy. With more epochs the training loss decreases, and validation accuracy reaches 95%! (8% higher than in the book).

Instead of passing a validation set with validation_data, you can set validation_split to the fraction of the training set that Keras should hold out for validation.
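
The 1719 steps per epoch noted in the training cell follow from the training set size and the default batch size. A quick check in plain Python:

```python
import math

n_train = 60000 - 5000  # Fashion MNIST training images minus the 5000 held out for validation
batch_size = 32         # the default batch size in fit()

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)  # 1719
print(last_batch_size)  # 24
```

The last batch of each epoch is smaller than the others because 55000 is not a multiple of 32.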

In [None]:
history.params

In [None]:
import pandas as pd

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)
plt.show()

In [None]:
model.evaluate(X_test, y_test)

In [None]:
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)

In [None]:
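# predict_classes() is no longer available in recent TensorFlow versions;
# take the argmax of the predicted probabilities instead
y_pred = np.argmax(model.predict(X_new), axis=-1)
np.array(class_names)[y_pred]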

## Regression MLP - Sequential API

All of the writing here was lost because Jupyter Notebook did not autosave. See the book for the details.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])
model.compile(loss="mean_squared_error", optimizer=keras.optimizers.SGD(learning_rate=1e-3))
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)

In [None]:
plt.plot(pd.DataFrame(history.history))
plt.grid(True)
plt.gca().set_ylim(0, 1)
plt.show()

In [None]:
y_pred

## Functional API

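For more complex topologies, such as a Wide & Deep network where the inputs reach the output both directly and through a stack of hidden layers.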

In [None]:
keras.backend.clear_session()
np.random.seed(24)
tf.random.set_seed(24)

In [None]:
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_, hidden2])
output = keras.layers.Dense(1)(concat)

model = keras.models.Model(inputs=[input_], outputs=[output])

model.summary()

In [None]:
model.compile(loss="mean_squared_error", optimizer=keras.optimizers.SGD(learning_rate=1e-3))
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
y_pred = model.predict(X_new)

What if you want to send different subsets of input features through the wide and deep paths? We will send 5 features through the wide path (features 0 to 4), and 6 features through the deep path (features 2 to 7). Note that 3 features will go through both (features 2, 3 and 4).

In [None]:
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)

model = keras.models.Model(inputs=[input_A, input_B], outputs=[output])

In [None]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]

history = model.fit((X_train_A, X_train_B), y_train, epochs=20,
                    validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))

Auxiliary output for regularization:

In [None]:
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)

model = keras.models.Model(inputs=[input_A, input_B],
                           outputs=[output, aux_output])

In [None]:
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer=keras.optimizers.SGD(learning_rate=1e-3))

In [None]:
history = model.fit([X_train_A, X_train_B], [y_train, y_train], epochs=20,
                    validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))

In [None]:
total_loss, main_loss, aux_loss = model.evaluate(
    [X_test_A, X_test_B], [y_test, y_test])

y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])

## Dynamic Models - Subclassing API

Back on track again.

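Both the Sequential and Functional APIs are declarative: you first declare which layers to use and how they connect, and only then feed the model data.
Advantages: easily saved/shared, structure easily displayed, errors can be caught early, easy to debug, etc.
But the model is static, which rules out loops, varying shapes, conditional branching, and other dynamic behaviors.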

In [None]:
class WideAndDeepModel(keras.models.Model):
    
    def __init__(self, units=30, activation='relu', **kwargs):
        super().__init__(**kwargs)  # handles standard arguments (e.g., name)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)
        
    def call(self, inputs):
        input_A, input_B = inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        main_output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        
        return main_output, aux_output
    
model = WideAndDeepModel()


# The call() method can contain almost anything (loops, conditionals, ...), which makes this API great for experimentation.
# However, the model is harder to inspect, save, or clone, and Keras cannot check types and shapes ahead of time, so mistakes are easier to make.
    

## Saving/Restoring model

model.save("model_name")

model = keras.models.load_model("model_name")

If a model takes a long time to train, don't just save it at the end; also save checkpoints at regular intervals during training. Use callbacks:


### Using Callbacks

The fit() method accepts a callbacks argument: a list of objects that Keras will call at the start and end of training, at the start and end of each epoch, and even before and after processing each batch.
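
To illustrate the idea of hooks being called at the end of each epoch, here is a plain-Python sketch of a training loop with an early-stopping style callback. `EarlyStopper`, `run_training`, and the list of validation losses are all hypothetical stand-ins for illustration. Keras ships its own `keras.callbacks.EarlyStopping`, and this is not its implementation.

```python
class EarlyStopper:
    """Hypothetical callback: request a stop when the validation loss
    has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = None
        self.wait = 0
        self.stop = False

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best:
            # New best model so far: remember it and reset the counter
            self.best, self.best_epoch, self.wait = val_loss, epoch, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stop = True

def run_training(val_losses, callbacks):
    """Stand-in for fit(): replays precomputed validation losses, one per epoch,
    and calls every callback at the end of each epoch."""
    epochs_run = 0
    for epoch, val_loss in enumerate(val_losses):
        epochs_run += 1
        for cb in callbacks:
            cb.on_epoch_end(epoch, val_loss)
        if any(cb.stop for cb in callbacks):
            break
    return epochs_run

stopper = EarlyStopper(patience=2)
epochs_run = run_training([0.9, 0.7, 0.6, 0.65, 0.62, 0.5], [stopper])
print(epochs_run, stopper.best_epoch, stopper.best)  # 5 2 0.6
```

Training stops after the fifth epoch because the loss failed to beat 0.6 twice in a row, even though the made-up sequence would have improved again later. That is the trade-off a larger `patience` is meant to soften.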

In [None]:
keras.backend.clear_session()
np.random.seed(24)
tf.random.set_seed(24)

In [None]:
# Build a small regression model to train
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[8]),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1)
])    

In [None]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)

In [None]:
# Saving and reloading
model.save("my_keras_model.h5")
model = keras.models.load_model("my_keras_model.h5")

In [None]:
# Doing the same thing, but with callbacks:

keras.backend.clear_session()
np.random.seed(24)
tf.random.set_seed(24)

In [None]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb])

model = keras.models.load_model("my_keras_model.h5") # rollback to best model
mse_test = model.evaluate(X_test, y_test)

# Another way of doing it is with early stopping (keras.callbacks.EarlyStopping). Custom callbacks can also be written.

## Visualization Using TensorBoard

View learning curves during training, compare learning curves between runs, visualize the computation graph, etc.

Modify the program so it outputs the data you want to visualize to special binary log files. Point the TensorBoard server to a root log directory, and configure the program to write to a different subdirectory each time it runs. This way you can visualize and compare data from multiple runs.

In [None]:
# Defining root log directory - each run's subdirectory is named after the current time, so it is always different
import os
import time

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    run_id = time.strftime('run_%Y_%m_%d-%H_%M_%S')
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()

# TensorBoard callback

# After building/compiling model
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])


# USEFUL THING TO DO - look into this later

In [None]:
# Fine-Tuning Hyperparameters

# Approach 1 - try many hyperparameter combinations and keep the one that works best on the validation set - GridSearch

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    options = {"input_shape": input_shape}
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu", **options))
        options = {}
    model.add(keras.layers.Dense(1, **options))
    optimizer = keras.optimizers.SGD(learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

In [None]:
# The rest of this cell was lost because the notebook did not save.