In [None]:
# Package imports
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

# Neural Networks - Deep Learning

Neural networks are loosely inspired by the brain: neurons are units that each hold a value, and they are connected by functions that together form the network structure.

As with the other machine learning methods mentioned earlier, a model must be trained on data before it can be used for inference on new data. Again, the data can be labelled or unlabelled, for supervised and unsupervised tasks respectively.

To explore deep learning we will use the task of image classification: a deep neural network will classify handwritten digits from the MNIST database, which contains 60,000 training images and 10,000 test images. First we load the dataset.

#### Load MNIST Handwritten Digits Dataset

Here the inputs to the network are the N x N pixel images, with N=28, and the output layer consists of 10 neurons, each representing one of the digits 0 to 9. A trained network should be able to take an input image, pick out what within the image characterises a digit, and report which digit it represents.

In [None]:
# Import the MNIST dataset
# The MNIST data is split between 60,000 28 x 28 pixel training images and 10,000 28 x 28 pixel test images
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

#### Visualise subset of the dataset

In [None]:
plt.figure(figsize=(9,9))

for i in range(9):
    plt.subplot(3,3,i+1)
    num = np.random.randint(0, len(X_train))
    plt.imshow(X_train[num], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[num]))

#### Format Data for Machine Learning

In [None]:
X_train = X_train.astype('float32')/255  # Normalising data and setting data type to float
X_test = X_test.astype('float32')/255    # Normalising data and setting data type to float

nb_classes = 10 # number of classes, representing the number of unique digits

Y_train = keras.utils.to_categorical(y_train, nb_classes) 
Y_test = keras.utils.to_categorical(y_test, nb_classes)
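
To make the formatting step concrete, here is a minimal NumPy sketch of the two operations on made-up data. The names `example_labels` and `example_image` are hypothetical stand-ins, not part of MNIST, and the identity-matrix trick is only one way to build the same kind of one-hot array that `keras.utils.to_categorical` returns.

```python
import numpy as np

# Hypothetical stand-ins for the real data: three labels and one 2 x 2 "image".
example_labels = np.array([5, 0, 9])
example_image = np.array([[0, 255], [51, 102]], dtype=np.uint8)

# Scaling: divide by 255 so that every pixel value lies in [0, 1].
scaled_image = example_image.astype('float32') / 255

# One-hot encoding: row i of the identity matrix is the encoding of class i.
nb_classes = 10
one_hot = np.eye(nb_classes, dtype='float32')[example_labels]

print(scaled_image)
print(one_hot)
```

Each row of `one_hot` contains a single 1 at the position of the digit it encodes, which is the format the 10-neuron output layer is compared against.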

How would a neural network do this?

## Feed Forward Neural Network

Basic network structures have an input layer, whose size corresponds to the "dimension" of the incoming data, and an output layer that is specific to the task. For a regression task this may be a single neuron, giving a single value; for a classification task the number of output neurons would correspond to the number of classes.

What makes neural networks powerful compared with the methods used earlier is that they can have additional layers between the input and output layers, referred to as 'hidden layers'. This ability to nest operations lets them capture more complex patterns in the data.

![FFNN](img/nn.png)

This image shows a feed forward neural network, named as such because nothing particularly special is taking place: the outputs of one layer become the inputs to the next layer, and thus the information is fed forward through the network from the input layer to the output layer.

This structure is loosely based on biological structures, such as the brain, where one neuron firing causes other neurons to fire.

As mentioned, this is a feed forward neural network. Other network architectures are used for different tasks; some basic examples are:

- Convolutional Neural Networks - Image Processing
- Recurrent Neural Networks - Time Series Data
- Graph Neural Networks - Graph-Structured Data

However, there are variations of each of these, and developing new architectures is a very active research field, where the balance between model accuracy and computational expense is being ...

#### Feature recognition

Now, how would the network learn these features? Before jumping into the mathematics of this process, let's think about how a human would learn to identify a '9', for example. One may first identify the loop at the top, and then the line that comes down from that loop. When shown another image of a 9, a human can still identify that the image is indeed of a 9, even though it is not quite the same as the first: the loop is different and the line is now curved. When training our network, we hope that it too will be invariant to these small differences and will not classify based on the exact locations of pixels within the image domain, but will instead use structures, or features, within the image that it has learnt to associate with the image label.

This is where the notion of layers comes in ... it is not specific to images: the input could be a time series corresponding to speech, where raw audio must be broken down into distinct sounds that together form syllables, which in turn form words ... the task breaks down into layers of abstraction ...

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),  # matches the (28, 28) images loaded above
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(10, activation='softmax')]) # classifying into 10 classes

model.summary()  # prints the summary itself; wrapping it in print() would also print "None"

#### parameters - weights and biases
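
As a rough check on the parameter counts that `model.summary()` reports, the short sketch below counts weights and biases by hand. It assumes the layer sizes of the model defined above (784 flattened pixels, then 200, 60 and 10 neurons); the helper name `count_dense_parameters` is made up for this illustration.

```python
# Layer widths of the network defined above: 784 inputs (28 x 28 pixels),
# hidden layers of 200 and 60 neurons, and 10 output neurons.
layer_sizes = [28 * 28, 200, 60, 10]

def count_dense_parameters(sizes):
    """Return the number of weights plus biases for each dense layer."""
    counts = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # One weight per (input, output) pair and one bias per output neuron.
        counts.append(n_in * n_out + n_out)
    return counts

per_layer = count_dense_parameters(layer_sizes)
total_parameters = sum(per_layer)
print(per_layer, total_parameters)  # [157000, 12060, 610] 169670
```

Every one of these values is adjusted during training, which is why even this small network has well over a hundred thousand trainable parameters.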

#### activation functions

ReLU, softmax and sigmoid
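
The three functions named above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the Keras implementation, and the input vector `z` is an arbitrary example.

```python
import numpy as np

def relu(z):
    """Pass positive values through unchanged and set negative values to zero."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash each value independently into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into positive values that sum to 1."""
    shifted = z - np.max(z)  # subtracting the maximum avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([-2.0, 0.0, 3.0])  # arbitrary example scores
relu_out = relu(z)
sigmoid_out = sigmoid(z)
softmax_out = softmax(z)
print(relu_out, sigmoid_out, softmax_out)
```

ReLU and sigmoid act on each value separately, whereas softmax couples all the values together, which is why it suits the output layer of a classifier.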

#### maths - matrix
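
One way to see the matrix view of the network is to write the forward pass by hand. The sketch below assumes the layer sizes of the model above and uses random numbers in place of both real images and trained weights, so the resulting probabilities are meaningless; only the shapes and operations are of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 5 flattened images (random stand-ins for real pixels).
X = rng.random((5, 784))

# Randomly initialised parameters with the layer sizes of the model above.
W1, b1 = rng.normal(0.0, 0.05, (784, 200)), np.zeros(200)
W2, b2 = rng.normal(0.0, 0.05, (200, 60)), np.zeros(60)
W3, b3 = rng.normal(0.0, 0.05, (60, 10)), np.zeros(10)

# Each dense layer is a matrix product plus a bias, followed by an activation.
H1 = np.maximum(0.0, X @ W1 + b1)    # ReLU, shape (5, 200)
H2 = np.maximum(0.0, H1 @ W2 + b2)   # ReLU, shape (5, 60)
logits = H2 @ W3 + b3                # shape (5, 10)

# Row-wise softmax turns each row of scores into probabilities.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
print(probs.shape)  # (5, 10)
```

Processing a whole batch is a single matrix product per layer, which is the operation that GPUs are designed to accelerate.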

## How does it 'learn'?

#### Cost function

Cross entropy loss
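
For a one-hot label, the cross-entropy reduces to minus the logarithm of the probability that the network assigns to the correct class. A minimal sketch follows, with made-up probability vectors and a hypothetical helper name:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 0.0, 1.0])              # the correct class is index 2
confident_right = np.array([0.05, 0.05, 0.90])  # most probability on class 2
confident_wrong = np.array([0.90, 0.05, 0.05])  # most probability on class 0

loss_right = categorical_cross_entropy(y_true, confident_right)
loss_wrong = categorical_cross_entropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

The loss is small when the network is confident and correct, and grows quickly when it is confident and wrong, which is the behaviour we want from a cost function for classification.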

#### how do we find minima - high dimensional space

#### Gradient descent (stochastic?)

#### local minima

step size - learning rate

"Learning rate": You cannot update your weights and biases by the whole length of the gradient at each iteration. It would be like trying to get to the bottom of a valley while wearing seven-league boots. You would be jumping from one side of the valley to the other. To get to the bottom, you need to do smaller steps, i.e. use only a fraction of the gradient, typically in the 1/1000th range. This fraction is called the "learning rate".

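The seven-league-boots picture can be reproduced on the simplest possible valley, f(w) = w², whose gradient is 2w. The function and the two learning rates below are chosen purely for illustration and are not values recommended for the MNIST model.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise f(w) = w**2 and return every value of w that was visited."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * w                # derivative of w**2
        w = w - learning_rate * gradient  # step against the gradient
        path.append(w)
    return path

small_steps = gradient_descent(learning_rate=0.1)  # shrinks towards 0
large_steps = gradient_descent(learning_rate=1.1)  # overshoots and grows
print(abs(small_steps[-1]), abs(large_steps[-1]))
```

With the small learning rate, w moves steadily towards the minimum at 0. With the large one, every step jumps to the other side of the valley and lands further away than it started.
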


#### Update parameters - backpropagation

- It is too expensive to compute the gradient over every training example for each gradient descent step.
- Instead, shuffle the data and divide it into mini-batches, then compute each step from one mini-batch. A mini-batch also better represents the constraints imposed by different example images than a single example does, and is therefore likely to converge towards the solution faster.
- The size of this mini-batch is an adjustable parameter.
- This technique, sometimes called "stochastic gradient descent", has another, more pragmatic benefit: working with batches also means working with larger matrices, and these are usually easier to optimise on GPUs and TPUs.
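
The shuffle-and-batch procedure in the list above can be sketched on a toy problem. To keep the gradients easy to read, this example fits a straight line rather than a neural network; the data, learning rate and batch size are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 3x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=1000)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=1000)

w, b = 0.0, 0.0
learning_rate, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    order = rng.permutation(len(x))          # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        error = w * x[batch] + b - y[batch]  # prediction error on this batch
        # Gradients of the mean squared error, estimated from the batch only.
        grad_w = 2.0 * np.mean(error * x[batch])
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)
```

Each update uses only 32 examples, yet the parameters still settle close to the slope and intercept used to generate the data.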

- Momentum

- Exploding and vanishing gradients

The cross-entropy formula involves a logarithm, and log(0) is undefined: in floating point it evaluates to negative infinity, which readily turns into NaN (Not a Number) in later arithmetic, a numerical crash if you prefer. Can the input to the cross-entropy be 0? The input comes from softmax, which is essentially an exponential, and an exponential is never zero. So we are safe!

Really? In the beautiful world of mathematics, we would be safe, but in the computer world, exp(-150), represented in float32 format, is as ZERO as it gets and the cross-entropy crashes.

Fortunately, there is nothing for you to do here either, since Keras takes care of this and computes softmax followed by the cross-entropy in an especially careful way to ensure numerical stability and avoid the dreaded NaNs.
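
The underflow described above is easy to reproduce in NumPy. The sketch below compares the naive route (softmax, then logarithm) with the standard log-sum-exp rearrangement. It illustrates the general idea only and is not a description of how Keras implements it internally; the logits are made up.

```python
import numpy as np

logits = np.array([-150.0, 0.0], dtype=np.float32)  # made-up network outputs
true_class = 0

# Naive route: softmax first, then log. exp(-150) underflows to 0 in float32.
with np.errstate(divide='ignore'):
    naive_probs = np.exp(logits) / np.sum(np.exp(logits))
    naive_loss = -np.log(naive_probs[true_class])

# Stable route: compute log(softmax) directly using the log-sum-exp trick.
shifted = logits - np.max(logits)
log_probs = shifted - np.log(np.sum(np.exp(shifted)))
stable_loss = -log_probs[true_class]

print(naive_loss, stable_loss)  # inf 150.0
```

The naive version loses the information as soon as the exponential underflows, while the rearranged version never takes the logarithm of a value that has underflowed.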

## Optimisation

Configuring the model is done in Keras using the model.compile function. Here we use the 'adam' optimizer, a widely used variant of stochastic gradient descent that adapts the step size for each parameter. A classification model typically uses a cross-entropy loss function; for one-hot encoded labels such as Y_train this is called 'categorical_crossentropy' in Keras. Finally, we ask the model to compute the 'accuracy' metric, which is the fraction of correctly classified images.

In [None]:
# this configures the training of the model. Keras calls it "compiling" the model.
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy'])  # fraction of correct answers


# Train Model

In [None]:
model.fit(X_train, Y_train,
          batch_size=128, epochs=5,
          verbose=1)

# Evaluation

Given an input image, the trained network outputs an array of 10 values that correspond, analogously, to the network's confidence in classifying the image as each digit. I.e., if the network attributes the largest value to the number 'N', this can be interpreted as the network 'guessing' that the image is a number 'N', while, with a small probability, ...

In [None]:
score = model.evaluate(X_test, Y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

In [None]:
# model.predict outputs, for each input example, an array of 10 class probabilities according to the trained classifier.
predicted_classes = model.predict(X_test)

# Check which items we got right / wrong
correct_indices, incorrect_indices = [], []
for index, prediction_array in enumerate(predicted_classes):
    prediction = np.argmax(prediction_array)  #use the largest value of the array as the network's prediction
    Y_label    = np.argmax(Y_test[index])     #use the largest value of the array as the truth label
    if prediction == Y_label:   correct_indices.append(index)
    else:                     incorrect_indices.append(index)

plt.figure(figsize=(9,9))   
for index, correct in enumerate(correct_indices[:9]):
    plt.subplot(3,3,index+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(np.argmax(predicted_classes[correct]), y_test[correct]))
plt.show()
    
plt.figure(figsize=(9,9))
for index, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(3,3,index+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(np.argmax(predicted_classes[incorrect]), y_test[incorrect]))
plt.show()

# Art vs Science - Experimentation

How does increasing the batch size to 10,000 affect the training time and test accuracy?

How about a batch size of 32?

## Overfitting

## Regularisation

Dropout

## Learning rate decay

# Conv Nets?