3D Convolutional Neural Network with tflearn
----------------------------------------------------------------

This template does not include pre-processing. Our dimensions will be height, width, and time.
We will be using tflearn, a higher-level library built on top of the machine learning library TensorFlow.

In [None]:
# Importing libraries
import os  # used later to check whether a saved model exists
import numpy as np
import tflearn
from tflearn.layers.conv import conv_3d, max_pool_3d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression


### Defining constants

Across all of our data, we will need to have the same dimensions of image size and time. These can be modified during preprocessing, but at this step, the input dimensions will be fixed. 

We also have constants that we can adjust in our model: the learning rate, the number of convolutional layers, and the number of pooling layers. If there are fewer pooling layers than convolutional layers, the pooling layers will be inserted periodically between the convolutional layers. For example, with a 3:1 ratio of convolutional layers to pooling layers, there will be one pooling layer per group of three convolutional layers, inserted at the same position in each group.
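
The placement rule described above can be sketched in plain Python. This is only an illustration: the function name `layer_order` is hypothetical, and it assumes that each pooling layer goes after the last convolutional layer of its group, which is one possible reading of "at the same position".

```python
def layer_order(conv_layers, pool_layers):
    """Return a list of 'conv'/'pool' labels (hypothetical placement rule)."""
    if pool_layers <= 0:
        return ['conv'] * conv_layers
    # Number of convolutional layers in each group that shares one pooling layer
    group = max(conv_layers // pool_layers, 1)
    order = []
    pools_placed = 0
    for i in range(1, conv_layers + 1):
        order.append('conv')
        # Assumption: the pooling layer closes each group of convolutional layers
        if i % group == 0 and pools_placed < pool_layers:
            order.append('pool')
            pools_placed += 1
    return order

print(layer_order(3, 1))  # ['conv', 'conv', 'conv', 'pool']
print(layer_order(6, 2))  # ['conv', 'conv', 'conv', 'pool', 'conv', 'conv', 'conv', 'pool']
```

With `CONV_LAYERS = 1` and `POOL_LAYERS = 1`, this gives one convolutional layer followed by one pooling layer.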

In [None]:
# Defining constants
IMG_SIZE = 50 # 50 x 50 image, for example
TIME_LEN = 5 # 5 time points, for example
LR = 1e-3 # learning rate, for example
CONV_LAYERS = 1
POOL_LAYERS = 1
NO_CHANNELS = 1 # how many input streams i.e. a network with 3 channels could have RGB, depth, and body skeleton streams
VERSION = 1 # to keep track of changes in a particular model
NO_SIGNS = 26 # number of signs to learn - i.e. the alphabet
N_EPOCH = 10 # number of epochs
MODEL_NAME = '3dconv-{}--{}--{}--{}--{}-.model'.format(LR, CONV_LAYERS, POOL_LAYERS, NO_CHANNELS, VERSION)

### Building our neural network
#### Input layer

Our first layer will be the input layer. The tflearn function to create the input layer is:
```
input_data(shape = [batch, depth, height, width, number_of_channels], name = 'optional_name_of_layer')
```
The batch dimension should be `None`, so that the layer accepts any number of samples. The depth dimension is our time dimension.
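
As a rough illustration of what this shape means for the data, the sketch below reshapes a stand-in array into the expected layout. The array `clips` is hypothetical (preprocessing is not part of this template); it assumes greyscale clips stored without a channel axis.

```python
import numpy as np

TIME_LEN, IMG_SIZE, NO_CHANNELS = 5, 50, 1  # same values as the constants above

# Hypothetical stand-in for preprocessed data: 8 greyscale clips, no channel axis yet
clips = np.zeros((8, TIME_LEN, IMG_SIZE, IMG_SIZE), dtype=np.float32)

# Add the channel axis so each sample is [depth, height, width, number_of_channels]
X = clips.reshape(-1, TIME_LEN, IMG_SIZE, IMG_SIZE, NO_CHANNELS)
print(X.shape)  # (8, 5, 50, 50, 1)
```

The leading 8 is the batch dimension, which is why the input layer leaves it as `None`.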

#### Convolutional layer

We then need to create our first 3D convolutional layer. The tflearn function to create the 3D convolutional layer is:
```
conv_3d(input, number_of_filters, filter_size, strides = number_of_strides, padding = 'padding', activation = 'activation_function', bias = True or False, weights_init = 'initial_weights', bias_init = 'initial bias', regularizer = None, weight_decay = weight_decay, trainable = True or False, restore = True or False, reuse = True or False, scope = 'optional_layer_scope', name = 'optional_name_of_layer')
```

Parameter definitions
* **input**: 5D tensor with dimensions \[batch, depth, height, width, number_of_channels\]
    * takes the output from a previous layer - for example the input layer
* **number_of_filters**: an integer specifying the number of convolutional filters
    * the depth of the output is given by the number of filters
* **filter_size**: an integer or list of integers indicating the size of the filter
    * for a filter size of 2x2x2, one could enter 2, or \[2 2 2\]
    * in 2D convolutional neural networks, common filter sizes are 2x2 and 3x3
    * a common filter size for 3D CNNs is 3x3x3
* **strides**: an integer or list of integers indicating the number of strides of the filter
    * default = \[1 1 1 1 1\]: 1x1x1 stride, can also be written as 1
    * for a stride of 2x2x2, one can write \[1 2 2 2 1\] (the batch and channel strides must stay 1), or simply 2
    * the most common stride is 1
* **padding**: either 'same' or 'valid'
    * default = 'same'
    * 'same' padding: pads the input with zeros so that, with a stride of 1, the depth, height, and width of the output are the same as those of the input
    * 'valid' padding: there is no padding, so the depth, height, and width of the output are smaller than those of the input
    * 'same' is most commonly used
* **activation**: either 'linear', 'tanh', 'sigmoid', 'softmax', 'softplus', 'softsign', 'relu', 'relu6', 'leaky_relu', 'prelu', 'elu', 'crelu', or 'selu'
    * 'linear' by default
    * the most common activation used in a convolutional layer is 'relu'
    * 'linear': computes f(x) = x
        * used when a straight line is a good approximation of the relationship between input and output
        * works for linear regression analysis, but has very poor performance in convolutional neural networks
    * 'tanh': computes the hyperbolic tangent tanh(x) element-wise 
    * 'sigmoid': computes the sigmoid function sigmoid(x) element-wise
    * 'softmax': computes softmax\[i, j\] = exp(logits\[i, j\]) / sum(exp(logits\[i\])) for each batch i and class j
        * a useful activation function for the fully connected layer 
    * 'softplus': computes log(exp(features) + 1)
    * 'softsign': computes features / (abs(features) + 1)
    * 'relu': computes the rectified linear function max(features, 0)
    * 'relu6': computes min(relu, 6) = min(max(features, 0), 6)
    * other activation functions:
        * 'leaky_relu'
        * 'prelu'
        * 'elu'
        * 'crelu'
        * 'selu'
* **bias**: either True or False
    * indicates whether to use a bias or not
    * default = True
* **weights_init**: either 'zeros', 'uniform', 'uniform_scaling', 'normal', 'truncated_normal', 'xavier', or 'variance_scaling' 
    * default = 'truncated_normal'
    * 'zeros': sets all weights to zero
    * 'uniform': sets all weights to random values from a uniform distribution
    * 'uniform_scaling': sets all weights to random values from a uniform distribution without scaling variance
        * it is better to initialise the weights with uniform_scaling rather than uniform to keep the scale of the input variance constant
    * 'normal': sets all weights to random values from a normal distribution
    * 'truncated_normal': sets all weights to random values from a normal distribution with a specified mean and standard deviation
        * values more than 2 standard deviations from the mean are dropped and re-picked
        * the default mean is 0.0, and the default standard deviation is 0.02
    * 'xavier': performs "Xavier" initialisation for weights
        * designed to keep the scale of the gradients roughly the same in all layers
        * http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
    * 'variance_scaling': initialises weights without scaling variance, has different modes
* **bias_init**: either 'zeros', 'uniform', 'uniform_scaling', 'normal', 'truncated_normal', 'xavier', or 'variance_scaling'
    * the initialisers (zeros, uniform, etc.) work the same as they do for the weights
    * the default is zeros - this is a good starting point for the biases
* **regularizer**: adds a regulariser to the weights
    * by default regularizer = None 
* **weight_decay**: specifies weight decay of regulariser
    * can be ignored unless we decide to modify the regulariser
    * by default 0.001
* **trainable**: indicates whether the weights are trainable - either True or False
    * by default True, which is what we want
* **restore**: indicates whether the layer weights should be restored when loading the model - either True or False
    * by default True, which is what we want
* **reuse**: indicates whether the variables in the layer should be shared with other layers in the scope
    * by default False, we can ignore this unless we would like to share variables across layers
* **scope**: optionally defines the layer's scope - used for sharing variables across layers
* **name**: an optional name for the layer
    * default = 'Conv3D'

The main parameters we will be interested in changing are input, number of filters, filter size, strides, and activation. 

In 2 dimensions, there are some important equations to keep in mind when changing parameters (in 3 dimensions, the same equation also applies to the time dimension):
* output width = (input width - filter size + 2 * padding)/strides + 1
* output height = (input height - filter size + 2 * padding)/strides + 1
* output depth (number of channels) = number of filters

Setting padding to 'same' chooses the padding value in the equations above so that, with a stride of 1, the output width and height are the same as the input width and height.
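
The equations can be checked with a small helper. This is a sketch, not tflearn code: `conv_output_size` is a hypothetical name, and the two branches follow the TensorFlow conventions for 'same' and 'valid' padding along a single axis.

```python
import math

def conv_output_size(input_size, filter_size, stride=1, padding='same'):
    """Output length along one axis (time, height, or width)."""
    if padding == 'same':
        # Zero padding is chosen so that only the stride shrinks the output
        return math.ceil(input_size / stride)
    if padding == 'valid':
        # No padding: (input - filter) / stride + 1, rounded down
        return (input_size - filter_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# A 3x3x3 filter with stride 1 on a 5 x 50 x 50 input (time, height, width)
for padding in ('same', 'valid'):
    print(padding, [conv_output_size(n, 3, 1, padding) for n in (5, 50, 50)])
# same [5, 50, 50]
# valid [3, 48, 48]
```

Note how quickly 'valid' padding shrinks the short time axis, which is one reason 'same' is convenient when there are only 5 time points.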

#### Pooling layer

After one or more convolutional layers, we can add a pooling layer. Pooling has the advantage of reducing the size of the output that is passed on to later layers. In some cases, using fewer pooling layers can improve results, but it is still the general convention to use them.

The tflearn function for pooling in 3D is:
```
max_pool_3d(input, kernel_size, strides = number_of_strides, padding = 'padding', name = 'optional_name_of_layer')
```

Parameter definitions:
* **input**: 5D tensor with dimensions \[batch, depth, height, width, number_of_channels\]
    * takes the output from a previous layer - for example a convolutional layer
* **kernel_size**: an integer or list of integers indicating the pooling kernel size 
    * default = 1, also written as \[1 1 1\] for a 1x1x1 kernel
    * the most common kernel size is 2 (2x2x2 kernel)
* **strides**: an integer or list of integers indicating the number of strides of the kernel
    * default = \[1 1 1 1 1\]: 1x1x1 stride, can also be written as 1
    * for a stride of 2x2x2, one can write \[1 2 2 2 1\] (the batch and channel strides must stay 1), or simply 2
    * the most common stride is 2
* **padding**: either 'same' or 'valid'
    * default = 'same'
* **name**: an optional name for the layer
    * default = 'MaxPool3D'

In two dimensions there are some important equations to keep in mind when changing parameters:
* output width = (input width - kernel size)/strides + 1
* output height = (input height - kernel size)/strides + 1
* output depth = input depth

Two common settings for the parameters (in 2 dimensions) are:
* most common: kernel size = 2, strides = 2
* overlapping pooling: kernel size = 3, strides = 2
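
To make the pooling operation itself concrete, here is a minimal NumPy sketch of max pooling over a single 3D volume without padding. The function `max_pool_3d_sketch` is hypothetical and written for readability rather than speed; it is not how tflearn implements `max_pool_3d`.

```python
import numpy as np

def max_pool_3d_sketch(volume, kernel=2, stride=2):
    """Max pooling over a [depth, height, width] array with no padding."""
    # Output size along each axis: (input - kernel) / stride + 1, rounded down
    out_shape = [(n - kernel) // stride + 1 for n in volume.shape]
    out = np.empty(out_shape, dtype=volume.dtype)
    for i in range(out_shape[0]):
        for j in range(out_shape[1]):
            for k in range(out_shape[2]):
                window = volume[i * stride:i * stride + kernel,
                                j * stride:j * stride + kernel,
                                k * stride:k * stride + kernel]
                # Keep only the largest value in each window
                out[i, j, k] = window.max()
    return out

volume = np.arange(4 * 4 * 4).reshape(4, 4, 4)
pooled = max_pool_3d_sketch(volume, kernel=2, stride=2)
print(pooled.shape)     # (2, 2, 2)
print(pooled[0, 0, 0])  # 21
```

With overlapping pooling (kernel 3, stride 2) a 5 x 5 x 5 volume would also reduce to 2 x 2 x 2, since (5 - 3)/2 + 1 = 2.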

#### Fully connected layer

We will have two or three fully connected layers. The first fully connected layer will connect to the last convolutional or pooling layer. The number of neurons (units) in the first fully connected layer is generally the number of filters in the previous convolutional layer multiplied by 8. The first layer generally has a rectified linear ('relu') or hyperbolic tangent ('tanh') activation function. If there are three fully connected layers, the second will have the same number of units and the same activation function as the first. The last layer will have as many neurons as there are classes - i.e. if we are classifying the letters of the alphabet, there will be 26 neurons. The activation function of the last layer is 'softmax' in all of the examples I have seen. There is sometimes dropout between the fully connected layers; this is discussed further in the dropout section.

The tflearn function for the fully connected layer is:
```
fully_connected(input, number_of_units, activation = 'activation_function', bias = True or False, weights_init = 'initial_weights', bias_init = 'initial bias', regularizer = None, weight_decay = weight_decay, trainable = True or False, restore = True or False, reuse = True or False, scope = 'optional_layer_scope', name = 'optional_name_of_layer')
```

Parameter definitions:
* **input**: 5D tensor with dimensions \[batch, depth, height, width, number_of_channels\] flattened into a 2D tensor
    * takes and flattens the output from a previous layer - for example a pooling layer
* **number_of_units**: integer indicating the number of neurons in the fully connected layer
    * fully connected layers before the last layer have an arbitrary number of neurons - but it is good to follow some standards (i.e. number of neurons = number of filters in previous conv layer * 8)
    * last fully connected layer has as many neurons as classes (i.e. 26 - see above)
* **activation**: see activation definition above under 'Convolutional Layer'
    * for the fully connected layers excluding the last layer, the activation function is generally the same as the activation function used in the convolutional layers
    * for the last fully connected layer, the activation function is usually 'softmax'
* **bias**: either True or False
    * indicates whether to use a bias or not
    * default = True
* **weights_init**: see weights_init definition above under 'Convolutional Layer'
* **bias_init**: see bias_init definition above under 'Convolutional Layer'
* **regularizer**: adds a regulariser to the weights
    * by default regularizer = None 
* **weight_decay**: specifies weight decay of regulariser
    * can be ignored unless we decide to modify the regulariser
    * by default 0.001
* **trainable**: indicates whether the weights are trainable - either True or False
    * by default True, which is what we want
* **restore**: indicates whether the layer weights should be restored when loading the model - either True or False
    * by default True, which is what we want
* **reuse**: indicates whether the variables in the layer should be shared with other layers in the scope
    * by default False, we can ignore this unless we would like to share variables across layers
* **scope**: optionally defines the layer's scope - used for sharing variables across layers
* **name**: an optional name for the layer
    * default = 'FullyConnected'
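
As a worked example of the sizes involved, the sketch below counts how many values reach the first fully connected layer and applies the "filters multiplied by 8" rule of thumb. It assumes one 'same' convolution with 32 filters and stride 1, followed by one 'same' pooling layer with stride 2; `N_FILTERS` is a name introduced only for this illustration.

```python
import math

TIME_LEN, IMG_SIZE = 5, 50  # constants defined earlier
N_FILTERS = 32              # filters in the last convolutional layer (example value)

# The 'same' convolution with stride 1 keeps [5, 50, 50];
# 'same' pooling with stride 2 gives ceil(size / 2) along each axis
pooled = [math.ceil(n / 2) for n in (TIME_LEN, IMG_SIZE, IMG_SIZE)]

# Values per sample once the pooled output is flattened
flattened_inputs = pooled[0] * pooled[1] * pooled[2] * N_FILTERS

# Rule of thumb for the number of units in the first fully connected layer
fc_units = N_FILTERS * 8

print(pooled)            # [3, 25, 25]
print(flattened_inputs)  # 60000
print(fc_units)          # 256
```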
    
#### Dropout layer
The dropout layer is an optional layer inserted between fully connected layers. At each training step, a dropout layer ignores some randomly selected neurons, meaning that they are not considered in that forward or backward pass. Dropout is used to reduce overfitting, where the network fits the training data too closely and generalises poorly to new data. It does this by preventing the neurons in a layer from becoming dependent on each other during training, which would reduce their individual power. The fraction of neurons kept is determined by the keep probability. A keep probability of 0.8 often works well, but many networks also use a keep probability of 0.5.

The tflearn function for the dropout layer is:
```
dropout(input, keep_probability, noise_shape=None, name='optional_layer_name')
```

Parameter definitions:
* **input**: a tensor
    * the 2D tensor output of a fully connected layer (dimensions = \[samples, number_of_units\])
* **keep_probability**: a float representing the probability that each neuron is kept
* **noise_shape**: a one dimensional tensor, representing the shape of the neurons kept and dropped
    * by default, each element is kept or dropped independently
    * if noise shape is specified, only dimensions with noise_shape\[i\] = shape(input)\[i\] will be kept or dropped independently
    * the documentation's example: if shape(input) = \[k, l, m, n\] and noise_shape = \[k, 1, 1, n\], each batch and channel component will be kept independently and each row and column will be kept or not kept together
    * at this point we will ignore noise shape
* **name**: an optional name for the layer
    * default = 'Dropout'
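
The effect of the keep probability can be illustrated with a few lines of NumPy. This is a sketch of the usual "inverted dropout" scheme (kept values are scaled by 1 / keep probability), not tflearn's own code, and `dropout_forward` is a hypothetical name.

```python
import numpy as np

def dropout_forward(activations, keep_prob, rng):
    # Keep each element independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept values by 1 / keep_prob so the expected activation is unchanged
    return np.where(mask, activations / keep_prob, 0.0)

rng = np.random.default_rng(0)
activations = np.ones((4, 256))  # [samples, number_of_units]
dropped = dropout_forward(activations, 0.8, rng)

print(np.unique(dropped))  # each value is either 0.0 or about 1.25
print(dropped.mean())      # close to 1.0
```

Dropout is only applied during training; at prediction time all neurons are kept.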
    
#### Regression layer

The regression layer is used to apply linear or logistic regression to the output of the convolutional neural network (the final fully connected layer). It needs an optimiser (a variant of gradient descent) and the loss function that the optimiser will minimise. Rather than being a layer of the network in the usual sense, the regression layer specifies *how* the convolutional network is trained.

The tflearn function for the regression layer is:
```
regression(input, optimizer = 'gradient_descent_optimiser', loss='loss_function', metric = 'metric_used', learning_rate = learning_rate, dtype=tf.float32, batch_size = batch_size, shuffle_batches = True or False, to_one_hot = True or False, n_classes = None, trainable_vars = None, restore = True or False, op_name = None, validation_monitors = None, validation_batch_size = None, name = 'optional_name_of_layer')
```

Parameter definitions:
* **input**: 2-dimensional tensor from the last fully connected layer
* **optimizer**: either 'sgd', 'rmsprop', 'adam', 'momentum', 'adagrad', 'ftrl', 'adadelta', 'proxi_adagrad', or 'nesterov'
    * 'sgd': uses stochastic gradient descent
        * the basic optimiser for neural networks
        * can specify learning rate decay - by default none, but it's often recommended to lower the learning rate as training progresses
        * to specify learning rate decay: 
            * `sgd = tflearn.optimizers.SGD(learning_rate=LR, lr_decay=0.5, decay_step=100, staircase=False, use_locking=False, name='SGD') `
            * `conv_net = regression(conv_net, optimizer = sgd) `
    * 'rmsprop': maintains a moving average of the square of gradients, and divides the gradient by the root of this average
    * 'adam': calculates an exponential moving average of the gradient and the squared gradient
        * parameters beta1 and beta2 control the decay rates of these moving averages
        * combines the advantages of the adaptive gradient algorithm (adagrad) and root mean square propagation (rmsprop)
        * a very popular optimiser for CNNs
    * 'momentum': helps accelerate stochastic gradient descent in the relevant direction and dampens oscillations
        * can also specify learning rate decay - by default none
    * 'adagrad': improves stochastic gradient descent by adapting the learning rate to the parameters
        * decreases learning rates for parameters associated with commonly occurring features
        * increases learning rates for parameters associated with infrequently occurring features
    * 'ftrl': implements the Ftrl-proximal algorithm (Follow-the-regularized-leader)
        * uses its own global base learning rate and can behave like Adagrad with learning_rate_power=-0.5, or like gradient descent with learning_rate_power=0.0.
        * how to change learning power:
            * `ftrl = tflearn.optimizers.Ftrl(learning_rate=0.01, learning_rate_power=-0.1)`
            * `conv_net = regression(conv_net, optimizer = ftrl)`
    * 'adadelta': extension of adagrad that attempts to reduce the aggressiveness of adagrad decrease in learning rate
        * instead of storing all past squared gradients, adadelta restricts the window of stored past squared gradients to some fixed size w
    * 'proxi_adagrad': see http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf
    * 'nesterov': very similar to momentum optimiser, but calculates the gradient at the approximated new position rather than the current position
* **loss**: either 'softmax_categorical_crossentropy', 'categorical_crossentropy', 'binary_crossentropy', 'weighted_crossentropy', 'mean_square', 'hinge_loss', 'roc_auc_score', or 'weak_cross_entropy_2d'
    * note: in tflearn the loss functions are under `objectives`, and NOT `losses`
    * default: 'categorical_crossentropy'
    * 'softmax_categorical_crossentropy': computes softmax cross entropy between the predicted output value and the actual output value
    * 'categorical_crossentropy': computes cross entropy between the predicted output value and the actual output value
        * we will be using this for now
    * 'binary_crossentropy': computes sigmoid cross entropy between the predicted output value and the actual output value
    * 'weighted_crossentropy': computes a weighted sigmoid cross entropy between the predicted output value and the actual output value
    * 'mean_square': computes the mean square error between the predicted output value and the true output value
    * 'hinge_loss': computes the loss using the hinge loss function 
    * 'roc_auc_score': approximates the area under the curve score, measuring overall performance for a full range of threshold levels
    * 'weak_cross_entropy_2d': calculates the semantic segmentation using weak softmax cross entropy loss
* **metric**: either 'acc', 'top5', 'r2', 'weighted_r2'
    * default is 'acc'
    * 'acc': computes the accuracy of the model - this is what we will use
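    * 'top5': computes top-k mean accuracy of the model with k = 5 (a prediction counts as correct if the true class is among the 5 highest-scoring classes)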
    * 'r2': computes the coefficient of determination - for linear regression
    * 'weighted_r2': also computes the coefficient of determination - for linear regression
* **learning_rate**: the optimiser's learning rate (the constant LR)
    * default is 0.001
    * we will be adjusting learning rate
    * if the learning rate is too small, gradient descent is very slow
    * if the learning rate is too high, gradient descent can overshoot the minimum
        * may fail to converge
    * a good range is 0.001 to 0.01
* **dtype**: layer's placeholder type
    * default = tf.float32
    * we can ignore this
* **batch_size**: batch size of data for training 
    * default = 64
    * we should be able to ignore this
* **shuffle_batches**: whether to shuffle the batches or not at each epoch
    * default = True
    * we will always want this to be true
* **to_one_hot**: if True, labels will be encoded to one hot vectors
    * default = False
    * we should be able to ignore this
* **n_classes**: only applicable if to_one_hot = True
    * default = None
    * we should be able to ignore this 
* **trainable_vars**: list of variables, that if specified, will be the only variables that have updated weights
    * otherwise all variables will be updated - what we want
    * leave this
* **restore**: determines whether variables related to the optimiser will be restored when loading a model
    * default = True - what we want
    * leave this
* **op_name**: optional name for layer's optimiser
* **validation_monitors**: list of variables to compute during validation
    * default = None
    * we should be able to ignore this
* **validation_batch_size**: specifies batch size for validation
    * default = None
    * we should be able to ignore this
* **name**: optional name for layer
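
To give a feel for what the 'categorical_crossentropy' loss measures, here is a small NumPy sketch. It is an illustration of the formula only, not tflearn's implementation; the helper names `one_hot` and `categorical_crossentropy` are defined here for the example.

```python
import numpy as np

def one_hot(labels, n_classes):
    # Row i has a 1 in column labels[i] and 0 elsewhere
    return np.eye(n_classes)[labels]

def categorical_crossentropy(y_pred, y_true, eps=1e-10):
    # Mean over samples of -sum(y_true * log(y_pred)); clipping avoids log(0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

NO_SIGNS = 26
y_true = one_hot(np.array([0, 3]), NO_SIGNS)
uniform = np.full((2, NO_SIGNS), 1.0 / NO_SIGNS)  # an untrained, uniform guess

print(categorical_crossentropy(uniform, y_true))  # about log(26) = 3.26
print(categorical_crossentropy(y_true, y_true))   # about 0 for a perfect prediction
```

A loss near log(26) at the start of training therefore means the network is doing no better than guessing uniformly among the 26 signs.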

In [None]:
conv_net = input_data(shape = [None, TIME_LEN, IMG_SIZE, IMG_SIZE, NO_CHANNELS], name = 'input')

conv_net = conv_3d(conv_net, 32, 3, strides = 1, padding = 'same', activation = 'relu')

conv_net = max_pool_3d(conv_net, 2, strides = 2)

conv_net = fully_connected(conv_net, 256, activation = 'relu')

conv_net = dropout(conv_net, 0.8)

conv_net = fully_connected(conv_net, NO_SIGNS, activation = 'softmax')

conv_net = regression(conv_net, optimizer = 'adam', learning_rate = LR, loss = 'categorical_crossentropy', name = 'targets')


### Creating and training our model
#### Creating the model
```
tflearn.DNN(network, clip_gradients = gradient_clipping_float, tensorboard_verbose = tensorboard_verbose_level_integer, tensorboard_dir = 'directory_to_store_tensorboard_logs', checkpoint_path = None, best_checkpoint_path = None, max_checkpoints = None, session = None, best_val_accuracy = minimum_accuracy_for_best_checkpoint_path)
```
Parameter definitions:

* **network**: our fully built neural network 
* **clip_gradients**: determines the threshold that gradients should be clipped to
    * default = 5.0
    * this prevents the gradient value from getting too large and causing overflow or causing the model to overshoot the minimum
    * documentation is fairly scarce on this, but I think it's safe to just ignore this
* **tensorboard_verbose**: tensorboard summary verbose level (0, 1, 2, or 3)
    * 0: Loss, Accuracy
    * 1: Loss, Accuracy, Gradients
    * 2: Loss, Accuracy, Gradients, Weights
    * 3: Loss, Accuracy, Gradients, Weights, Activations, Sparsity
    * default = 0
* **tensorboard_dir**: file directory to store tensorboard logs in 
* **checkpoint_path**: path to store model checkpoints in
    * if None, no model checkpoints will be saved
    * default = None
* **best_checkpoint_path**: path to store the checkpoint where the validation accuracy reaches the highest point in the session
    * validation accuracy must also be above the best_val_accuracy
    * default = None
* **max_checkpoints**: maximum number of checkpoints
    * if None, no limit
    * default = None
* **session**: a session for running operations
    * if None, a new session will be created
    * when providing a session, variables must have already been initialised
    * default = None (what we want)
* **best_val_accuracy**: the minimum validation accuracy needed to save a checkpoint in best_checkpoint_path
    * default = 0.0
    

In [None]:
model = tflearn.DNN(conv_net, tensorboard_dir = 'log')

In [None]:
# If you have already saved a checkpoint - network already trained
if os.path.exists('{}.meta'.format(MODEL_NAME)):
    model.load(MODEL_NAME)
    print('model loaded')

In [None]:
# to reset tensorboard

# import tensorflow as tf
# tf.reset_default_graph()

#### Fitting the model

This trains the model using inputs (X) and targets (Y).

```
model.fit(X_inputs, Y_targets, n_epoch = number_of_epochs, validation_set = ({'name_of_input_layer':test_inputs}, {'name_of_estimator_layer':test_targets}), show_metric = True or False, batch_size = None, shuffle = None, snapshot_epoch = True or False, snapshot_step = steps_between_snapshots, excl_trainops = None, validation_batch_size = None, run_id = None, callbacks = [])
```
Parameter definitions:

* **X_inputs**: the features (i.e. videos) in an array, list of arrays, or dictionary
    * our data will be in a dictionary - {'name_of_input_layer':X}
        * the name of the input layer is the key
        * X is a numpy array of features (i.e. videos) 
* **Y_targets**: the labels (i.e. which letter the video shows) in an array, list of arrays, or dictionary
    * our data will be in a dictionary - {'name_of_estimator_layer':Y}
        * the name of the estimator layer (for us the estimator layer is the regression layer) is the key
        * Y is a numpy array of labels 
* **n_epoch**: the number of epochs to run
    * default = 10
* **validation_set**: test features and labels to determine the accuracy of the model
    * input and targets are shown in the dictionary format, but can also be in the other formats that X_inputs and Y_targets are in
    * default = None
* **show_metric**: whether to display accuracy at each step
    * True = display accuracy
    * default = False
* **batch_size**: if an integer (not None), overrides the estimator's batch size by the integer value
    * default = None
    * we can ignore this
* **shuffle**: True or False or None - if True or False, overrides the estimator's shuffle parameter by True or False
    * default = None
    * we can ignore this
* **snapshot_epoch**: True or False
    * if True, snapshots the model (evaluating it with the validation set) at the end of every epoch
    * default = True
* **snapshot_step**: if an integer (not None), the model takes a snapshot every integer value steps
    * default = None
    * it is useful to set this to an integer value if the epochs take a long time to run or the batches are large
* **excl_trainops**: a list of training operations to exclude from the training process
    * default = None
    * we can ignore this
* **validation_batch_size**: if an integer (not None), overrides the estimator's validation batch size by the integer value
    * default = None
    * we can ignore this
* **run_id**: name for the run
    * default = None
    * it may be useful to change this when we're using tensorboard
* **callbacks**: custom callbacks to use in training
    * ignore this

In [None]:
model.fit({'input':X}, {'targets':Y}, n_epoch = N_EPOCH, validation_set = ({'input':test_X}, {'targets':test_Y}), 
          snapshot_step = 50, show_metric = True, run_id = MODEL_NAME) 


### Using tensorboard
Copy and paste this into the command prompt, replacing the path with the location of your own `log` directory:
```
tensorboard --logdir="foo:C:/Users/gabri/Documents/Python Scripts/Project/3D CNN/log"
```
Then open the link that tensorboard displays in the command prompt.