## Neural Networks: Keras

In [None]:
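from numpy import zeros # array used below to store the cost values
from keras import optimizers # provides optimizers.SGD
from keras.models import Sequential # the usual NN: a stack of several layers
from keras.layers import Dense # fully connected layer (all weights present)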

#### Define a NN with 
    1 input layer - 2 neurons
    3 hidden layers - 150 neurons / 150 neurons / 100 neurons
    1 output layer - 1 neuron

In [None]:
net=Sequential() 
net.add(Dense(150, input_shape=(2,),activation='relu')) # input_shape = number of input neurons
net.add(Dense(150, activation='relu'))
net.add(Dense(100, activation='relu'))
net.add(Dense(1, activation='relu'))

net.compile(loss='mean_squared_error', 
            optimizer=optimizers.SGD(lr=0.1),
            metrics=['accuracy'])
# Makebatch
#def make_batch()
#    y_in=...
#    y_target=...

batchsize=20
batches=200
costs=zeros(batches)

for k in range(batches):
    y_in,y_target=make_batch()   # y_in dim: batchsize x 2 / y_target dim: batchsize x 1
    costs[k]=net.train_on_batch(y_in,y_target)[0]
# train_on_batch returns a list of numbers that tell you how well the training is going;
# the first of them, [0], is the value of the cost function at that training step.
# The learning progress is tracked by saving these values in costs[].

y_out=net.predict_on_batch(y_in)

#### Handwriting recognition (MNIST)

    - distinguish categories
    - softmax nonlinearity for probability distributions
    - categorical cross-entropy cost function
    - training/validation/test data
    - overfitting and some solutions
    
Input: a 28x28 image = 784 gray values -> NN -> output: the category classification in 'one-hot encoding'

Since we have 10 different handwritten digits (from 0 to 9), we have 10 output neurons. For example:

           Neuron responsible for digit (#)
                     - 0 (0)    - reality -  0.1
                     - 0 (1)    - reality -  0
                     - 0 (2)    - reality -  0
                     - 0 (3)    - reality -  0
                     - 0 (4)    - reality -  0.1
    input: 6 ->  NN
                     - 0 (5)    - reality -  0.1
                     - 1 (6)    - reality -  0.7
                     - 0 (7)    - reality -  0
                     - 0 (8)    - reality -  0
                     - 0 (9)    - reality -  0
                     
* 'One-hot encoding' = only ONE neuron is activated (hot) and all the others are 0, e.g., 0000001000 for the digit 6.
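
As a minimal sketch of the encoding described above (the helper names `one_hot` and `decode` are illustrative, not part of Keras), a label can be turned into a one-hot list and a network output can be turned back into a digit by picking the most activated neuron:

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1 at position `label` and 0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label must lie in [0, num_classes)")
    return [1 if j == label else 0 for j in range(num_classes)]


def decode(outputs):
    """Return the index of the most activated output neuron."""
    return max(range(len(outputs)), key=lambda j: outputs[j])


target = one_hot(6)                                       # desired output for the digit 6
network_output = [0.1, 0, 0, 0, 0.1, 0.1, 0.7, 0, 0, 0]   # the 'reality' column above
print(target)                  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(decode(network_output))  # 6
```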

The input consists of many image pixels (784 here); the output is the category that tells me what the image represents.

**Probabilities should always be normalized: the values of all the output neurons should sum to 1**
    
    -> the normalization must be enforced in the step from the last hidden layer to the output layer
    -> to do so: use a non-linear function that depends on all values simultaneously!
    -> MULTIVARIABLE GENERALIZATION OF THE SIGMOID FUNCTION: the SOFTMAX activation function

$$f_j(z_1,z_2,\ldots)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$$
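
The softmax formula can be sketched in plain Python as follows. Subtracting the largest input before exponentiating is a common numerical safeguard that is assumed here (it is not part of the formula above); it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import math


def softmax(z):
    """Softmax of a list of numbers: f_j = exp(z_j) / sum_k exp(z_k)."""
    z_max = max(z)  # shift to avoid overflow in exp; the ratio is unaffected
    exps = [math.exp(z_j - z_max) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)       # larger inputs receive larger probabilities
print(sum(probs))  # normalized: sums to 1 up to rounding
```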

In [None]:
net.add(Dense(10, activation='softmax'))

**Cost function**

Entropy of any probability distribution (S is non-negative, and additive for factorizable distributions): $S=-\sum_j p_j \ln p_j$


Categorical cross-entropy cost function: it compares two distributions, where the 'y' are probabilities: $C=-\sum_j y_j^{target}\ln y_j^{out}$,

where $y_j^{target}=F_j(y^{in})$ is the desired 'one-hot' classification (in the MNIST handwriting case).
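
A minimal sketch of this cost function in plain Python, evaluated on the digit-6 example from above. The small `eps` floor is an assumption added here to avoid taking the logarithm of zero; it is not part of the formula.

```python
import math


def categorical_cross_entropy(y_target, y_out, eps=1e-12):
    """C = -sum_j y_target[j] * ln(y_out[j]); eps guards against ln(0)."""
    return -sum(t * math.log(max(o, eps)) for t, o in zip(y_target, y_out))


y_target = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]            # one-hot target for the digit 6
y_out = [0.1, 0, 0, 0, 0.1, 0.1, 0.7, 0, 0, 0]       # network output from the example above
cost = categorical_cross_entropy(y_target, y_out)    # only the 'hot' term survives: -ln(0.7)
print(cost)
```

With a one-hot target only the term of the correct class contributes, so the cost is simply minus the logarithm of the probability assigned to the right digit.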

An advantage of using this function is that its derivative doesn't get exponentially small when one neuron is very close to 1 and the others to 0.

$f_j(z_1,z_2,\ldots)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$ -> $\frac{\partial}{\partial w}\ln f_j(z) = \frac{\partial z_j}{\partial w} - \frac{\sum_k \frac{\partial z_k}{\partial w}e^{z_k}}{\sum_k e^{z_k}}$

By contrast, for the quadratic cost function:

$f_j(z_1,z_2,...)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$ -> $\frac{\partial}{\partial w}\sum_j(f_j(z)-y_j^{target})^2=2\sum_j(f_j(z)-y_j^{target})\frac{\partial f_j(z)}{\partial w}$

Training may get stuck for a long time -> the slope of $f_j$ becomes exponentially small when the values of $z$ become large (the softmax saturates).
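
The difference between the two cost functions can be checked numerically. The sketch below takes derivatives with respect to the softmax inputs $z_k$ rather than the weights $w$ (the factor $\partial z_k/\partial w$ from the chain rule is common to both cases), and uses the standard identity $\partial f_j/\partial z_k = f_j(\delta_{jk}-f_k)$. The input values are made up for illustration: the network is confidently wrong.

```python
import math


def softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def grad_cross_entropy(z, y_target):
    """dC/dz_k for C = -sum_j y_j ln f_j(z), assuming sum_j y_j = 1: f_k - y_k."""
    f = softmax(z)
    return [f_k - y_k for f_k, y_k in zip(f, y_target)]


def grad_quadratic(z, y_target):
    """dC/dz_k for C = sum_j (f_j(z) - y_j)^2, with df_j/dz_k = f_j (delta_jk - f_k)."""
    f = softmax(z)
    n = len(z)
    return [
        2 * sum((f[j] - y_target[j]) * f[j] * ((1.0 if j == k else 0.0) - f[k])
                for j in range(n))
        for k in range(n)
    ]


z = [10.0, 0.0, 0.0]   # output 0 is saturated close to 1 ...
y_target = [0, 1, 0]   # ... but the correct class is 1
g_ce = grad_cross_entropy(z, y_target)
g_quad = grad_quadratic(z, y_target)
print(max(abs(g) for g in g_ce))    # of order 1: a strong learning signal
print(max(abs(g) for g in g_quad))  # of order 1e-4: training barely moves
```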

In [None]:
net.compile(loss='categorical_crossentropy',optimizer=optimizers.SGD(lr=1.0),metrics=['categorical_accuracy'])

##### Training on MNIST images

    training_inputs  : array num_samples x numpixels
    training_results : array num_samples x 10 ('onehot')
    
**One epoch means training once on ALL 50000 training images, fed into the net in batches of size 100**; here we do 30 epochs.
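
The counting behind these numbers can be sketched as follows. Keras handles the batching internally; the generator `batch_slices` is a hypothetical helper that only illustrates how one epoch covers every sample exactly once.

```python
num_samples = 50000   # training images
batch_size = 100
epochs = 30

steps_per_epoch = num_samples // batch_size   # weight updates in one epoch
total_steps = steps_per_epoch * epochs        # weight updates over the whole training


def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs that cover all samples exactly once."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


print(steps_per_epoch)  # 500
print(total_steps)      # 15000
```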

In [None]:
history=net.fit(training_inputs, training_results, batch_size=100, epochs=30)

*Accuracy (the fraction of images recognized correctly) may be very high during training (97%, i.e., <3% error). However, when we test, about 7% of the images are labeled incorrectly -> assessing accuracy on the training set may yield overly optimistic results -> compare with samples that are not used for training (test set).*

    [VALIDATION SET](5000 images)*
    [ TRAINING SET ](45000 images)
    ----------------
    [   TEST SET   ](10000 images)

\* *Idea: use cross-validation on the training set to build the validation set.*
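
A minimal sketch of carving a validation set out of the training data. This is a single random hold-out split, which is simpler than full k-fold cross-validation; the function name `split_indices` and the fixed seed are assumptions made for illustration.

```python
import random


def split_indices(num_samples, num_validation, seed=0):
    """Shuffle the sample indices and split them into (training, validation) lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[num_validation:], indices[:num_validation]


train_idx, val_idx = split_indices(50000, 5000)
print(len(train_idx), len(val_idx))  # 45000 5000
```

The resulting index lists can then be used to select the corresponding rows of the input and target arrays. The test set stays completely separate and is only used at the very end.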

**IF THE VALIDATION ACCURACY STARTS TO DECREASE AS THE EPOCHS GO ON -> OVERFITTING: the NN memorizes the training samples -> it cannot generalize to unfamiliar data**

    - ALWAYS measure accuracy against validation data, independent of training data
    - Stop after reaching maximum in validation accuracy
    - Generate fresh training data by distorting existing images (transformations: rotations, scale up, etc.)
    - Dropout -> set random neuron values to zero during training, so that the network has to cope with that noise and never learns too much detail.
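
Two of the remedies listed above can be sketched in a few lines of plain Python. The validation accuracies are made-up numbers for illustration only. Rescaling the surviving values by 1/(1-rate) is a common convention ("inverted dropout") assumed here; the notes above only require that random values are set to zero.

```python
import random


def best_epoch(validation_accuracies):
    """Index of the epoch with the highest validation accuracy (first one on ties)."""
    return max(range(len(validation_accuracies)),
               key=lambda e: validation_accuracies[e])


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors by 1/(1-rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


val_acc = [0.90, 0.93, 0.95, 0.94, 0.92]   # made-up numbers: accuracy peaks, then drops
print(best_epoch(val_acc))                 # 2 -> stop training here

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))  # some entries zeroed at random
```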

## Convolutional networks:

Translational invariance -> different images with the same meaning (e.g., an image of a 9, but shifted)

    Convolutions:
$F^{new}(x)=\int K(x-x')F(x')\,dx'$, where $K$ is the kernel (it depends only on the difference of coordinates)
    
    Convolutional layer:
    
    For a kernel of size 3, we have three weights w1, w2 and w3, which define the filter. The SAME weight values are used for the different neurons in the same layer, so only 3 weight values are stored. This essentially scans the kernel over the original image: we learn the kernel weights.
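
A minimal 1D sketch of scanning a size-3 kernel over an input. The weights and signals are made-up values. Note that, as is usual for convolutional layers, the kernel is applied without flipping its index, which differs from the integral above only by a relabeling of the learned weights.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding): out[i] = sum_j kernel[j] * signal[i + j]."""
    m = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(m))
            for i in range(len(signal) - m + 1)]


kernel = [1, 2, 1]                    # the three shared weights w1, w2, w3
signal = [0, 0, 1, 2, 1, 0, 0, 0]     # a small feature ...
shifted = [0, 0, 0, 0, 1, 2, 1, 0]    # ... and the same feature moved 2 steps to the right

print(conv1d(signal, kernel))   # [1, 4, 6, 4, 1, 0]
print(conv1d(shifted, kernel))  # [0, 0, 1, 4, 6, 4]
```

The response to the shifted input is the shifted response to the original input: the same three weights detect the feature wherever it appears.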

- Exploit translational invariance: features learned in one part of an image will be automatically recognized in different parts. 

- Drastic reduction of the number of weights stored: $N^2$ for a fully connected layer ($N$ being the size of the layer/image) versus $M$ for a convolutional layer ($M$ being the size of the kernel).

- The number of weights is independent of the size of the image: lower memory consumption, improved speed.
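
To put numbers on the weight counts in the list above, here is a small sketch for a flattened MNIST image. Biases are left out, and the helper names are illustrative.

```python
def dense_weights(n_in, n_out):
    """Number of weights of a fully connected layer (biases not counted)."""
    return n_in * n_out


def conv_weights(kernel_size):
    """Number of weights of a single convolutional filter (bias not counted)."""
    return kernel_size


n = 28 * 28                   # a flattened 28x28 image: N = 784
print(dense_weights(n, n))    # 614656, i.e. N^2
print(conv_weights(3))        # 3, whatever the image size
```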

