## Neural Networks: Keras

In [None]:
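from numpy import zeros # used below to store the cost values
from keras import optimizers
from keras.models import Sequential # Usual NN with several layers
from keras.layers import Dense # fully connected NN (all weights there)
from keras.layers import Conv2D, AveragePooling2D, UpSampling2D, Flatten # used in the conv nets below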

## SUPERVISED LEARNING: 

#### Define a NN with 
    1 input layer - 2 neurons
    3 hidden layers - 150 neurons / 150 neurons / 100 neurons
    1 output layer - 1 neuron

In [None]:
net=Sequential() 
net.add(Dense(150, input_shape=(2,),activation='relu')) # input_shape = number of input neurons
net.add(Dense(150, activation='relu'))
net.add(Dense(100, activation='relu'))
net.add(Dense(1, activation='relu'))

net.compile(loss='mean_squared_error', 
            optimizer=optimizers.SGD(lr=0.1),
            metrics=['accuracy'])
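# make_batch() is not defined in these notes; it must return one batch of inputs and targets:
#def make_batch():
#    y_in=...
#    y_target=...
#    return y_in, y_target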
batchsize=20
batches=200
costs=zeros(batches)

for k in range(batches):
    y_in,y_target=make_batch()   # y_in dim: batchsize x 2 / y_target dim: batchsize x 1
    costs[k]=net.train_on_batch(y_in,y_target)[0]
# train_on_batch returns some numbers that tell you how well training is going;
# the first of these numbers [0] is the value of the cost function
# at that training step. Saving these values in costs[] keeps track of the learning progress.

y_out=net.predict_on_batch(y_in)
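
The loop above calls `make_batch()`, which these notes leave undefined. A minimal sketch of what it could look like, assuming (purely for illustration) that the target is a Gaussian bump over the 2D input. The target function, the sampling range and the `rng` argument are hypothetical choices, not part of the original notes:

```python
import numpy as np

def make_batch(batchsize=20, rng=None):
    """Return one random training batch (hypothetical example target)."""
    rng = np.random.default_rng() if rng is None else rng
    # y_in: batchsize x 2, points drawn uniformly from [-1, 1] x [-1, 1]
    y_in = rng.uniform(-1.0, 1.0, size=(batchsize, 2))
    # y_target: batchsize x 1, a Gaussian bump centered at the origin (assumed target)
    y_target = np.exp(-4.0 * np.sum(y_in**2, axis=1, keepdims=True))
    return y_in, y_target

y_in, y_target = make_batch()
print(y_in.shape, y_target.shape)  # (20, 2) (20, 1)
```

The target is non-negative, which matches the `relu` activation used in the output layer above.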

#### Handwriting recognition (MNIST)

    - distinguish categories
    - softmax nonlinearity for probability distributions
    - categorical cross-entropy cost function
    - training/validation/test data
    - overfitting and some solutions
    
Input a 28x28 image = 784 gray values -> NN -> output the category classification 'one hot encoding'

Since we have 10 different handwritten digits (from 0 to 9), we have 10 output neurons: i.e.,

           Neuron responsible for digit (#): desired output - reality (actual output)
                     - 0 (0)    - reality -  0.1
                     - 0 (1)    - reality -  0
                     - 0 (2)    - reality -  0
                     - 0 (3)    - reality -  0
                     - 0 (4)    - reality -  0.1
    input: 6 ->  NN
                     - 0 (5)    - reality -  0.1
                     - 1 (6)    - reality -  0.7
                     - 0 (7)    - reality -  0
                     - 0 (8)    - reality -  0
                     - 0 (9)    - reality -  0
                     
* 'One-hot encoding' = only ONE neuron is activated (hot) and all the others are 0, i.e., 0000001000, etc.
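
A minimal NumPy sketch of one-hot encoding, assuming integer labels 0..9. The helper name `one_hot` is hypothetical:

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn a 1D sequence of integer labels (0..num_classes-1) into one-hot rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    # put a single 1 in each row, at the column given by the label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

print(one_hot([6])[0])  # the only 1 sits at position 6
```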

The input consists of the pixel values of the image (784 here); the output is the category that tells us what the image represents.  

**Probabilities should always be normalized: the sum of all the output neurons should be = 1** 
    
    -> the step from the last hidden layer to the output layer must make sure normalization is done
    -> to do so: last hidden layer to output >> non-linear function that depends on all values simultaneously!
    -> MULTIVARIABLE GENERALIZATION OF THE SIGMOID FUNCTION: SOFTMAX activation function

$\qquad \qquad \qquad \qquad \qquad \qquad f_j(z_1,z_2,...)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$
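
The softmax formula can be sketched directly in NumPy. Subtracting the maximum before exponentiating is an added numerical-stability step (an assumption of this sketch, not in the notes); it does not change the result because it cancels between numerator and denominator:

```python
import numpy as np

def softmax(z):
    """f_j(z) = exp(z_j) / sum_k exp(z_k), applied along the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = softmax([1.0, 2.0, 3.0])
print(probs, probs.sum())  # non-negative values that sum to 1
```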

In [None]:
net.add(Dense(10, activation='softmax'))

**Cost function**

For any probability distribution, the entropy is $S=-\sum_j p_j \ln p_j$ ($S$ is non-negative, and additive for factorizable distributions).


Categorical cross-entropy cost function: it compares two distributions, where the $y$ are probabilities: $C=-\sum_j y_j^{target} \ln y_j^{out}$, 

where $y_j^{target}=F_j(y^{in})$ is the desired 'one-hot' classification (in the handwritten-digit MNIST case).
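
As a small worked example, here is the cross-entropy for the digit-6 illustration above (target: one-hot at position 6; output: the 'reality' column). The clipping constant `eps` is an assumption of this sketch, added only to avoid `log(0)`:

```python
import numpy as np

def categorical_crossentropy(y_target, y_out, eps=1e-12):
    """C = -sum_j y_target_j * ln(y_out_j), summed over the last axis."""
    y_out = np.clip(y_out, eps, 1.0)  # guard against log(0)
    return -np.sum(y_target * np.log(y_out), axis=-1)

y_target = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=float)
y_out = np.array([0.1, 0, 0, 0, 0.1, 0.1, 0.7, 0, 0, 0])
cost = categorical_crossentropy(y_target, y_out)
print(cost)  # equals -ln(0.7): only the 'hot' neuron contributes
```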

An advantage of using this function is that its derivative doesn't get exponentially small when one neuron is very close to 1 and the others to 0.

$f_j(z_1,z_2,...)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$ -> $\frac{\partial}{\partial w}\ln f_j(z) = \frac{\partial z_j}{\partial w} - \frac{\sum_k \frac{\partial z_k}{\partial w}e^{z_k}}{\sum_k e^{z_k}}$

Conversely, for quadratic cost function: 

$f_j(z_1,z_2,...)=\frac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}$ -> $\frac{\partial}{\partial w}\sum_j(f_j(z)-y_j^{target})^2=2\sum_j(f_j(z)-y_j^{target})\frac{\partial f_j(z)}{\partial w}$

Training may get stuck for a long time -> the slope $\frac{\partial f_j(z)}{\partial w}$ becomes exponentially small when the $z$ values are large (one output saturated near 1, the others near 0).
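
A numerical sketch of this difference, assuming a 2-neuron softmax output that is confidently wrong ($z=(10,0)$ while the target is the second neuron). For simplicity the gradients are taken with respect to $z$ rather than $w$; the factor $\partial z/\partial w$ is common to both cost functions. All names here are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def grad_cross_entropy(z, y_target):
    # d/dz_k of -sum_j y_j ln f_j(z) = f_k - y_k (for a normalized y_target)
    return softmax(z) - y_target

def grad_quadratic(z, y_target):
    # d/dz_k of sum_j (f_j - y_j)^2 = 2 sum_j (f_j - y_j) df_j/dz_k,
    # with df_j/dz_k = f_j (delta_jk - f_k)
    f = softmax(z)
    jacobian = np.diag(f) - np.outer(f, f)
    return 2.0 * jacobian.T @ (f - y_target)

z = np.array([10.0, 0.0])          # network is very sure about neuron 0 ...
y_target = np.array([0.0, 1.0])    # ... but the right answer is neuron 1
g_ce = np.linalg.norm(grad_cross_entropy(z, y_target))
g_quad = np.linalg.norm(grad_quadratic(z, y_target))
print(g_ce, g_quad)  # order 1 versus exponentially small
```

The cross-entropy gradient stays of order 1 even though the output is saturated, while the quadratic-cost gradient is suppressed by the factor $f_j(1-f_j)\sim e^{-10}$.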

In [None]:
net.compile(loss='categorical_crossentropy',optimizer=optimizers.SGD(lr=1.0),metrics=['categorical_accuracy'])

##### Training on MNIST images

    training_inputs  : array num_samples x numpixels
    training_results : array num_samples x 10 ('onehot')
    
**One epoch means training once on ALL 50000 training images, fed into the net in batches of size 100**; here we do 30 epochs. 

In [None]:
history=net.fit(training_inputs, training_results, batch_size=100, epochs=30)

*Accuracy during training (the fraction of images recognized correctly) may be very high (97%, i.e., <3% error). However, when we test, about 7% are labeled incorrectly -> assessing accuracy on the training set may yield overly optimistic results -> compare with samples that are not used for training (test set).*

    [VALIDATION SET](5000 images)*
    [ TRAINING SET ](45000 images)
    ----------------
    [   TEST SET   ](10000 images)

\* *Idea: use cross-validation within the training set to build the validation set.*

**IF THE VALIDATION ACCURACY STARTS TO DECREASE AS THE EPOCHS GO ON -> OVERFITTING: the NN memorizes the training samples -> it cannot generalize to unfamiliar data**

    - ALWAYS measure accuracy against validation data, independent of training data
    - Stop after reaching maximum in validation accuracy
    - Generate fresh training data by distorting existing images (transformations: rotations, scale up, etc.)
    - Dropout -> set to zero random neuron values during training, such that the network has to cope with that noise and never learns too much detail.
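
The dropout idea from the last bullet can be sketched in NumPy. Rescaling the surviving values by `1/(1-rate)` (so that the average activation is unchanged) is a common convention and an assumption of this sketch; the function name and signature are illustrative:

```python
import numpy as np

def dropout(values, rate, rng, training=True):
    """Set a random fraction `rate` (< 1) of the neuron values to zero during training."""
    if not training or rate == 0.0:
        return values                       # no noise at test time
    keep_mask = rng.random(values.shape) >= rate
    # rescale the survivors so the expected value stays the same
    return values * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))               # stand-in for the values of a hidden layer
noisy_hidden = dropout(hidden, rate=0.5, rng=rng)
print(np.mean(noisy_hidden == 0))           # roughly half of the values are zeroed
```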

## Convolutional networks:

Translational invariance -> different images with the same meaning (e.g., an image of a 9, but shifted)

    Convolutions:
$\qquad F^{new}(x)=\int K(x-x')F(x')dx'$ , where $K$ is the kernel (it only depends on the difference of coordinates) 
    
    Convolutional layer:
    
    For a kernel of size 3, we have three weights w1, w2 and w3, which define the filter. The SAME weight values are used for different neurons in the same layer, so we store only 3 weight values. This essentially scans the kernel over the original image: what we learn are the kernel weights. 

- Exploit translational invariance: features learned in one part of an image will be automatically recognized in different parts. 

- Drastic reduction of the number of weights stored: $N^2$ for a fully connected layer ($N$ being the size of the layer/image) versus $M$ for a convolutional layer ($M$ being the size of the kernel). 

- The number of weights is independent of the size of the image: lower memory consumption, improved speed. 
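
A 1D sketch of these points, assuming a hypothetical smoothing kernel of size 3: the same three weights are scanned over the whole input, and a shifted input simply gives a shifted output.

```python
import numpy as np

w = np.array([0.25, 0.5, 0.25])      # hypothetical kernel: only 3 weights are stored

signal = np.zeros(12)
signal[4] = 1.0                       # a 'feature' at position 4
shifted_signal = np.roll(signal, 3)   # the same feature, moved 3 pixels to the right

out = np.convolve(signal, w, mode='same')
out_shifted = np.convolve(shifted_signal, w, mode='same')

# The response to the shifted input is the shifted response
print(np.allclose(np.roll(out, 3), out_shifted))
print(w.size, 'weights instead of', signal.size**2, 'for a dense layer')
```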

    Define several filters -> smoothing, contours, etc. -> several channels too
    
    Once we want to output from convolutional to one-hot encoder -> dense()

For a 2D convolutional layer, the input is an NxN image with only 1 channel (not RGB, for instance)

About the channels: 
    For a 3-channeled MxM image -> conv -> 6 channels MxM image
    - In any output channel, each pixel receives input from KxK nearby pixels in each of the input channels (each of those input-channel pixel regions is weighted by a different filter); contributions from all the input channels are linearly superimposed. 
    
    * Keras takes care of this -> we only need to specify the # of channels.
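
What happens between the channels can be sketched in plain NumPy. This is an illustrative re-implementation (not how Keras is implemented internally), assuming an odd kernel size K, zero padding at the borders ('same'), no bias and no activation. The function name `conv2d_same` is hypothetical:

```python
import numpy as np

def conv2d_same(image, kernels):
    """image: (M, M, c_in); kernels: (K, K, c_in, c_out) -> output: (M, M, c_out)."""
    rows, cols, c_in = image.shape
    K, _, _, c_out = kernels.shape
    pad = K // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zeros at the borders
    out = np.zeros((rows, cols, c_out))
    for dy in range(K):
        for dx in range(K):
            patch = padded[dy:dy + rows, dx:dx + cols, :]     # shifted copy of the input
            # every (input channel, output channel) pair has its own weight;
            # the matrix product sums the contributions of all input channels
            out += patch @ kernels[dy, dx]
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))            # 3-channel 8x8 image
kernels = rng.random((5, 5, 3, 6))       # K=5 (assumed), 3 input -> 6 output channels
print(conv2d_same(image, kernels).shape) # (8, 8, 6)
```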

In [None]:
net.add(Conv2D(input_shape=(N,N,1),
               filters=20, # From 1 input image I get 20 -filters- > next layer is NxNx20
               kernel_size=[11,11], # Influence zone around a given pixel 
               activation='relu',
               padding='same')) # What to do at borders: force image size to remain the same

We can reduce the resolution by subsampling: "average pooling" or "max pooling"

In [None]:
net.add(AveragePooling2D(pool_size=8))

Enlarging the image size: repeats values

In [None]:
net.add(UpSampling2D(size=8))
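
What pooling and upsampling do to the pixel values can be sketched in NumPy for a single-channel image (illustrative helpers, assuming the image size is a multiple of the pool size):

```python
import numpy as np

def pool2d(image, pool_size, reducer=np.mean):
    """Replace each pool_size x pool_size block by one value (mean or max)."""
    n = image.shape[0] // pool_size
    m = image.shape[1] // pool_size
    blocks = image[:n * pool_size, :m * pool_size].reshape(n, pool_size, m, pool_size)
    return reducer(blocks, axis=(1, 3))

def upsample2d(image, size):
    """Enlarge the image by repeating every value size x size times."""
    return np.repeat(np.repeat(image, size, axis=0), size, axis=1)

image = np.arange(16, dtype=float).reshape(4, 4)
avg = pool2d(image, 2)                  # average pooling: 4x4 -> 2x2
mx = pool2d(image, 2, reducer=np.max)   # max pooling:     4x4 -> 2x2
big = upsample2d(avg, 2)                # back to 4x4, values repeated
print(avg)
print(mx)
print(big.shape)
```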

E.g., a handwriting conv net: 

    input 28x28 -> conv: 7 channels x 28x28 -> subsampling/4: 7x(7x7), output: dense(softmax)

In [None]:
# Initialize the conv NN; M=28 for 28x28 pixel images

def init_net_conv_simple():
    global net,M
    net = Sequential() # One layer after the other
    net.add(Conv2D(input_shape=(M,M,1), filters=7, # Convolutional layer for a 2D image 
                  kernel_size=[5,5], activation='relu', padding='same'))
    net.add(AveragePooling2D(pool_size=4)) # Subsample the image (lower resolution): each 4x4 block of pixels -> 1 pixel
    
    net.add(Flatten()) #MUST for transition to dense layer: from a matrix to a simple vector
    
    net.add(Dense(10,activation='softmax')) # Fully connected NN
    net.compile(loss='categorical_crossentropy', optimizer=optimizers.SGD(lr=1.0),metrics=['categorical_accuracy'])
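
As a quick check of the sizes in this net, the shapes and the number of trainable weights can be worked out by hand. The sketch below assumes that both the convolutional and the dense layer carry one bias per output channel/neuron (the Keras default):

```python
M = 28        # image size
filters = 7   # channels after the convolution
kernel = 5    # kernel size (5x5)
pool = 4      # subsampling factor
classes = 10  # output neurons

conv_params = kernel * kernel * 1 * filters + filters   # weights + biases
pooled_side = M // pool                                 # 28 -> 7
flat_size = pooled_side * pooled_side * filters         # 7 x (7x7) values after Flatten
dense_params = flat_size * classes + classes            # weights + biases

print(conv_params, flat_size, dense_params, conv_params + dense_params)
# 182 343 3440 3622
```

Almost all of the weights sit in the final dense layer; the convolutional layer itself needs only 182.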

### Gabor filters: 

* 2D Gaussian function times a sine function
* Encodes orientation and spatial frequency
* Useful for feature extraction in images (for instance, detect lines or contours of a certain orientation)
* Believed to be a good approximation to the first stage of image processing in the visual cortex

**(!) You only get a signal if the boundary is oriented in the same manner as the filter. If the boundary is orthogonal to the orientation of the function, the contributions cancel.**
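
A small NumPy sketch of this orientation selectivity. The filter size, width `sigma` and `wavelength` are hypothetical values chosen for illustration; the filter oscillates along x, so it should respond to a brightness jump along x and give (numerically) zero for a jump along y:

```python
import numpy as np

def gabor(size=21, sigma=3.0, wavelength=12.0, theta=0.0):
    """2D Gaussian envelope times a sine wave oscillating along direction theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.sin(2.0 * np.pi * x_rot / wavelength)

half = 10
y, x = np.mgrid[-half:half + 1, -half:half + 1]
vertical_edge = (x > 0).astype(float)     # brightness jumps along x
horizontal_edge = (y > 0).astype(float)   # brightness jumps along y

g = gabor(theta=0.0)                      # oscillates along x
resp_v = np.sum(g * vertical_edge)        # matching orientation: clear signal
resp_h = np.sum(g * horizontal_edge)      # orthogonal orientation: cancellation
print(resp_v, resp_h)
```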

E.g., a handwriting 2-stage conv net: 

    input 28x28 -> conv: 8 channels x 28x28 -> subsampling/2: 8x(14x14) -> conv: 8x(14x14) -> subsampling 8x(7x7), output: dense(softmax) 
    
$\approx 90\%$ error -> **Same NN with ADAPTIVE LEARNING RATE (e.g., ADAM) -> $\approx 1.7\%$ error**

## UNSUPERVISED LEARNING:

    Large datasets without labels. Unsupervised: you don't provide the right answers. 
    
    - Extract the crucial features of a large class of training samples without any guidance :)
    
**(1) Autoencoder**: the goal is to reproduce the input at the output. To do so, feed information through some small intermediate layer - **bottleneck** -. The net must learn to extract the crucial features of the class of input images. 

    - Autoencoders are useful for pretraining
    - But detailed reconstruction of the input may not be the best method to learn important abstract features
    - One may use the compressed representation for visualizing higher-level features of the data

      input - bottleneck - output
      |____|              |______|
      encoder             decoder
    downsampling         upsampling 
    
    > Input - Convolutional NN structure - Downsampling - Bottleneck - Upsampling - Output
*-> We first reduce the resolution until we have very small images (bottleneck) and then we upsample again.*
     
    Learning process (compressing): a smart way of encoding all the information in the input image in much less space, while still being able to extract it again (decoder). 
    
    The usual images we're dealing with have large-scale structure and are not noisy at all. Random images (noisy inputs) don't work. If we feed in a lot of cat pictures, the autoencoder will learn the typical (and so the important) features of cats very well. 
            - Access to large datasets
            - Generate those inputs algorithmically

*E.g.:* (20 channels in all intermediate steps)

    input -> conv -> subsamp./4 -> conv -> subsamp./4 -> conv -> upsamp.x4 -> conv -> upsamp.x4 -> conv -> output

    (32x32)                                          -bottleneck-                                         (32x32)
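
The image sizes along this chain can be traced with a few lines of Python, assuming that every convolution keeps the image size (i.e., `padding='same'`, an assumption of this sketch). The step names are illustrative labels:

```python
def trace_sizes(input_size=32,
                steps=("conv", "down4", "conv", "down4", "conv",
                       "up4", "conv", "up4", "conv")):
    """Return the image side length after each step of the autoencoder."""
    sizes = [input_size]
    for step in steps:
        if step == "down4":
            sizes.append(sizes[-1] // 4)   # subsampling by 4
        elif step == "up4":
            sizes.append(sizes[-1] * 4)    # upsampling by 4
        else:
            sizes.append(sizes[-1])        # 'same'-padded conv keeps the size
    return sizes

sizes = trace_sizes()
print(sizes)  # [32, 32, 8, 8, 2, 2, 8, 8, 32, 32]
```

With 20 channels, the 2x2 bottleneck holds 80 values, compared to the 1024 pixel values of the 32x32 input.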
    
**(1.1) Denoising autoencoder**: the input images are noisy, but the cost function compares the output with a clean image -> you force the autoencoder to find a way to denoise the input image. Importantly, this only works when there is some structure in the images (not random noise, but structured/simulated images). 
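
A sketch of how such training pairs could be generated algorithmically, assuming (for illustration only) that the structured images are filled circles with random position and radius and that the noise is Gaussian. All names and parameter values are hypothetical:

```python
import numpy as np

def make_denoising_batch(batchsize=10, size=32, noise=0.3, rng=None):
    """Return (noisy inputs, clean targets), both of shape batchsize x size x size x 1."""
    rng = np.random.default_rng() if rng is None else rng
    y, x = np.mgrid[0:size, 0:size]
    # one random circle per sample: structured, algorithmically generated images
    cx = rng.uniform(0.25 * size, 0.75 * size, size=(batchsize, 1, 1))
    cy = rng.uniform(0.25 * size, 0.75 * size, size=(batchsize, 1, 1))
    r = rng.uniform(0.1 * size, 0.25 * size, size=(batchsize, 1, 1))
    clean = ((x - cx)**2 + (y - cy)**2 < r**2).astype(float)
    noisy = clean + noise * rng.normal(size=clean.shape)
    # trailing axis = 1 channel, as expected by a Conv2D input of shape (size, size, 1)
    return noisy[..., np.newaxis], clean[..., np.newaxis]

noisy, clean = make_denoising_batch(rng=np.random.default_rng(0))
print(noisy.shape, clean.shape)  # (10, 32, 32, 1) (10, 32, 32, 1)
```

The noisy array plays the role of the network input and the clean array the role of the target in the cost function.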

**(1.2) Stacking autoencoders**: 

(1) Whenever two layers are connected, they're connected by weights -> in the convolutional case these would be the entries of the filter. Train 2 layers and fix the weights obtained in the first part of the training. 

(2) Second part, add two layers inside so that you keep the fixed layers up and down and in the middle you train the new ones. Fix their weights. Introduce new layers, etc. 

    Greedy layer-wise training:

                    FIXED
      _____         _____
      TRAIN         TRAIN
      ----- ----->  -----  ----> etc.
      TRAIN         TRAIN
      _____         _____
                    FIXED
                    
*Once we get something that works reasonably well, we can FINE-TUNE the weights by training all of them together in the large multi-layer NN.*

**(1.3) Other uses**:

1. Pre-training: train an autoencoder so that output=input
2. Use the encoder part of the autoencoder to build a classifier (trained via SUPERVISED LEARNING):
            input -> autoencoder part (lower half; before bottleneck) -> dense(softmax) -> output = category

**(1.4) Sparse autoencoder**: maybe we don't know how large we should choose the bottleneck -> if we pick a certain size, the net may nominally make use of all the neurons, but maybe it's wasting resources -> add a term to the **cost function** that depends on the **values of the neurons in this bottleneck layer** (the cost function will be higher if neurons have non-zero values and lower if neurons have zero values), so that there's a reward when the net finds a way of encoding the information in fewer neurons! 

    - Force most neurons in the inner layer to be 0 or close, most of the time, by modifying the cost function.
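
One possible form of such a modified cost function, sketched in NumPy. Using the mean absolute value of the bottleneck neurons (an L1 penalty) and the weight `sparsity_weight` are assumptions of this sketch; the notes do not specify the penalty:

```python
import numpy as np

def sparse_autoencoder_cost(y_in, y_out, bottleneck, sparsity_weight=0.01):
    """Quadratic reconstruction error plus a penalty on non-zero bottleneck values."""
    reconstruction = np.mean((y_out - y_in)**2)
    sparsity_penalty = np.mean(np.abs(bottleneck))   # assumed L1-type penalty
    return reconstruction + sparsity_weight * sparsity_penalty

y = np.ones((4, 8))                      # pretend the reconstruction is perfect
dense_code = np.ones((4, 5))             # all 5 bottleneck neurons active
sparse_code = np.zeros((4, 5))
sparse_code[:, 0] = 1.0                  # only 1 of 5 bottleneck neurons active

print(sparse_autoencoder_cost(y, y, dense_code))    # higher cost
print(sparse_autoencoder_cost(y, y, sparse_code))   # lower cost: sparsity is rewarded
```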





### (+) PURELY LINEAR AUTOENCODER: Principal Components Analysis

- Which weights will it select? 

- The number of neurons in the hidden layer is SMALLER than the number of input/output neurons. Each hidden neuron can be understood as the projection of the input onto some vector (determined by the weights belonging to that neuron). 

- What's the best compression that such a LINEAR autoencoder can choose? 

                        ______
                        LINEAR (dense)
                        ------              no non-linearity f(z)!!
                        LINEAR (dense)
                        ______

In order to calculate the value of one neuron in the inner layer (---) we actually take a **linearly weighted superposition of all the input neurons**. For a dense layer: for another neuron I can have a different linear combination of the input neurons. 
    
    - Input neurons = one long vector of values.
    - To get a weighted superposition of them -> I can obtain the weighted superposition of the entries of a vector by 
    taking the scalar product with another vector: vector of weights. 
    - We're looking for scalar product operations! 
    
**Which vectors are optimal for taking the scalar products with the input vector?** I.e., you give many different input vectors (images, etc.) -> choice of PROJECTING them on a few suitably chosen basis vectors (which ones should I pick in order to extract the maximum information?). 

(1) Set $w_{jk}=\langle v_j|k\rangle$ for the **input-hidden weights**; $|v_j\rangle$ is the **vector belonging to hidden neuron** $j$ and $|k\rangle$ is the unit vector of a particular input neuron $k$ -> $w_{jk}$ is the projection of $|k\rangle$ onto $|v_j\rangle$. 

*The neuron values of the hidden layer will be the amplitudes of the input vector, expressed in the basis of these $v_j$ vectors.* 

                ojoo <- hidden layer   | o = neuron, j, k= index of a certain neuron
                ooko <- input 
                
(2) Set $w_{jk}=\langle k|v_j\rangle$ for the **hidden-output weights**. 

                ojoo <- output  | o = neuron, j, k= index of a certain neuron
                ooko <- hidden
                
(3) *Assuming I've normalized $v_j$*, set the restricted projector: $\hat{P}=\sum_{j=1}^{M}|v_j\rangle\langle v_j|$, where $M$ is the number of neurons in the hidden layer, which is smaller than the dimension of the Hilbert space, and the vectors form an **orthonormal** set (that we still have to choose in a smart way). 

(4) The NN calculates $\hat{P}|\psi\rangle$; the scalar product $\langle v_j|\psi\rangle$ gives the weight $w_j$ with which the vector $v_j$ occurs in the output. 

\begin{equation}
 \hat{P}|\psi\rangle = \sum_{j=1}^{M}|v_j\rangle\langle v_j|\psi\rangle =  \sum_{j=1}^{M}w_{j}|v_j\rangle
\end{equation}

- Tries to reproduce a vector (input) as well as possible with a **restricted basis set**. 

##### Assuming a normalized input vector: 

We want $|\psi\rangle\approx \hat{P}|\psi\rangle$ for all the typical input vectors, assuming that the average has already been subtracted ($\langle|\psi\rangle\rangle=0$). 

  A. My NN produces: $\hat{P}|\psi\rangle$
  
  B. Desired output: $|\psi\rangle$

  C. Cost function: quadratic deviation: $\| desired - NN(output) \|^2$
        
Choose the vectors $v_j$ to *minimize the average quadratic deviation* (**my cost function**) $\langle \| |\psi\rangle - \hat{P}|\psi\rangle \|^2 \rangle$, averaged over all the input vectors $|\psi\rangle$:

\begin{equation}
 \langle \| |\psi\rangle - \hat{P}|\psi\rangle \|^2 \rangle = \langle \langle\psi|\psi\rangle - \langle\psi|\hat{P}|\psi\rangle \rangle
\end{equation}

(this uses $\hat{P}^2=\hat{P}=\hat{P}^\dagger$). The task is to minimize $\langle \langle\psi|\psi\rangle - \langle\psi|\hat{P}|\psi\rangle \rangle$ by choosing $\hat{P}$ (the vectors $v_j$) in a suitable manner. 

**->** Consider the matrix: 

\begin{equation}
 \hat{\rho}=\langle |\psi\rangle\langle\psi| \rangle=\sum_j p_j|\psi^{(j)}\rangle\langle\psi^{(j)}| \qquad \qquad \rho_{mn}=\langle\psi_m\psi_{n}^*\rangle
\end{equation}

$\rho_{mn}=\langle\psi_m\psi_{n}^*\rangle$ is a statistical correlator (a density matrix), with $p_j$ the probability of having a particular input vector $|\psi^{(j)}\rangle$. This **COVARIANCE MATRIX** (*density matrix*) fully characterizes the **statistical properties** of the ensemble of input vectors, in terms of quadratic combinations of the pixel values (in our case). Diagonalize this matrix and keep the $M$ eigenvectors with the **LARGEST EIGENVALUES** -> these form the desired set of $v_j$. 
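
Why the largest eigenvalues: in terms of $\hat{\rho}$ the cost function reads $\mathrm{tr}\,\hat{\rho} - \sum_{j=1}^{M}\langle v_j|\hat{\rho}|v_j\rangle$, so the cost is smallest when the $v_j$ span the directions with the largest eigenvalues, and the remaining cost is the sum of the discarded eigenvalues. A NumPy sketch that checks this numerically for real-valued vectors; the random data set is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 input vectors with 6 components, most variance in 2 directions
latent = rng.normal(size=(500, 2)) * np.array([3.0, 2.0])
mixing = rng.normal(size=(2, 6))
psi = latent @ mixing + 0.1 * rng.normal(size=(500, 6))
psi = psi - psi.mean(axis=0)                      # subtract the average input vector

rho = psi.T @ psi / len(psi)                      # rho_mn = <psi_m psi_n> (real case)
eigenvalues, eigenvectors = np.linalg.eigh(rho)   # eigenvalues in ascending order

M = 2
v = eigenvectors[:, -M:]                          # M eigenvectors with largest eigenvalues
P = v @ v.T                                       # restricted projector sum_j |v_j><v_j|

residual = psi - psi @ P                          # |psi> - P|psi> for every sample
cost = np.mean(np.sum(residual**2, axis=1))       # average quadratic deviation
print(cost, eigenvalues[:-M].sum())               # the two numbers agree
```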

