<div style="text-align:right"> Practical session 1 <br/> Sébastien Harispe</div>
<h1><center>Keras Introduction: <br/> Practical introduction to Artificial Neural Networks <br/> IMT Mines Alès</center></h1>

<center>

![picture](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Keras_logo.svg/200px-Keras_logo.svg.png)

</center>


## Overview

In this short tutorial we will cover the basics of [Keras](https://keras.io/), including: 
* Generalities about the philosophy and basic constructs of the library.
* How to create, train, evaluate, save and use a model.

The main sources of information used to prepare this tutorial session are: 
* https://keras.io
* https://www.tensorflow.org
* https://www.tensorflow.org/tutorials/keras

We will further use ANN to refer to an Artificial Neural Network.


### Keras

**Keras** is a high-level API that can be used to easily build and train ANNs such as those used in deep learning models.  
It is used for fast prototyping, quick research testing, and production, with three key advantages:
* *User friendly* It has a simple, consistent interface optimized for common use cases, and provides clear and actionable feedback for user errors.
* *Modular and composable* Keras models are made by connecting configurable building blocks together, with few restrictions.
* *Easy to extend* Write custom building blocks to express new ideas for research. Create new layers, loss functions, and develop state-of-the-art models.

Keras was originally developed by François Chollet (Frenchy \o/), now a Google engineer. 

Keras is capable of running on top of several platforms dedicated to neural networks and machine learning. In this practical session we will use the TensorFlow backend. Recent developments have tightly linked Keras to TensorFlow.

**TensorFlow** is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google ([source](https://en.wikipedia.org/wiki/TensorFlow)).

<center>

![picture](https://upload.wikimedia.org/wikipedia/commons/thumb/1/11/TensorFlowLogo.svg/200px-TensorFlowLogo.svg.png)

</center>

TensorFlow's implementation of the Keras API specification is exposed in the `tf.keras` module. This is a high-level API to build and train models that makes TensorFlow easier to use without sacrificing flexibility and performance.
The main competitor of TensorFlow is [PyTorch](https://pytorch.org/) (which, at the time this tutorial was written, did not support the Keras API).



## Prerequisites

* Basics of machine learning 
* Basics of ANNs (Artificial Neural Networks): you must be familiar with notions such as MLP, ReLU, cross entropy, softmax, and dropout. 

We provide a quick refresher below. 

---

### Artificial Neuron

Artificial neurons are the basic computational units of ANNs.
An artificial neuron can generally be viewed as a simple nonlinear function that processes input values to produce a single output. Note that the term input value does not necessarily refer to the values of the input the predictor must process: input values are just numerical values and may be produced by other artificial neurons. Indeed, artificial neurons are structured into layers that together form an ANN. 

In its simplest form, an artificial neuron is nothing but an affine function composed with a nonlinear activation function.  

For a given artificial neuron $k$, let there be $m$ input values $x_1$ through $x_m$ and weights $w_{k1}$ through $w_{km}$. We also usually consider an additional input $x_0=1$ to model a bias input with $w_{k0} = b_k$. 

The output of the $k$-th neuron is: 

$y_k = \phi(\sum_{j=0}^{m} w_{kj} x_j)$

$\phi$ is a defined activation function such as ReLU (see below). 

---

### Rectified Linear Unit (ReLU)

ReLU is a popular [activation function](https://en.wikipedia.org/wiki/Activation_function) that can be used while defining an artificial neuron. 

$\phi(x) = \max(0,x)$

![picture](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Rectifier_and_softplus_functions.svg/500px-Rectifier_and_softplus_functions.svg.png)

Using the ReLU activation function, the neuron defined above would therefore output: 

$y_k = \max(0, \sum_{j=0}^{m} w_{kj} x_j)$
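
The formula above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the numerical values are assumptions chosen for the example, and no library is required.

```python
def relu(value):
    """ReLU activation: max(0, value)."""
    return max(0.0, value)


def neuron_output(weights, inputs, activation=relu):
    """Output of one artificial neuron.

    weights[0] is the bias b_k, paired with the constant input x_0 = 1.
    """
    extended_inputs = [1.0] + list(inputs)  # prepend the bias input x_0 = 1
    weighted_sum = sum(w * x for w, x in zip(weights, extended_inputs))
    return activation(weighted_sum)


# Illustrative values (assumptions, not taken from the tutorial)
weights = [0.5, 1.0, -2.0]  # [b_k, w_k1, w_k2]

print(neuron_output(weights, [3.0, 1.0]))  # 0.5 + 3.0 - 2.0 = 1.5
print(neuron_output(weights, [1.0, 2.0]))  # 0.5 + 1.0 - 4.0 = -2.5, clipped to 0.0
```

The second call shows the effect of ReLU: a negative weighted sum yields an output of 0.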

---

### Multi Layer Perceptron (MLP) 

An MLP is a simple ANN corresponding to one or several layers of neurons.

Each neuron of the first layer processes the values of the input to analyse. Neurons of a layer $l > 1$ take as inputs the outputs of the neurons of layer $l-1$. 

By stacking layers in such a way we can transform an input vector $x \in \mathbb{R}^m$ into an output vector in $\mathbb{R}^d$. Such an output vector may directly answer the requirement for a regression task, or be used as input of a softmax function to obtain a probability distribution suitable for a classification task, cf. softmax function below.    

In an MLP, the computations that produce the prediction go from the input values to the output values without recurrent connections between neurons. We say that MLPs are feedforward ANNs, as opposed for instance to recurrent ANNs (RNNs). 
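
As a minimal sketch of this feedforward computation, the plain-Python code below chains two fully connected layers. The network shape (2 inputs, 2 hidden neurons, 1 output), the helper names, and the hand-picked weights are all assumptions made for the illustration.

```python
def relu(value):
    """ReLU activation: max(0, value)."""
    return max(0.0, value)


def identity(value):
    """Linear output, e.g. for a regression task."""
    return value


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: one weight row and one bias per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the input through each layer in order (no recurrent connection)."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values


# Hypothetical 2 -> 2 -> 1 network with hand-picked weights
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                 # output layer
]

print(mlp_forward([3.0, 1.0], layers))  # [5.5]
```

The hidden layer outputs `[2.0, 1.0]`, and the output neuron computes `2.0 * 2.0 + 1.0 * 1.0 + 0.5 = 5.5`.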


*Note*: do not be disturbed by the term Perceptron. It refers to a famous early ML algorithm. In the context of ANNs, the term perceptron refers to a specific type of artificial neuron based on a specific activation function (the Heaviside step function). Such artificial neurons are no longer used in recent ML models, and people use the term MLP to refer to ANNs composed of layers of any type of artificial neuron. 


Play with a neural network architecture: https://playground.tensorflow.org/

---

### Softmax function 

The softmax function can be used to transform a vector of real values $ \mathbf{z} \in \mathbb{R}^{|\mathcal{C}|}$ into a vector $\mathbf{z'} \in [0,1]^{|\mathcal{C}|}$ such that $\sum_{i=1}^{|\mathcal{C}|} \mathbf{z'}_i = 1$, with $\mathbf{z'}_i = \text{softmax}(\mathbf{z})_i$:

$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{|\mathcal{C}|} e^{z_j}} \text{ for } i = 1, \dotsc , |\mathcal{C}| \text{ and } \mathbf z=(z_1,\dotsc,z_{|\mathcal{C}|}) \in \mathbb{R}^{|\mathcal{C}|}$

The softmax function is often used as the last step of a neural network, to normalize its output and obtain a probability distribution over the output classes. 

Considering a multiclass prediction setting with a set of classes $\mathcal{C}$, the modelling is generally the following:  
* A network predicts a vector $\mathbf{z} \in \mathbb{R}^{|\mathcal{C}|}$.
* $\mathbf{z}$ is converted using the softmax into a probability distribution $\mathbf{z'} \in [0,1]^{|\mathcal{C}|}$ with $\sum_{i=1}^{|\mathcal{C}|} \mathbf{z'}_i = 1$. 
* The final prediction is then based on a decision rule exploiting $\mathbf{z'}$, e.g. in a single-label setting the prediction is defined as $\hat{y} := \text{argmax}_{i \in \{1,\ldots,|\mathcal{C}|\}} \mathbf{z'}_i$



In practice, the softmax can be seen as a soft [argmax](https://en.wikipedia.org/wiki/Arg_max) (differentiable argmax).
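
The modelling steps above can be sketched in plain Python as follows. The scores are illustrative assumptions, and the subtraction of the maximum score is a common numerical precaution that is not part of the formula (it leaves the result unchanged). Note that Python indices start at 0, whereas the formulas index classes from 1.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    highest = max(scores)
    # Shifting by the maximum does not change the result but avoids overflow
    exponentials = [math.exp(s - highest) for s in scores]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Illustrative network output z for 3 classes (assumed values)
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)

# Decision rule in a single-label setting: pick the most probable class
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(probabilities)       # three values in [0, 1]
print(sum(probabilities))  # sums to 1 up to rounding
print(predicted_class)     # 0
```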

---

### Cross entropy

The cross entropy is often used as the loss when training an ANN for classification.

First recall the log loss considered to evaluate a probabilistic binary classifier. 

For a given input observation, the expected label is $y \in \{0,1\}$ (the label is true or false). 
The classifier computes the probability $p \in [0,1]$ that the input carries the true label: $p = \hat{\mathbb{P}}[y=1]$.

The loss for this entry can be defined by: 

$l(p,y) = -\left(y \log(p) + (1-y)\log(1-p)\right)$

In a *similar fashion*, the cross entropy can be used to consider multiclass settings. It can be used to compare two probability distributions $p$ and $q$ over the same underlying set of events [(source)](https://en.wikipedia.org/wiki/Cross_entropy). Considering two discrete probability distributions $p$ and $q$ over a support $\mathcal{X}$ the cross entropy $H(p,q) \in [0,+\infty]$ is defined by:

$H(p,q)=-\sum _{x\in {\mathcal {X}}}p(x)\,\log q(x)$. 

With $\mathbf{y} \in [0,1]^{|\mathcal{C}|}$ the expected distribution and $\hat{\mathbf{y}} \in [0,1]^{|\mathcal{C}|}$ the predicted one, this gives:

$H(\mathbf{y},\hat{\mathbf{y}}) = -\sum _{i\in \{1,\ldots,|\mathcal{C}|\}} \mathbf{y}_i\,\log \hat{\mathbf{y}}_i$.
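
A small plain-Python sketch may help to see how these two losses relate. The probability values are assumptions chosen for the illustration, and terms with $\mathbf{y}_i = 0$ are skipped since they contribute nothing to the sum.

```python
import math


def log_loss(p, y):
    """Binary log loss for a predicted probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def cross_entropy(expected, predicted):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i), skipping terms where y_i = 0."""
    return -sum(y * math.log(q) for y, q in zip(expected, predicted) if y > 0)


# Multiclass case: the expected distribution puts all the mass on the last class
expected = [0.0, 0.0, 1.0]
confident = [0.05, 0.05, 0.90]  # assumed prediction, close to the expectation
hesitant = [0.30, 0.30, 0.40]   # assumed prediction, far from the expectation

print(cross_entropy(expected, confident))  # -log(0.9): small loss
print(cross_entropy(expected, hesitant))   # -log(0.4): larger loss

# Binary case: the log loss is the cross entropy over the two outcomes (y=1, y=0)
p, y = 0.8, 1
print(log_loss(p, y))
print(cross_entropy([y, 1 - y], [p, 1 - p]))  # same value as the log loss
```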

---

### Dropout

Dropout is a regularization technique for reducing overfitting in ANNs by preventing complex co-adaptations on training data (it is one regularization strategy among others, [source](https://en.wikipedia.org/wiki/Dilution_(neural_networks))). It consists in randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network.
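
The idea can be sketched in plain Python as below. This is a simplified illustration, not the Keras implementation: the function name and its parameters are assumptions. It follows the "inverted dropout" convention, which is also the behaviour documented for the Keras `Dropout` layer: kept units are scaled by `1 / (1 - rate)` during training, and the layer does nothing at inference time.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Randomly zero each value with probability `rate` during training.

    Kept values are scaled by 1 / (1 - rate) so that the expected sum
    of the outputs is unchanged. At inference time, values pass through.
    """
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # seeded generator so that runs are repeatable
activations = [1.0] * 10

print(dropout(activations, rate=0.2, rng=rng))         # some 0.0, others about 1.25
print(dropout(activations, rate=0.2, training=False))  # unchanged at inference
```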

---

# Keras in Action


We will now present some examples and discuss interesting aspects of the API.

You'll then be asked to refer to more advanced examples presented in the official documentation: https://www.tensorflow.org/tutorials

The version of TensorFlow used in this tutorial is:

In [None]:
import tensorflow as tf
tf.__version__

'2.6.0'

### Handwritten digit classification

Let's consider a first example in which we want to classify images using the MNIST dataset.

MNIST is a dataset of tens of thousands of 28x28 grayscale images of the 10 handwritten digits. Several example images are shown below. 

https://keras.io/api/datasets/mnist/

![picture](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

Given a 28x28 grayscale image representing a handwritten digit, we are interested in predicting the written digit. 

Keep in mind that we are looking for a predictor:

$\hat{f} : [0,255]^{28 \times 28} \rightarrow \mathcal{C} = \{0,\ldots, 9\} \subset \mathbb{N}_0$

Instead of directly predicting the output class of an input image, we will predict the probability that an input refers to a class. The output of the model will therefore be a probability distribution over $\mathcal{C}$. 

Using such a modelling, the cross entropy loss will be used to train our model.

We therefore consider:

$\hat{f} : [0,255]^{28 \times 28} \rightarrow [0,1]^{|\mathcal{C}|}$ where, for all $x \in [0,255]^{28 \times 28}$, $\hat{f}(x)$ is a valid probability distribution.

**Code analysis**

Keras is an intuitive API.

Are you able to understand the source code below?

Run it to ensure your installation is working. 



In [None]:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(f"size training set {len(x_train)}")
print(f"size test set {len(x_test)}")
print(f"dimension {len(x_train[0])} x {len(x_train[0][0])}")

x_train, x_test = x_train / 255.0, x_test / 255.0 # normalizing dataset


mnist_model = Sequential([
  Flatten(input_shape=(28, 28)),
  Dense(128, activation='relu'),
  Dense(10, activation='softmax')
])

mnist_model.summary()

mnist_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

mnist_model.fit(x_train, y_train, epochs=5)

print("\nEvaluation")
mnist_model.evaluate(x_test, y_test)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
size training set 60000
size test set 10000
dimension 28 x 28
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Evaluation


[0.07762255519628525, 0.9763000011444092]

<font color='red'>Explained version below : try to understand it by yourself first!</font>

In [None]:
# Explained version
# tensorflow.keras.datasets contains several datasets see https://keras.io/api/datasets/
from tensorflow.keras.datasets import mnist 
# Sequential enables plain stack of layers where each layer has exactly one input tensor and one output tensor
from tensorflow.keras.models import Sequential 
# the layers we will use 
from tensorflow.keras.layers import Flatten, Dense, Dropout 

# Load the dataset retrieving the train and test splits 
# The dataset will be automatically downloaded from a remote server
# In this specific case we are using MNIST
# Grayscale 28 x 28 images in a multiclass setting
# the dataset already contains train and test splits
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(f"size training set {len(x_train)}")
print(f"size test set {len(x_test)}")
print(f"dimension {len(x_train[0])} x {len(x_train[0][0])}")
print(y_train[0])

x_train, x_test = x_train / 255.0, x_test / 255.0 # normalize grayscale image (max = 255)

# We now create our model: a feedforward network (MLP) made of a stack of layers.
# Note that the shape of the input is 28 x 28.
# The layers are applied sequentially; with Sequential we pass them as a list.
# 4 layers are defined:
# [1] Flatten: flattens the 28 x 28 input matrix into a vector of R^784 (784 = 28 x 28).
#     The input shape is only specified for the first layer (output dimensions and inner layer inputs are inferred).
# [2] Dense: a fully connected layer of 128 neurons, each using the ReLU activation function.
# [3] Dropout: applied between the two dense layers (layers 2 and 4); it is also expressed as a layer.
# [4] Dense: the last layer produces the probability distribution.
#     Technically, we apply an affine map from R^128 to R^10, followed by softmax,
#     to obtain a valid probability distribution.
mnist_model = Sequential([
  Flatten(input_shape=(28, 28)),
  Dense(128, activation='relu'),
  Dropout(0.2),
  Dense(10, activation='softmax')
])

# The way an input is processed is now defined.
# Let's look at what our network looks like.

mnist_model.summary()

# Model: "sequential_1"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# flatten_1 (Flatten)          (None, 784)               0         
# _________________________________________________________________
# dense_2 (Dense)              (None, 128)               100480 = 784 x 128 + 128 (bias) 
# _________________________________________________________________
# dropout_1 (Dropout)          (None, 128)               0         
# _________________________________________________________________
# dense_3 (Dense)              (None, 10)                1290  = 128 x 10 + 10 (bias)     
# =================================================================
# Total params: 101,770
# Trainable params: 101,770
# Non-trainable params: 0


# We now have to define how to train our model
# We will train our model with regard to the cross entropy
# using the adam optimizer (that will use the gradient to optimize the parameters)
# We also specify some metrics to compute (in addition to the loss)
mnist_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Training will be performed using backpropagation 
# taking advantage of the gradient computed
# from the partial derivative of the error (cross entropy) with regard to the 101,770 parameters
# The training set will be used 5 times (5 epochs will be performed)
mnist_model.fit(x_train, y_train, epochs=5)

print("\nEvaluation")
mnist_model.evaluate(x_test, y_test)

size training set 60000
size test set 10000
dimension 28 x 28
5
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               100480    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
 367/1875 [====>.........................] - ETA: 3s - loss: 0.1672 - accuracy: 0.9505

KeyboardInterrupt: ignored
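
The parameter counts reported by `summary()` above can be checked by hand: a dense layer has one weight per (input, neuron) pair plus one bias per neuron, while `Flatten` and `Dropout` have no trainable parameters. The helper name below is an assumption introduced for this check.

```python
def dense_param_count(input_dim, units):
    """Number of trainable parameters of a dense layer: weights plus biases."""
    return input_dim * units + units


flattened_dim = 28 * 28                                 # 784 values per image
hidden_params = dense_param_count(flattened_dim, 128)   # first dense layer
output_params = dense_param_count(128, 10)              # last dense layer

print(hidden_params)                  # 100480
print(output_params)                  # 1290
print(hidden_params + output_params)  # 101770
```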

**Disclaimer** 
<font color='red'>
This example aims at introducing you to a simple model. It is very far from the state-of-the-art models used today for image classification. Do not use a simple MLP for standard image processing tasks (see CNNs for instance). The advanced Keras tutorials show you how to use state-of-the-art models.
</font>

## Keras Main notions

### Datasets

Keras contains numerous datasets that can be used to test existing or new models on a variety of data types. 

More information can be found in https://keras.io/datasets/

Examples of datasets:
* CIFAR10 small image classification. Dataset of 50,000 32x32 color training images, labeled over 10 categories, and 10,000 test images.
* CIFAR100 small image classification. Dataset of 50,000 32x32 color training images, labeled over 100 categories, and 10,000 test images.
* IMDB Movie reviews sentiment classification. Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative).
* Reuters newswire topics classification. Dataset of 11,228 newswires from Reuters, labeled over 46 topics.
* MNIST database of handwritten digits (0 to 9). Dataset of 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.

Datasets can simply be loaded from `tensorflow.keras.datasets`.

Those datasets can be very useful for testing existing architectures. 

**Note**: you can also easily load your dataset from your favorite NumPy structures: https://www.tensorflow.org/tutorials/load_data/numpy

Considering that your training examples and associated labels are stored in NumPy arrays, you can simply do: 

`train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))`

Note however that this is not even required for most use cases dealing with small datasets, as you can `fit` a model directly from basic data structures (NumPy arrays, lists): https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit 

### Models

Keras offers various types of models. A model groups layers into an object with training and inference features. The various models are defined in `tensorflow.keras.models`. 

A `Sequential` model can be used to represent a linear stack of layers: https://keras.io/getting-started/sequential-model-guide/

### Layers 

Layers define a process applied to an input to produce an output. 
They are the building blocks of the neural networks we will define. 
In Keras, layers are also used to apply processing steps that do not directly define the architecture of the network, e.g. Dropout.

Examples of layers: 
* Dense: densely connected layer, i.e. each input dimension is connected to every neuron of the layer. 
* Flatten: flattens a tensor into a vector (1D tensor).
* Dropout: randomly sets some unit values to 0 during training (regularization).

Layers are explained in https://keras.io/api/layers/.

### Model training and evaluation 

Finally, the model can be trained and evaluated in a scikit-learn fashion.
The traditional steps are: 
1. Compile
2. Fit
3. Evaluate
4. Predict

#### Model Compilation

A model is then compiled by specifying: 
* The optimizer used to train the network
* The loss function to minimize
* Metrics that have to be evaluated

In [None]:
mnist_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

`tf.keras.Model.compile` takes three important arguments:
* `optimizer`: This object specifies the training procedure. Pass it optimizer instances from the `tf.keras.optimizers` module, such as `tf.keras.optimizers.Adam` or `tf.keras.optimizers.SGD`. If you just want to use the default parameters, you can also specify optimizers via strings, such as 'adam' or 'sgd'.
* `loss`: The function to minimize during optimization. Common choices include mean square error (`mse`), `categorical_crossentropy`, and `binary_crossentropy`. Loss functions are specified by name or by passing a callable object from the `tf.keras.losses` module.
* `metrics`: Used to monitor training. These are string names or callables from the `tf.keras.metrics` module.
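
The examples in this tutorial use two loss names: `sparse_categorical_crossentropy` when labels are integer class indices (as in MNIST), and `categorical_crossentropy` when labels are one-hot vectors (as produced by `tf.keras.utils.to_categorical`). The plain-Python sketch below illustrates that both compute the same quantity for a single example. The functions are simplified stand-ins written for this illustration, not the Keras implementations (which also handle batches and numerical clipping), and the probabilities are assumed values.

```python
import math


def to_one_hot(label, num_classes):
    """One-hot encoding of an integer class index."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_crossentropy(one_hot, predicted):
    """Cross entropy when the expected label is a one-hot vector."""
    return -sum(y * math.log(q) for y, q in zip(one_hot, predicted) if y > 0)


def sparse_categorical_crossentropy(label, predicted):
    """Cross entropy when the expected label is an integer class index."""
    return -math.log(predicted[label])


predicted = [0.1, 0.7, 0.2]  # assumed softmax output for 3 classes
label = 1                    # integer label, as in MNIST's y_train

print(to_one_hot(label, 3))                                       # [0.0, 1.0, 0.0]
print(sparse_categorical_crossentropy(label, predicted))          # -log(0.7)
print(categorical_crossentropy(to_one_hot(label, 3), predicted))  # same value
```

The choice between the two is therefore dictated by the format of the labels, not by the task.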

#### Model Training

In [None]:
mnist_model.fit(x_train, y_train, epochs=5)

From the documentation:

`tf.keras.Model.fit` takes three important arguments:
* `epochs`: Training is structured into epochs. An epoch is one iteration over the entire input data (this is done in smaller batches).
* `batch_size`: When passed NumPy data, the model slices the data into smaller batches and iterates over these batches during training. This integer specifies the size of each batch. Be aware that the last batch may be smaller if the total number of samples is not divisible by the batch size.
* `validation_data`: When prototyping a model, you want to easily monitor its performance on some validation data. Passing this argument—a tuple of inputs and labels—allows the model to display the loss and metrics in inference mode for the passed data, at the end of each epoch.
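
The relation between `epochs`, `batch_size` and the number of steps shown in the progress bar can be checked with a little arithmetic. The helper names below are assumptions introduced for this sketch. It relies on the fact that `fit` uses a batch size of 32 when none is given for NumPy inputs, which is consistent with the `1875` steps per epoch displayed for the 60,000 MNIST training images.

```python
import math


def batches_per_epoch(num_samples, batch_size=32):
    """Number of batches needed to go through the whole dataset once."""
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples, batch_size=32):
    """Size of the final batch, smaller when the division is not exact."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


print(batches_per_epoch(60000))      # 1875 steps per epoch for MNIST
print(last_batch_size(60000))        # 32: the division is exact
print(batches_per_epoch(10000, 32))  # 313
print(last_batch_size(10000, 32))    # 16: the last batch is smaller
```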

Below is an example using validation data.
Note also the interaction with NumPy. 

In [None]:
import numpy as np

import tensorflow as tf
from tensorflow.keras import layers


# parameters
size_dataset = 200
input_size = 10

size_hidden_layer_1 = 64
size_hidden_layer_2 = 128

# Fixed parameters
nb_categories = 3

model = tf.keras.Sequential([  
  # Adds a densely-connected layer
  layers.Dense(size_hidden_layer_1, activation='relu', input_shape=(input_size,)),

  # Add another densely-connected layer
  layers.Dense(size_hidden_layer_2, activation='relu'),
    
  # Add a softmax layer with nb_categories output units:
  layers.Dense(nb_categories, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

# Data generating functions
def label(x): return 0 if x[0] > 20 else 1 if x[0] > x[-1] else 2
def get_labeled_data(size):
    d, l = [], []
    for x in np.random.randint(100,size=(size, input_size)):
        l.append(label(x))
        d.append(x)
    return np.array(d), tf.keras.utils.to_categorical(np.array(l))
    
data_train, labels_train = get_labeled_data(10000)
data_valid, labels_valid = get_labeled_data(1000)
data_test, labels_test = get_labeled_data(1000)

model.fit(data_train, labels_train, epochs=20, batch_size=32, validation_data=(data_valid, labels_valid))

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_21 (Dense)             (None, 64)                704       
_________________________________________________________________
dense_22 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_23 (Dense)             (None, 3)                 387       
Total params: 9,411
Trainable params: 9,411
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fed7f67fc10>

The `tf.keras.Model.evaluate` and `tf.keras.Model.predict` methods accept both NumPy data and a `tf.data.Dataset`.

#### Prediction

In [None]:
result = model.predict(data_test, batch_size=32)

for i in range(0,10):
    print(f"data {i}: {data_test[i]}")
    print(f"probabilities {result[i]} expected: {labels_test[i]}")

data 0: [ 9 61 77 94 93 75 23 58 27 38]
probabilities [2.8970567e-07 5.7150908e-07 9.9999917e-01] expected: [0. 0. 1.]
data 1: [63 58 39 10 13 65 71 50 17 63]
probabilities [1.0000000e+00 6.2825308e-14 1.7991006e-17] expected: [1. 0. 0.]
data 2: [43 82 86 33 54 39 38 72  6 35]
probabilities [9.9999893e-01 1.0941762e-06 4.3584276e-08] expected: [1. 0. 0.]
data 3: [17 95 51 28 94 66 38 87  4 20]
probabilities [0.03390077 0.9315492  0.03454999] expected: [0. 0. 1.]
data 4: [42 60 66 60 98 95  7 82 49 75]
probabilities [9.9999988e-01 2.1625024e-09 8.8790671e-08] expected: [1. 0. 0.]
data 5: [47 33 38 33 69 83 62 77 95 94]
probabilities [1.0000000e+00 7.7543542e-14 2.2384858e-11] expected: [1. 0. 0.]
data 6: [ 1 81 82 68 19 24 98 39 12 64]
probabilities [1.5457119e-10 2.5052466e-10 1.0000000e+00] expected: [0. 0. 1.]
data 7: [89 23  4  2 51 10 86  7 73 96]
probabilities [1.0000000e+00 2.1315787e-22 1.6678792e-26] expected: [1. 0. 0.]
data 8: [56 37 29 25  8 57  0 21  2  7]
probabilities [1.

#### Saving a model

The entire model can be saved to a file that contains the weight values, the model's configuration, and even the optimizer's configuration. This allows you to checkpoint a model and resume training later—from the exact same state—without access to the original code.

More information about saving and loading a model: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb

Note that you can save versions of your model while training it (using the callbacks introduced below).

In [None]:
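import os

# Make sure the target directory exists before saving
os.makedirs('./tmp', exist_ok=True)

# Save the entire model to an HDF5 file
mnist_model.save('./tmp/mnist_model.h5')

# Recreate the exact same model, including weights and optimizer.
model = tf.keras.models.load_model('./tmp/mnist_model.h5')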

## Additional info

**Callbacks**

A callback is an object passed to a model to customize and extend its behavior during training. You can write your own custom callback, or use the built-in `tf.keras.callbacks` that include:
* `tf.keras.callbacks.ModelCheckpoint`: Save checkpoints of your model at regular intervals.
* `tf.keras.callbacks.LearningRateScheduler`: Dynamically change the learning rate.
* `tf.keras.callbacks.EarlyStopping`: Interrupt training when validation performance has stopped improving.
* `tf.keras.callbacks.TensorBoard`: Monitor the model's behavior using [TensorBoard](https://www.tensorflow.org/tensorboard), a tool that can be used to ease the development of models.

To use a `tf.keras.callbacks.Callback`, pass it to the model's fit method:

In [None]:
from datetime import datetime
log_dir = "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")

callbacks = [
  # Interrupt training if the training `loss` stops improving for 2 consecutive epochs
  tf.keras.callbacks.EarlyStopping(patience=2, monitor='loss'),
  # Write TensorBoard logs to a timestamped subdirectory of `./logs/fit`
  tf.keras.callbacks.TensorBoard(log_dir=log_dir)
]

# Train for at most 50 epochs (early stopping may interrupt training sooner)
mnist_model.fit(x_train, y_train, batch_size=50, epochs=50, callbacks=callbacks)

print("\nEvaluation")
mnist_model.evaluate(x_test, y_test)
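
To make the role of `patience` concrete, here is a simplified plain-Python sketch of the early stopping rule. It is an illustration under stated assumptions, not the Keras implementation: the function name is hypothetical, the loss values are made up, and options of the real callback such as `min_delta` or `restore_best_weights` are ignored.

```python
def early_stopping_epoch(losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the monitored loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up loss values, one per epoch
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.30]

print(early_stopping_epoch(losses, patience=2))  # 5
print(early_stopping_epoch(losses, patience=3))  # None
```

With `patience=2`, training stops at epoch 5 and the better loss of epoch 6 is never reached. This is the trade-off to keep in mind when choosing `patience`.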

**TensorBoard**

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs

**Functional syntax**

Note that you can also use a functional syntax to express your networks:

In [None]:
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# parameters
input_size = 10
size_hidden_layer_1 = 64
size_hidden_layer_2 = 128

# Fixed parameters
nb_categories = 3


# This returns a tensor
inputs = Input(shape=(input_size,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(size_hidden_layer_1, activation='relu')(inputs)
x = Dense(size_hidden_layer_2, activation='relu')(x)

predictions = Dense(nb_categories, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()
model.fit(data_train, labels_train, epochs=20, batch_size=32, validation_data=(data_valid, labels_valid))

result = model.predict(data_test, batch_size=32)
print("result[0]",str(result[0]))

You can easily make quick tests using different architectures:

In [None]:
import numpy as np
import tensorflow as tf  # needed for tf.keras.utils.to_categorical below
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Data generating functions
def label(x): return 0 if x[0] > 20 else 1 if x[0] > x[-1] else 2
def get_labeled_data(size, input_size):
    d, l = [], []
    for x in np.random.randint(100,size=(size, input_size)):
        l.append(label(x))
        d.append(x)
    return np.array(d), tf.keras.utils.to_categorical(np.array(l))

def addLayers(src_layer, layers_desc):
    x = src_layer
    for layer_type, size, activation in layers_desc:
        layer_class = getattr(layers, layer_type.capitalize())
        x = layer_class(size, activation=activation.lower())(x)
    return x

# parameters
input_size = 10
size_hidden_layer_1 = 64
size_hidden_layer_2 = 128

# Fixed parameters
nb_categories = 3

# This returns a tensor
inputs = Input(shape=(input_size,))

layer_descs = [("Dense", 64, "relu"),("Dense", 128, "relu"),("Dense", nb_categories, "softmax")]
layer_descs_xxl = [("Dense", 64, "relu")] + ([("Dense", 128, "relu"),] * 10) + [("Dense", nb_categories, "softmax")]

print(layer_descs_xxl)

predictions = addLayers(inputs, layer_descs)
predictions_xxl = addLayers(inputs, layer_descs_xxl)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_xxl = Model(inputs=inputs, outputs=predictions_xxl)
model_xxl.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model_xxl.summary()

data_train, labels_train = get_labeled_data(10000,input_size)
data_valid, labels_valid = get_labeled_data(1000,input_size)
data_test, labels_test = get_labeled_data(1000,input_size)

model.fit(data_train, labels_train, epochs=20, batch_size=32, validation_data=(data_valid, labels_valid))
model_xxl.fit(data_train, labels_train, epochs=20, batch_size=32, validation_data=(data_valid, labels_valid))

print("*" * 50)
model.evaluate(data_test, labels_test)
model_xxl.evaluate(data_test, labels_test)

[('Dense', 64, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 128, 'relu'), ('Dense', 3, 'softmax')]
Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 10)]              0         
_________________________________________________________________
dense_27 (Dense)             (None, 64)                704       
_________________________________________________________________
dense_28 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_29 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_30 (Dense)             (None

[0.11066389828920364, 0.9750000238418579]

# Exercise 1

* Download the <a href='https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29'>Breast cancer dataset</a> (or any dataset corresponding to a supervised machine learning problem in which inputs are vectors of numerical values).
* Define a neural network that can be tested on this problem. Implement it using Keras.
* Compare several network architectures (modifying the number of layers and specific parameters such as activation functions...).
* Discuss the results.



# Keras for Text Processing

Read https://www.tensorflow.org/tutorials/keras/text_classification and perform the given exercise. 

Read https://www.tensorflow.org/tutorials/keras/text_classification_with_hub
You'll learn about transfer learning and TF Hub.