# Python for deep learning

## Cell 8

```python
%%time
history = model.fit(X, y, validation_split=0.25, epochs=75, batch_size=128, verbose=0)
print("Loss: ", history.history['val_loss'][-1], "Accuracy: ", history.history['val_accuracy'][-1])
```

``%%time`` is a cell magic command. It measures how long the cell takes to execute and displays the result.
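Cell magics only exist inside a notebook. In a plain Python script, a comparable measurement can be sketched with the standard library. The function `slow_sum` below is a hypothetical stand-in for a long-running call such as the training.

```python
import time

def slow_sum(n):
    """Hypothetical stand-in for a long-running call such as model.fit."""
    total = 0
    for i in range(n):
        total += i
    return total

start_wall = time.perf_counter()   # wall-clock time
start_cpu = time.process_time()    # CPU time used by this process
result = slow_sum(100_000)
wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu

print(f"CPU time: {cpu:.4f} s, wall time: {wall:.4f} s")
```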

We start the training by calling the ``fit`` method of the keras model. The first two parameters are the training data. In our case, X contains the intensity values in the neighborhoods of pixels as linear vectors, and y contains the corresponding labels: 0 for a background pixel and 1 for a foreground pixel.
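To make the shape of such data concrete, here is a minimal sketch of how neighborhood vectors could be built with NumPy. It is an assumption about the preprocessing, not the code used in this notebook: `image`, `mask`, `X_demo` and `y_demo` are hypothetical names, the data is random, and border pixels without a full neighborhood are simply skipped.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # needs NumPy >= 1.20

N = 3
rng = np.random.default_rng(0)
image = rng.random((6, 7))            # hypothetical grayscale image
mask = (image > 0.5).astype(int)      # hypothetical label image (0 or 1)

# One N x N window for every pixel that has a complete neighborhood.
patches = sliding_window_view(image, (N, N))
# Flatten each window into a linear vector of N*N intensity values.
X_demo = patches.reshape(-1, N * N)

# The label of each vector is the label of the window's center pixel.
r = N // 2
y_demo = mask[r:-r, r:-r].reshape(-1)

print(X_demo.shape, y_demo.shape)
```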

To examine the keras training, we set up a small example containing 1000 vectors with 9 values each.

In [1]:
import numpy as np
X = np.random.rand(1000, 9)
y = np.random.choice([0, 1], size=(1000,))
print(X[0:9],y[0:9])

[[0.67123105 0.34946902 0.72777989 0.88714441 0.65330903 0.74925807
  0.85460681 0.75065276 0.16080715]
 [0.67983417 0.52101065 0.77887776 0.90919822 0.53536738 0.97849012
  0.47484666 0.99409078 0.07815733]
 [0.31705855 0.22922093 0.33249467 0.62495531 0.94995288 0.49756771
  0.43625841 0.71180915 0.77723611]
 [0.49000266 0.86867663 0.5007947  0.69015263 0.76359736 0.90489905
  0.66953547 0.19135395 0.70661262]
 [0.69081995 0.57872875 0.45899877 0.52840393 0.71000968 0.93035691
  0.71818937 0.55712425 0.40878807]
 [0.42720517 0.99095053 0.66530854 0.90415868 0.00386464 0.46449464
  0.61187421 0.17478905 0.9911985 ]
 [0.65889869 0.19131752 0.03215729 0.69331708 0.85722604 0.32924242
  0.76249565 0.5212685  0.60851102]
 [0.31571892 0.72948761 0.11653683 0.34367257 0.33893792 0.78712985
  0.4869959  0.14850081 0.10583607]
 [0.89475585 0.46455625 0.07460115 0.28714372 0.85455922 0.73060112
  0.79878859 0.47844357 0.8991071 ]] [0 0 0 0 0 0 0 1 1]


We create the same model as in cell 8 again.

In [2]:
from keras.models import Sequential
from keras.layers import Dense
from IPython.display import SVG
N=3
model = Sequential()
model.add(Dense(N*N-1, input_dim=(N*N), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

Using TensorFlow backend.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 8)                 80        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
Total params: 89
Trainable params: 89
Non-trainable params: 0
_________________________________________________________________
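The parameter counts in the summary can be checked by hand: a dense layer has one weight for every input-output pair plus one bias per output unit. The helper `dense_params` is introduced here only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Number of weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

N = 3
hidden = dense_params(N * N, N * N - 1)   # 9 inputs -> 8 units
output = dense_params(N * N - 1, 1)       # 8 inputs -> 1 unit

print(hidden, output, hidden + output)    # 80 9 89
```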


Now we can try to train the model. Will it be possible to learn a model from random data?

In [3]:
%%time
history = model.fit(X, y, validation_split=0.25, epochs=10, batch_size=16, verbose=2)
print("Loss: ", history.history['val_loss'][-1], "Accuracy: ", history.history['val_accuracy'][-1])

Train on 750 samples, validate on 250 samples
Epoch 1/10
 - 0s - loss: 0.6913 - accuracy: 0.5373 - val_loss: 0.6956 - val_accuracy: 0.5160
Epoch 2/10
 - 0s - loss: 0.6909 - accuracy: 0.5320 - val_loss: 0.6951 - val_accuracy: 0.5160
Epoch 3/10
 - 0s - loss: 0.6907 - accuracy: 0.5360 - val_loss: 0.6948 - val_accuracy: 0.5160
Epoch 4/10
 - 0s - loss: 0.6905 - accuracy: 0.5320 - val_loss: 0.6949 - val_accuracy: 0.5160
Epoch 5/10
 - 0s - loss: 0.6900 - accuracy: 0.5347 - val_loss: 0.6947 - val_accuracy: 0.5200
Epoch 6/10
 - 0s - loss: 0.6900 - accuracy: 0.5387 - val_loss: 0.6947 - val_accuracy: 0.5200
Epoch 7/10
 - 0s - loss: 0.6896 - accuracy: 0.5360 - val_loss: 0.6946 - val_accuracy: 0.5160
Epoch 8/10
 - 0s - loss: 0.6896 - accuracy: 0.5347 - val_loss: 0.6946 - val_accuracy: 0.5160
Epoch 9/10
 - 0s - loss: 0.6895 - accuracy: 0.5373 - val_loss: 0.6947 - val_accuracy: 0.5160
Epoch 10/10
 - 0s - loss: 0.6894 - accuracy: 0.5333 - val_loss: 0.6946 - val_accuracy: 0.5160
Loss:  0.69458399915695

We get an accuracy around 0.5, which is chance level for two random classes, so the model does not learn anything.

## verbose

If verbose is zero, no output is displayed during the training. For values of one or bigger we get:

* 1: a progress bar that advances during the epoch, plus the accuracy and loss of the epoch for the training data and the validation data
* 2: same as 1 but without the progress bar
* 3: only the number of the current epoch is displayed

If verbose is bigger than zero, the number of samples in the training set and in the validation set is also displayed, according to the validation_split we have chosen.

## history

The fit method returns a history object that contains the history of the accuracies and losses on the training and validation data. The history object has an attribute history, which is a dictionary. The keys of the dictionary name the variable (val_loss, val_accuracy, loss, accuracy), and the value at each key is the list of the corresponding values, one per epoch.

In [4]:
print(history.history)

{'val_loss': [0.6956153378486634, 0.6950620679855347, 0.6948499026298522, 0.6948830862045288, 0.6946947407722474, 0.6946667695045471, 0.6945951614379883, 0.6945964879989625, 0.6946573438644409, 0.6945839991569519], 'val_accuracy': [0.515999972820282, 0.515999972820282, 0.515999972820282, 0.515999972820282, 0.5199999809265137, 0.5199999809265137, 0.515999972820282, 0.515999972820282, 0.515999972820282, 0.515999972820282], 'loss': [0.6913333868980408, 0.6908786125183105, 0.6906680084864298, 0.6905108768145244, 0.690043892065684, 0.6899571962356568, 0.6896142296791077, 0.6896303159395853, 0.6895077888170879, 0.6893591602643331], 'accuracy': [0.5373333, 0.532, 0.536, 0.532, 0.53466666, 0.53866667, 0.536, 0.53466666, 0.5373333, 0.53333336]}
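Because the dictionary holds plain lists, it can be queried with ordinary Python. The sketch below looks up the epoch with the lowest validation loss and the one with the highest validation accuracy. The numbers are the validation values from the output above, rounded; `history_dict` and `best_epoch` are names introduced for this example.

```python
# Validation values copied from the printed dictionary above (rounded).
history_dict = {
    "val_loss": [0.69561534, 0.69506207, 0.69484990, 0.69488309, 0.69469474,
                 0.69466677, 0.69459516, 0.69459649, 0.69465734, 0.69458400],
    "val_accuracy": [0.516, 0.516, 0.516, 0.516, 0.520,
                     0.520, 0.516, 0.516, 0.516, 0.516],
}

def best_epoch(values, mode):
    """Return the 1-based epoch of the smallest ('min') or largest ('max') value.

    If several epochs tie, the first one is returned.
    """
    pick = min if mode == "min" else max
    best_index = pick(range(len(values)), key=lambda i: values[i])
    return best_index + 1

loss_epoch = best_epoch(history_dict["val_loss"], "min")
acc_epoch = best_epoch(history_dict["val_accuracy"], "max")
print("lowest val_loss at epoch", loss_epoch)
print("highest val_accuracy at epoch", acc_epoch)
```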


In [5]:
%matplotlib inline
import matplotlib.pyplot as plt, mpld3
xValues = np.arange(1., len(history.history['loss'])+1, 1)
plt.subplot(1, 2, 1)
for (key, vector) in history.history.items():
    if key in ['val_accuracy', 'accuracy']:
        plt.plot(xValues, vector, label=key)
plt.legend(loc='center left')
plt.subplot(1, 2, 2)  
for (key, vector) in history.history.items():
    if key in ['val_loss', 'loss']:
        plt.plot(xValues, vector, label=key)
plt.legend(loc='right')
plt.tight_layout(pad=3.0)
mpld3.display()

## Again with 'learnable' data

We will give some structure to the data, so that a network can learn from it. We will just set the label to 0 if the mean of the input is small (at most 0.5) and to 1 if it is big (above 0.5).

In [24]:
labels = np.array([int(round(sum(xv)/len(xv))) for xv in X])
print(X[0:9])
print(labels[0:9])

[[0.67123105 0.34946902 0.72777989 0.88714441 0.65330903 0.74925807
  0.85460681 0.75065276 0.16080715]
 [0.67983417 0.52101065 0.77887776 0.90919822 0.53536738 0.97849012
  0.47484666 0.99409078 0.07815733]
 [0.31705855 0.22922093 0.33249467 0.62495531 0.94995288 0.49756771
  0.43625841 0.71180915 0.77723611]
 [0.49000266 0.86867663 0.5007947  0.69015263 0.76359736 0.90489905
  0.66953547 0.19135395 0.70661262]
 [0.69081995 0.57872875 0.45899877 0.52840393 0.71000968 0.93035691
  0.71818937 0.55712425 0.40878807]
 [0.42720517 0.99095053 0.66530854 0.90415868 0.00386464 0.46449464
  0.61187421 0.17478905 0.9911985 ]
 [0.65889869 0.19131752 0.03215729 0.69331708 0.85722604 0.32924242
  0.76249565 0.5212685  0.60851102]
 [0.31571892 0.72948761 0.11653683 0.34367257 0.33893792 0.78712985
  0.4869959  0.14850081 0.10583607]
 [0.89475585 0.46455625 0.07460115 0.28714372 0.85455922 0.73060112
  0.79878859 0.47844357 0.8991071 ]]
[1 1 1 1 1 1 1 0 1]
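The same labelling rule can also be written without a Python loop. The sketch below uses freshly generated random data (`X_demo` is not the `X` from above) and compares the list comprehension with a vectorized NumPy expression.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.random((1000, 9))

# Rule from the cell above: round the mean of each vector to 0 or 1.
labels_loop = np.array([int(round(sum(xv) / len(xv))) for xv in X_demo])

# Vectorized form: label 1 if the row mean exceeds 0.5, else 0.
labels_vec = (X_demo.mean(axis=1) > 0.5).astype(int)

print("identical:", np.array_equal(labels_loop, labels_vec))
print("fraction of ones:", labels_vec.mean())
```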


In [28]:
N=3
model = Sequential()
model.add(Dense(N*N-1, input_dim=(N*N), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

In [29]:
%%time
history = model.fit(X, labels, validation_split=0.25, epochs=1000, batch_size=256, verbose=0)
print("Loss: ", history.history['val_loss'][-1], "Accuracy: ", history.history['val_accuracy'][-1])

Loss:  0.09990853816270828 Accuracy:  0.9760000109672546
CPU times: user 5.03 s, sys: 366 ms, total: 5.39 s
Wall time: 3.3 s


In [30]:
%matplotlib inline
import matplotlib.pyplot as plt, mpld3
plt.rcParams["figure.figsize"] = (15,5)
xValues = np.arange(1., len(history.history['loss'])+1, 1)
plt.subplot(1, 2, 1)
for (key, vector) in history.history.items():
    if key in ['val_accuracy', 'accuracy']:
        plt.plot(xValues, vector, label=key)
plt.legend(loc='center right')
plt.subplot(1, 2, 2)  
for (key, vector) in history.history.items():
    if key in ['val_loss', 'loss']:
        plt.plot(xValues, vector, label=key)
plt.legend(loc='right')
plt.tight_layout(pad=3.0)
mpld3.display()

## validation_split

validation_split specifies which fraction of the data is held out for validation. validation_split=0.25 means 75% of the data is used for training and 25% for validation.
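According to the Keras documentation, the validation samples are taken from the end of the provided data, before any shuffling. The NumPy sketch below reproduces that kind of split by hand; it illustrates the idea and is not the Keras implementation, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((1000, 9))
y_demo = rng.integers(0, 2, size=1000)

validation_split = 0.25
# Index of the first validation sample.
split_at = int(len(X_demo) * (1.0 - validation_split))

# The first part is used for training, the last part for validation.
X_train, X_val = X_demo[:split_at], X_demo[split_at:]
y_train, y_val = y_demo[:split_at], y_demo[split_at:]

print(len(X_train), len(X_val))   # 750 250
```

One consequence of this behaviour is that data sorted by label should be shuffled before calling fit, otherwise the validation set may contain only one class.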

## epochs

In one epoch the whole training data is used once. Gradient descent modifies the weights of the network to move towards a minimum of the loss function. In the next epoch, starting from the weights reached in the previous epoch, the whole data is used again to move further towards the minimum, and so on.


How many epochs you need depends on the data. If at some point the accuracy no longer increases and the loss no longer decreases, you should stop. If the training accuracy increases but the validation accuracy decreases, you are observing overfitting, which means that the learned function is too specific to the training data and does not generalize to new data points. You should also stop before this happens.
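The stopping rule described above can be sketched in a few lines of plain Python. The function `stopping_epoch`, its `patience` parameter and the loss values are hypothetical and only illustrate the idea. Keras offers an `EarlyStopping` callback with `monitor` and `patience` arguments for the same purpose.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

val_losses = [0.70, 0.60, 0.55, 0.56, 0.58, 0.61]   # hypothetical values
print(stopping_epoch(val_losses, patience=2))        # 5
```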

## batch_size

In each epoch the whole training data is used once. However, the data can be big, too big to fit into the memory of the computer. The training data can therefore be divided into batches, which are fed sequentially into the training. In our example we have a dataset of size 1000. However, we use one fourth for validation, which leaves us with 750 input vectors. If we use a batch size of 64, the optimizer will use the first 64 vectors in the first iteration, the next 64 in the second iteration, and so on. For the last iteration only 46 samples are left, so the batch of the last iteration will be smaller. Our training will be done in ceil(750 / 64) = 12 iterations for each epoch.
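The batch arithmetic for this example can be verified with a short calculation:

```python
import math

n_samples = 750
batch_size = 64

# Number of iterations (batches) per epoch.
n_batches = math.ceil(n_samples / batch_size)

# Size of every batch; only the last one is smaller.
batch_sizes = [min(batch_size, n_samples - start)
               for start in range(0, n_samples, batch_size)]

print(n_batches, batch_sizes[0], batch_sizes[-1])   # 12 64 46
```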

The smaller the batch size, the less memory is needed, but the slower the training will be. Note that it makes a difference whether you are running on GPUs or CPUs: on GPUs the samples of a batch can be processed in parallel, but there is probably less memory available on the GPU.

The batch_size also influences the accuracy we achieve and how fast we achieve it. Why is this? If we pass in all data at once, the gradient is calculated from all the data and the weights are updated once per epoch. If we use mini-batches, a gradient is calculated for each batch and the weights are updated after each batch. Although using the whole data gives a more exact gradient (i.e. the direction in which to go), the many less exact steps of mini-batch training can lead to reaching the optimum faster.
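The effect can be tried out on a toy problem. The sketch below is not the Keras optimizer: it runs plain gradient descent on a noise-free linear regression, assuming (as in standard mini-batch gradient descent) that the weights are updated after every batch. All names, the learning rate and the data are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((750, 9))
w_true = rng.normal(size=9)
y_demo = X_demo @ w_true              # noise-free linear targets

def mse(w):
    """Mean squared error of weights w on the whole toy data set."""
    return float(np.mean((X_demo @ w - y_demo) ** 2))

def train(batch_size, epochs=5, lr=0.1):
    """Gradient descent on the mean squared error, one update per batch."""
    w = np.zeros(9)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(X_demo))   # reshuffle every epoch
        for start in range(0, len(X_demo), batch_size):
            idx = order[start:start + batch_size]
            error = X_demo[idx] @ w - y_demo[idx]
            grad = 2.0 * X_demo[idx].T @ error / len(idx)
            w -= lr * grad
            updates += 1
    return w, updates

w_full, updates_full = train(batch_size=750)   # whole data in one batch
w_mini, updates_mini = train(batch_size=64)    # 12 batches per epoch

print("full batch:", updates_full, "updates, loss", mse(w_full))
print("mini batch:", updates_mini, "updates, loss", mse(w_mini))
```

With the same number of epochs, the mini-batch run performs twelve times as many weight updates, which is why it typically ends with a lower loss on a problem like this one.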

Note that in our examples here we have the whole data loaded into memory from the beginning, so using mini-batches will not reduce the memory footprint. However, this can be done differently by using a generator for the input data that loads the data on demand.
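The on-demand idea can be sketched with a plain Python generator. Here the random data stands in for reading a batch from disk, and `batch_generator` is a hypothetical helper. How such a generator is handed to Keras depends on the version (older releases use `fit_generator`), so the sketch only shows the generator itself.

```python
import numpy as np

def batch_generator(n_samples, batch_size, n_features=9, seed=0):
    """Yield (X_batch, y_batch) pairs, creating each batch only when requested.

    The random data is a stand-in for loading one batch from disk.
    """
    rng = np.random.default_rng(seed)
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        X_batch = rng.random((size, n_features))
        y_batch = (X_batch.mean(axis=1) > 0.5).astype(int)
        yield X_batch, y_batch

# Only one batch exists in memory at any time while iterating.
shapes = [X_batch.shape for X_batch, _ in batch_generator(750, 64)]
print(len(shapes), shapes[0], shapes[-1])
```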

## Testing the model

Since we use the validation data to optimize the meta-parameters, i.e. to tune the settings of our model such as the optimization method, the learning rate, etc., we must make sure that we did not optimize them just for this particular validation data. We therefore need independent data that was never used in training to evaluate the performance of the model.

In [31]:
import numpy as np
# Fresh random inputs that were never seen during training.
text_X = np.random.rand(1000, 9)
# Same labelling rule as for the training data: round the mean to 0 or 1.
test_y = np.array([int(round(sum(xv)/len(xv))) for xv in text_X])
print(text_X[0:10], test_y[0:10])

[[0.43186672 0.04879819 0.01810208 0.627151   0.66007342 0.61528463
  0.49856181 0.72915809 0.30619641]
 [0.54109699 0.94599249 0.72708105 0.53668348 0.9104499  0.97552939
  0.9725896  0.13202726 0.95167962]
 [0.13880843 0.67516891 0.62981764 0.6172652  0.54399528 0.95615911
  0.99486013 0.81217005 0.03797116]
 [0.04750865 0.54775481 0.2326713  0.64968963 0.12517955 0.14802572
  0.32922709 0.61339217 0.76868123]
 [0.43888809 0.25962614 0.16183187 0.10170387 0.12275018 0.99263401
  0.43959251 0.7302682  0.38239831]
 [0.56827225 0.37145455 0.72440862 0.13833485 0.07321749 0.38357806
  0.10076268 0.75276447 0.75871929]
 [0.97340127 0.65321242 0.52900351 0.45563476 0.19219763 0.2037329
  0.89605518 0.30052887 0.99130075]
 [0.28327112 0.2928653  0.79035455 0.48467399 0.17184652 0.37459229
  0.66728695 0.89005619 0.77940259]
 [0.06626257 0.44902758 0.46550016 0.26764627 0.44070851 0.80037043
  0.35957728 0.231035   0.17514572]
 [0.54332993 0.75982139 0.36279616 0.41941454 0.65286571 0.341380

In [38]:
result = model.evaluate(text_X, test_y)
dict(zip(model.metrics_names, result))



{'loss': 0.09963842654228211, 'accuracy': 0.984000027179718}

We see that the accuracy and loss on the test dataset are very similar to the validation accuracy and loss. 
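To see what the two numbers mean, they can be computed by hand from predicted probabilities. The sketch below uses five hypothetical predictions and assumes the usual definitions: binary cross-entropy averaged over the samples, and accuracy with a threshold of 0.5 on the sigmoid output.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1])   # hypothetical sigmoid outputs

# Binary cross-entropy, averaged over the samples.
eps = 1e-7                                      # avoids log(0)
p = np.clip(y_prob, eps, 1 - eps)
loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Accuracy: a prediction counts as class 1 if its probability exceeds 0.5.
y_pred = (y_prob > 0.5).astype(int)
accuracy = np.mean(y_pred == y_true)

print("loss:", loss, "accuracy:", accuracy)
```

In this example the third sample is misclassified (probability 0.4 for a true label of 1), so four of the five predictions are correct.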