# Improving my first deep learning model

We will use the MNIST dataset of handwritten digits, which ships with Keras, to build a neural network that identifies the digits. We will then modify the model's parameters to reach better accuracy.


In [9]:
# Prepare the model environment

%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model

print(tf.__version__)

2.1.0


In [0]:
# Load the data
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [17]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)


In [0]:
# Scale the pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]
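
As a small illustration of the two steps above, here is a sketch on a random stand-in batch rather than the real MNIST arrays. The array `fake_images` is hypothetical, and `np.newaxis` is used in place of `tf.newaxis` (both are aliases for `None` in this kind of indexing).

```python
import numpy as np

# Hypothetical stand-in for a few MNIST images: uint8 pixels in [0, 255].
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Dividing a uint8 array by a Python float promotes it to float64.
scaled = fake_images / 255.0

# Append a trailing channels axis: (4, 28, 28) -> (4, 28, 28, 1).
with_channels = scaled[..., np.newaxis]

print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)
print(with_channels.shape)
```

The float64 result is why the datasets built later report `tf.float64` as the image type.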

**Why should the data be shuffled for machine learning tasks?**

The process of training a neural network is to find the minimum value of a loss function $L_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $L$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $L$ is minimized) while $X$ is fixed.

Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $L$ is a surface in a $(P+1)$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $L$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.

But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.

A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $L$ evaluated at a particular value of weights; we'll call it $L_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$; that is, $L_{X_{i+1}}(W_i)$ may take on a very different value from $L_{X_i}(W_i)$, and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training.

To be clear: the shape of $L_{X_{i+1}}$ will, in general, be different from that of $L_{X_i}$. Note that here I am referring to the loss function $L$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches, and therefore loss surfaces, during each epoch).
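
To make the idea of a loss surface that changes with the mini-batch concrete, here is a minimal sketch. It assumes a toy linear regression with a mean squared error loss instead of the network used in this notebook; the data and the helper `mse_loss` are hypothetical. The same weights are evaluated on two different mini-batches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: 8 rows, 2 features.
X = rng.normal(size=(8, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=8)


def mse_loss(w, X_batch, y_batch):
    """Mean squared error of the linear model X_batch @ w."""
    residual = X_batch @ w - y_batch
    return float(np.mean(residual ** 2))


w = np.array([0.5, 0.5])  # one fixed point in weight space

# Evaluate the same weights on two different mini-batches of 4 rows each.
loss_a = mse_loss(w, X[:4], y[:4])
loss_b = mse_loss(w, X[4:], y[4:])
print(loss_a, loss_b)
```

The two printed values differ even though `w` is unchanged, which is the sense in which each mini-batch defines its own surface over the weights.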

One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $L$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
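
The order-invariance of the full-batch loss can be checked numerically. The sketch below again assumes a toy linear model with a mean squared error loss; the data and the helper `mean_loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical full training set and a fixed weight vector.
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
w = rng.normal(size=3)


def mean_loss(w, X, y):
    """Average squared error over all rows of X."""
    return float(np.mean((X @ w - y) ** 2))


# Shuffle rows and labels with the same permutation.
perm = rng.permutation(len(X))
full = mean_loss(w, X, y)
full_shuffled = mean_loss(w, X[perm], y[perm])

# Equal up to floating-point rounding, since averaging is order-independent.
print(np.isclose(full, full_shuffled))
```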

Batch and shuffle the dataset:

In [0]:
train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)

test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)

In [20]:
print(train_ds)
print(test_ds)

<BatchDataset shapes: ((None, 28, 28, 1), (None,)), types: (tf.float64, tf.uint8)>
<BatchDataset shapes: ((None, 28, 28, 1), (None,)), types: (tf.float64, tf.uint8)>
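
The following is a simplified pure-Python model of what a buffered shuffle followed by batching does. It is a sketch and not the `tf.data` implementation: `buffered_shuffle` is a hypothetical helper that fills a fixed-size buffer, then repeatedly emits a random buffered element and replaces it with the next incoming one. With a buffer of 10000 over 60000 images, the first element drawn can therefore only come from the first 10000 images.

```python
import math
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)

# Number of batches of size 32 per epoch (the last one may be smaller).
train_batches = math.ceil(60000 / 32)
test_batches = math.ceil(10000 / 32)
print(train_batches, test_batches)
```

The `None` in the printed dataset shapes is the batch dimension: it is left unspecified because the final batch can hold fewer than 32 examples.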


Build the model:

In [0]:
class MyModel(Model):
  def __init__(self):
    super(MyModel, self).__init__()
    self.conv1 = Conv2D(32, 3, activation='relu')
    self.flatten = Flatten()
    self.d1 = Dense(128, activation='relu')
    self.d2 = Dense(10)

  def call(self, x):
    x = self.conv1(x)
    x = self.flatten(x)
    x = self.d1(x)
    return self.d2(x)

# Create an instance of the model
model = MyModel()
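
To get a feel for the size of this model, the sketch below works out the layer output sizes and parameter counts by hand. It assumes the Keras defaults for `Conv2D` (stride 1, `'valid'` padding) and a 28x28x1 input; the helper `conv2d_output_size` is hypothetical.

```python
def conv2d_output_size(size, kernel, stride=1):
    """Spatial output size of a convolution without padding ('valid')."""
    return (size - kernel) // stride + 1


side = conv2d_output_size(28, 3)      # 28x28 input, 3x3 kernel
conv_params = (3 * 3 * 1 + 1) * 32    # weights + bias for each of 32 filters
flat_units = side * side * 32         # Flatten has no parameters
d1_params = (flat_units + 1) * 128    # Dense(128): weights + biases
d2_params = (128 + 1) * 10            # Dense(10): weights + biases
total_params = conv_params + d1_params + d2_params

print(side, flat_units)
print(conv_params, d1_params, d2_params, total_params)
```

Almost all of the parameters sit in the first dense layer, because it is connected to every one of the flattened convolution outputs.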

Choose an optimizer and loss function for training:

In [0]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

optimizer = tf.keras.optimizers.Adam()
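
For intuition about this loss, here is a NumPy sketch of cross-entropy computed from integer labels and raw logits. It is an illustration under the assumption that the per-example losses are averaged over the batch; the function name `sparse_ce_from_logits` and the sample values are hypothetical.

```python
import numpy as np


def sparse_ce_from_logits(labels, logits):
    """Mean cross-entropy for integer labels and unnormalized logits."""
    # Subtract the row maximum for numerical stability (log-sum-exp trick).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row.
    return float(-log_probs[np.arange(len(labels)), labels].mean())


logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
loss = sparse_ce_from_logits(labels, logits)
print(loss)
```

Passing `from_logits=True` matches the model above, whose last layer is `Dense(10)` without a softmax activation.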

Select metrics to measure the loss and the accuracy of the model. These metrics accumulate values over the batches of an epoch; the training loop below resets them at the start of each epoch and prints the result at the end.

In [0]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')
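
The accumulate, read and reset pattern of these metrics can be sketched in plain Python. `RunningMean` is a hypothetical stand-in written for illustration, not the Keras class.

```python
class RunningMean:
    """Accumulates values and reports their mean until reset."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        """Add one observation, e.g. the loss of one batch."""
        self.total += float(value)
        self.count += 1

    def result(self):
        """Mean of everything seen since the last reset (0.0 if empty)."""
        return self.total / self.count if self.count else 0.0

    def reset(self):
        """Forget all accumulated values, e.g. at the start of an epoch."""
        self.total = 0.0
        self.count = 0


metric = RunningMean()
for batch_loss in [0.9, 0.5, 0.1]:
    metric.update(batch_loss)
epoch_mean = metric.result()
print(epoch_mean)

metric.reset()
print(metric.result())
```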

Use tf.GradientTape to train the model:

In [0]:
@tf.function
def train_step(images, labels):
  with tf.GradientTape() as tape:
    # training=True is only needed if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)
  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_loss(loss)
  train_accuracy(labels, predictions)
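
What a single training step does can be shown without automatic differentiation. The sketch below assumes a much simpler setting than the notebook's: a linear softmax classifier, a gradient derived by hand, and a plain gradient descent update in place of Adam. The data, `loss_and_grad` and the learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 32 rows, 5 features, 3 classes.
X = rng.normal(size=(32, 5))
y = rng.integers(0, 3, size=32)
W = np.zeros((5, 3))


def loss_and_grad(W, X, y):
    """Mean softmax cross-entropy and its gradient with respect to W."""
    logits = X @ W
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    rows = np.arange(len(y))
    loss = -np.log(probs[rows, y]).mean()
    # d(loss)/d(logits) = softmax - one_hot(y), averaged over the batch.
    dlogits = probs.copy()
    dlogits[rows, y] -= 1.0
    grad = X.T @ dlogits / len(y)
    return float(loss), grad


# Forward pass and gradient, then one descent step on the weights.
loss_before, grad = loss_and_grad(W, X, y)
W = W - 0.1 * grad
loss_after, _ = loss_and_grad(W, X, y)
print(loss_before, loss_after)
```

In `train_step`, `tf.GradientTape` plays the role of the hand-written gradient and the optimizer plays the role of the update line.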

Test the model:

In [0]:
@tf.function
def test_step(images, labels):
  # training=False is only needed if there are layers with different
  # behavior during training versus inference (e.g. Dropout).
  predictions = model(images, training=False)
  t_loss = loss_object(labels, predictions)

  test_loss(t_loss)
  test_accuracy(labels, predictions)

In [26]:
EPOCHS = 5

for epoch in range(EPOCHS):
  # Reset the metrics at the start of the next epoch
  train_loss.reset_states()
  train_accuracy.reset_states()
  test_loss.reset_states()
  test_accuracy.reset_states()

  for images, labels in train_ds:
    train_step(images, labels)

  for test_images, test_labels in test_ds:
    test_step(test_images, test_labels)

  template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
  print(template.format(epoch+1,
                        train_loss.result(),
                        train_accuracy.result()*100,
                        test_loss.result(),
                        test_accuracy.result()*100))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 1, Loss: 0.13616277277469635, Accuracy: 95.9183349609375, Test Loss: 0.06895151734352112, Test Accuracy: 97.82999420166016
Epoch 2, Loss: 0.04049573093652725, Accuracy: 98.72000122070312, Test Loss: 0.04830228164792061, Test Accuracy: 98.38999938964844
Epoch 3, Loss: 0.019145004451274872, Accuracy: 99.39666748046875, Test Loss: 0.050554048269987106, Test Accuracy: 98.37999725341797
Epoch 4, Loss: 0.012602318078279495, Accuracy: 99.55833435058594, Test Loss: 0.055954527109861374, Test Accuracy: 98.27999877929688
Epoch 5, Loss: 0.007870126515626907, Accuracy: 99.7316665649414, Test Loss: 0.073118194937706, Test Accuracy: 98.27999877929688
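
Reading the log above, the training loss keeps falling while the test loss stops improving. A small sketch makes this explicit; the values are the printed losses rounded to five decimals.

```python
# Losses copied (rounded) from the five epochs printed above.
train_losses = [0.13616, 0.04050, 0.01915, 0.01260, 0.00787]
test_losses = [0.06895, 0.04830, 0.05055, 0.05595, 0.07312]

# Epoch (1-based) with the lowest test loss.
best_epoch = min(range(len(test_losses)), key=test_losses.__getitem__) + 1

# Does the training loss decrease at every epoch?
train_always_decreasing = all(
    later < earlier for earlier, later in zip(train_losses, train_losses[1:])
)

print(best_epoch, train_always_decreasing)
```

A test loss that rises while the training loss keeps shrinking is a common sign of overfitting.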


Preparing the data

In [0]:
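# First, convert the pixel values from integers to float32 and then scale them so that all values lie between 0 and 1.
# Note: this assumes x_train and x_test hold the raw 0-255 pixels from mnist.load_data();
# reload the data first if it was already scaled in an earlier cell.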
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

x_train /= 255
x_test /= 255

In [0]:
# Now flatten each 28x28 image by concatenating its rows into a single row of 784 values.
# Note: this layout suits a Dense-only model; a Conv2D layer expects inputs of shape (28, 28, 1).
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)

print(x_train.shape)
print(x_test.shape)
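
The reshape concatenates the rows of each image one after another (row-major order) and loses no information. Here is a small sketch on a hypothetical two-image array instead of the full dataset.

```python
import numpy as np

# Hypothetical stand-in for two 28x28 images with distinct pixel values.
images = np.arange(2 * 28 * 28, dtype=np.float32).reshape(2, 28, 28)

# Flatten each image into one row of 784 values.
flat = images.reshape(len(images), 784)

# The flattening is reversible.
restored = flat.reshape(-1, 28, 28)

print(flat.shape)
# Element 28 of the flat row is the first pixel of the image's second row.
print(flat[0, 28] == images[0, 1, 0])
print(np.array_equal(restored, images))
```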

In [0]:
# One-hot encode the integer labels with the Keras utility "to_categorical"
from tensorflow.keras.utils import to_categorical

y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)
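
One-hot encoding turns each integer label into a vector of length 10 with a single 1 at the label's index. The sketch below shows an equivalent NumPy construction on a few hypothetical labels; it is an illustration, not the Keras implementation.

```python
import numpy as np

# Hypothetical integer labels for five digits.
labels = np.array([5, 0, 4, 1, 9])

# Row k of the 10x10 identity matrix is the one-hot vector for class k.
one_hot = np.eye(10, dtype=np.float32)[labels]

print(one_hot.shape)
print(one_hot[0])

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

One-hot labels pair with a loss such as `CategoricalCrossentropy`, whereas `SparseCategoricalCrossentropy`, used earlier in this notebook, expects the integer labels directly.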

Defining the model

In [0]:
# Our model will be sequential: a Conv2D layer with ReLU activation, a Flatten layer that
# collapses the output of the first layer into one dimension (it has no parameters), and finally two Dense layers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Flatten

model = Sequential()
# MNIST images are 28x28 pixels with a single channel
model.add(Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10))

In [0]:
model.summary()