# Data Normalization
We can further improve on this. The image data loaded by the mnist module is in raw format; each image is a 28 × 28 matrix of integer values from 0 to 255. If you were to inspect the parameters (weights and biases) within a trained model, you would find that they are very small numbers, typically from –1 to 1. Generally, when data feeds forward through the layers and the outputs of one layer are matrix-multiplied against the parameters of the next layer, the results are also small numbers.
The problem with our preceding example is that the input values are substantially larger (up to 255), which will produce large numbers initially as they are multiplied through the layers. As a result, the parameters will take longer to learn their optimal values, if they learn them at all.
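
To make the scale problem concrete, here is a minimal NumPy sketch. It is not part of the original example: the random pixel values, the weight range of ±0.05, and the layer width of 512 are assumptions chosen only for illustration. It compares the magnitude of a dense layer's pre-activation outputs for raw and for scaled inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one flattened 28 x 28 image with raw pixel values
# and a dense layer of 512 units whose weights are small numbers.
raw_pixels = rng.integers(0, 256, size=(1, 784)).astype(np.float32)
weights = rng.uniform(-0.05, 0.05, size=(784, 512)).astype(np.float32)

# Pre-activation outputs of the layer for raw and for scaled inputs.
raw_out = raw_pixels @ weights
scaled_out = (raw_pixels / 255.0) @ weights

raw_scale = float(np.abs(raw_out).mean())
scaled_scale = float(np.abs(scaled_out).mean())
print("mean |output| with raw pixels:   ", raw_scale)
print("mean |output| with scaled pixels:", scaled_scale)
```

Because matrix multiplication is linear, the raw-input outputs are about 255 times larger than the scaled ones.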

## Normalization
We can increase the speed at which the parameters learn the optimal values and increase our chances of convergence (discussed subsequently) by squashing the input values into a smaller range.

One simple way to do this is to squash them proportionally into a range from 0 to 1. We can do this by dividing each value by 255.
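
As a small illustration of this proportional squashing, the sketch below divides a hypothetical 2 × 3 patch of 8-bit pixel values by 255. The patch values are made up for the example; only the division itself comes from the text.

```python
import numpy as np

# A hypothetical 2 x 3 patch of raw 8-bit pixel values.
patch = np.array([[0, 51, 102], [153, 204, 255]], dtype=np.uint8)

# Broadcasted division: every element is divided by the maximum value, 255.
normalized = patch / 255.0

print(normalized)
```

The smallest value stays at 0, the largest maps to 1, and everything in between keeps its relative position.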

In the following code, we add the step of normalizing the input data by dividing each pixel value by 255. The load_data() function loads the dataset into memory as NumPy arrays.


By default, NumPy performs floating-point operations in double precision (64 bits), whereas the parameters in a TF.Keras model are single-precision floating-point values (32 bits) by default. For efficiency, as a last step, we convert the result of the broadcasted division to 32 bits by using the NumPy astype() method. If we did not do the conversion, the initial matrix multiplication at the input layer would take roughly double the number of machine cycles (64 × 32 instead of 32 × 32).
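
The precision change can be checked directly in NumPy. The sketch below uses a zero-filled array as a hypothetical stand-in for a batch of raw images, so it runs without downloading the dataset.

```python
import numpy as np

# Hypothetical stand-in for a batch of 100 raw images (uint8, 0 to 255).
images = np.zeros((100, 28, 28), dtype=np.uint8)

# Dividing an integer array by a Python float yields float64.
as_float64 = images / 255.0

# astype() makes a single-precision copy that uses half the memory.
as_float32 = as_float64.astype(np.float32)

print(as_float64.dtype, as_float64.nbytes)
print(as_float32.dtype, as_float32.nbytes)
```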


In [7]:
from keras.layers import Flatten, Dense
from keras import Sequential
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [8]:
# Scale pixel values from [0, 255] into [0, 1], then convert the float64
# result to float32 to match the precision of the model's parameters.
x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)

In [10]:
model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(512, activation="relu"),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"])
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14cccb692b0>

Let’s now evaluate our model by using the evaluate() method on the test (holdout) data to see how well the model will perform on data it has never seen during training. The evaluate() method operates in inference mode: the test data is forward-fed through the model to make predictions, but there is no backward propagation, so the model’s parameters are not updated. Finally, evaluate() will output the loss and overall accuracy:

In [13]:
test_loss, test_acc = model.evaluate(x_test, y_test)
print("test_loss, test_acc: ", test_loss, test_acc)


test_loss, test_acc:  0.10678740590810776 0.9787999987602234
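
What these two numbers measure can be sketched in plain NumPy. The probabilities and labels below are hypothetical, and this is only an illustration of the two metrics, not how Keras computes them internally.

```python
import numpy as np

# Hypothetical softmax outputs for four samples over three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 2, 1])  # true class indices (sparse labels)

# The predicted class is the index of the largest probability.
predicted = np.argmax(probs, axis=1)

# Accuracy: the fraction of predictions that match the labels.
accuracy = float(np.mean(predicted == labels))

# Sparse categorical cross-entropy: the mean negative log-probability
# that the model assigned to the true class of each sample.
true_class_probs = probs[np.arange(len(labels)), labels]
loss = float(-np.mean(np.log(true_class_probs)))

print(accuracy, loss)
```

Three of the four hypothetical predictions are correct, so the accuracy is 0.75; the loss is lower when the model puts more probability on the true class.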


## Standardization

There are a variety of ways to squash the input data beyond the normalization used in the preceding example. For example, some ML practitioners prefer to squash the input values between –1 and 1 (instead of 0 and 1), so that the values are centered at 0. The following code is an example implementation that divides each element by one-half the maximum value (in this example, 127.5) and then subtracts 1 from the result:
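
Before applying this to the full dataset, the arithmetic can be checked on a few hypothetical pixel values. The sample values are an assumption for illustration; the formula is the one described above.

```python
import numpy as np

# Hypothetical raw pixel values spanning the full 8-bit range.
pixels = np.array([0, 51, 102, 153, 204, 255], dtype=np.uint8)

# Divide by half the maximum value (127.5), then subtract 1,
# which maps 0 to -1 and 255 to +1.
centered = pixels / 127.5 - 1.0

print(centered)
```

Without the subtraction, the values would land between 0 and 2 rather than between –1 and 1.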


In [14]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Divide by half the maximum value, then subtract 1 to map [0, 255] onto [-1, 1].
x_train = (x_train / 127.5 - 1.0).astype(np.float32)
x_test = (x_test / 127.5 - 1.0).astype(np.float32)
model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(512, activation="relu"),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"])
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14cc7e3e1f0>

In [15]:
test_loss, test_acc = model.evaluate(x_test, y_test)
print("test_loss, test_acc: ", test_loss, test_acc)

test_loss, test_acc:  0.11948800086975098 0.978600025177002


Does squashing the values between –1 and 1 produce better results than between 0 and 1? I haven’t seen anything in the research literature, or in my own experience, that indicates a difference. This and the previous method don’t require any pre-analysis of the input data, other than knowing the maximum value. Another technique, called **standardization**, is often considered to give a better result. However, it requires a pre-analysis (scan) over the entire input data to **find its mean and standard deviation**. You then **center the data at the mean of the full distribution** by subtracting the mean and **rescale it to unit standard deviation** by dividing by the standard deviation. Unlike the two previous methods, the resulting values are not confined to a fixed range; a value of +/– 1 simply means one standard deviation from the mean.
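
The effect of standardization can be verified on synthetic data. In the sketch below, the random arrays are hypothetical stand-ins for the training and test images. Reusing the training-set mean and standard deviation for the test data is a common practice assumed here, so that both sets pass through exactly the same transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the raw training and test images.
train = rng.integers(0, 256, size=(1000, 28, 28)).astype(np.float64)
test = rng.integers(0, 256, size=(200, 28, 28)).astype(np.float64)

# Pre-analysis: scan the training data for its mean and standard deviation.
mean = train.mean()
std = train.std()

# Center at the mean and rescale to unit standard deviation.
train_standardized = ((train - mean) / std).astype(np.float32)

# Apply the same training-set statistics to the test data.
test_standardized = ((test - mean) / std).astype(np.float32)

print("mean:", train_standardized.mean(), "std:", train_standardized.std())
print("min: ", train_standardized.min(), "max:", train_standardized.max())
```

The standardized training data has a mean near 0 and a standard deviation near 1, while individual values still fall outside the interval from –1 to 1.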

In [16]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
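# Compute the statistics on the training data only.
x_mean = np.mean(x_train)
x_std = np.std(x_train)
x_train = ((x_train - x_mean) / x_std).astype(np.float32)
# Standardize the test data with the same training-set statistics.
x_test = ((x_test - x_mean) / x_std).astype(np.float32)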
model = Sequential([
    Flatten(input_shape=(28,28)),
    Dense(512, activation="relu"),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["acc"])
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14cc7736280>