# Data Splitting
A dataset is a collection of examples that is large and diverse enough to be representative of the population being modeled (the sampling distribution). When a dataset meets this definition, has been cleaned (is not noisy), and is in a format that’s ready for machine learning training, we refer to it as a **curated dataset**.


A wide variety of curated datasets are available for academic and research purposes. Some of the well-known ones for image classification are MNIST (introduced in chapter 2), CIFAR-10/100, SVHN, Flowers, and Cats vs. Dogs. MNIST and CIFAR-10/100 (Canadian Institute for Advanced Research) are built into the TF.Keras framework. SVHN (Street View House Numbers), Flowers, and Cats vs. Dogs are available with TensorFlow Datasets (TFDS). Throughout this section, we will be using these datasets for tutorial purposes.

Once you have a curated dataset, the next step is to split it into examples that will be used for training and examples that will be used for testing (also called evaluation or holdout). We train the model on the training portion of the dataset. If we assume the training data is a good sampling distribution (representative of the population distribution), the accuracy on the training data should reflect the accuracy the deployed model achieves on real-world examples from the population that it did not see during training. The held-out test data lets us check that assumption.

## Training and test sets
What matters is that we can assume the dataset is large enough that, if we split it 80%/20% with randomly chosen examples, both subsets are good sampling distributions, representative of the population on which the model will make predictions (inference) after it’s deployed. Figure 4.3 illustrates this process.

<img src="img_2.png">
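
The random 80/20 split described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a small toy array standing in for a real curated dataset; the helper name `train_test_split_indices` and the variables `x_tr`, `x_te`, `y_tr`, `y_te` are hypothetical and not part of any library.

```python
import numpy as np

def train_test_split_indices(n_examples, test_fraction=0.2, seed=0):
    """Return shuffled, non-overlapping index arrays (train, test)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_examples)  # random order, each index exactly once
    n_test = int(round(n_examples * test_fraction))
    return indices[n_test:], indices[:n_test]

# Toy stand-in for a curated dataset: 100 examples with 4 features each.
x = np.arange(400, dtype=np.float32).reshape(100, 4)
y = np.arange(100) % 10

train_idx, test_idx = train_test_split_indices(len(x), test_fraction=0.2, seed=42)
x_tr, y_tr = x[train_idx], y[train_idx]
x_te, y_te = x[test_idx], y[test_idx]

print(x_tr.shape, x_te.shape)  # (80, 4) (20, 4)
```

Shuffling the indices once and slicing them keeps each example in exactly one subset, and indexing `x` and `y` with the same indices keeps every example paired with its label.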

In [31]:
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)



## One-hot encoding
Let’s build a simple DNN to train on our curated dataset. In the next code example, we start by flattening the 28 × 28 image input into a 1D vector by using the Flatten layer, which is followed by two hidden Dense layers of 512 nodes each, both using the conventional relu activation function. Finally, the output layer is a Dense layer with 10 nodes, one for each digit. Since this is a multiclass classifier, the activation function for the output layer is a softmax. Next, we compile the model following the convention for multiclass classifiers, using categorical_crossentropy for the loss and adam for the optimizer:



In [20]:
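# Import from tensorflow.keras, consistent with the other cells in this section
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Sequential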

In [32]:
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation="relu"),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])

In [23]:
# Running this raises an error: the labels are integers, but
# categorical_crossentropy expects one-hot-encoded label vectors.
model.fit(x_train, y_train)

ValueError: in user code:

    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\training.py", line 860, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\training.py", line 918, in compute_loss
        return self.compiled_loss(
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\engine\compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\losses.py", line 1789, in categorical_crossentropy
        return backend.categorical_crossentropy(
    File "C:\Users\HOME\anaconda3\lib\site-packages\keras\backend.py", line 5083, in categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)

    ValueError: Shapes (32, 1) and (32, 10) are incompatible


In [33]:
from tensorflow.keras.utils import to_categorical
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
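
The shape mismatch in the error above arises because each label is a single integer, while the softmax output has 10 values per example. One-hot encoding turns each integer label into a vector of length 10 with a single 1 at the label's position. The following is a minimal NumPy sketch of that idea, not the implementation of to_categorical; the helper name `one_hot` and the three sample labels are illustrative assumptions.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels of shape (n,) to one-hot rows of shape (n, num_classes)."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0  # a single 1 per row
    return encoded

# Three example digit labels, chosen only for illustration.
sample_labels = np.array([5, 0, 4])
sample_encoded = one_hot(sample_labels, num_classes=10)

print(sample_encoded.shape)  # (3, 10)
print(sample_encoded[0])     # 1.0 at index 5, zeros elsewhere
```

Taking the argmax of each row recovers the original integer label, so no information is lost by the encoding.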

In [34]:
model.fit(x_train, y_train)



<keras.callbacks.History at 0x2535f9be940>

**The accuracy on the training data is just over 90%.**

<img src="img_3.png">

That works, and we got just over 90% accuracy on the training data, but we can simplify this step. Keras can handle the one-hot encoding for us inside the loss function. To enable this, we just change the loss passed to compile() from categorical_crossentropy to sparse_categorical_crossentropy.

In this mode, the loss function receives the labels as scalar (integer) values and handles them as if they were one-hot encoded when performing the cross-entropy loss calculation.
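
To see why the two losses agree, note that with a one-hot label only the true class contributes to the cross-entropy sum, so the loss reduces to the negative log of the probability predicted for that class. The sketch below checks this with NumPy on made-up softmax outputs; it illustrates the arithmetic only and is not the Keras implementation. All values and variable names are assumptions.

```python
import numpy as np

# Hypothetical softmax outputs for 2 examples over 3 classes (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
int_labels = np.array([0, 2])           # scalar (sparse) labels
one_hot_labels = np.eye(3)[int_labels]  # the same labels, one-hot encoded

# Categorical form: sum over classes of -y * log(p).
categorical = -np.sum(one_hot_labels * np.log(probs), axis=1)

# Sparse form: look up the predicted probability of the true class directly.
sparse = -np.log(probs[np.arange(len(int_labels)), int_labels])

print(np.allclose(categorical, sparse))  # True
```

The sparse form also avoids building an array of one-hot labels, which saves memory when the number of classes is large.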


In [37]:
model2 = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(512, activation="relu"),
    Dense(512, activation="relu"),
    Dense(10, activation="softmax")
])

# Compile the new model (not the earlier one) with the sparse loss
model2.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])

# Reload MNIST so the labels are the original integer class IDs, not one-hot vectors
(x_train2, y_train2), (x_test2, y_test2) = mnist.load_data()

model2.fit(x_train2, y_train2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2536a258910>